
Experience Optimising Matlab Code with OpenCL (NVidia GPGPU)

From: andrew cooke <andrew@...>

Date: Thu, 17 Dec 2009 18:07:21 -0300

I just got some solid results on work I have been doing, here and
there, to optimize calculations originally written in Matlab.  While I
can't give full details, I hope I can give enough to be useful to
others in a similar position.

To motivate the rest of this, I'll say up front that it has been a big
success (IMHO).  The "core calculation" is now 77x faster than the
original Matlab code, using only a relatively cheap graphics card
(NVidia 9800 GT).  We bought that card for proof of concept but since
our processing time is now dominated by other factors it's not clear
we need anything better.


One of the details I cannot describe is the actual calculation, except
to say that at first sight it looks "trivially parallelizable": it's a
bunch of nested loops that read from a huge data structure.  However,
on closer inspection the inner loop contains some indirection which
makes it unlikely that reads will coalesce (in simple terms, when
adjacent cores in the GPU read adjacent addresses in memory, the data
can be fetched in one efficient stream; that is unlikely to happen for
us).
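
To make that concrete, here is a sketch of the access pattern in
OpenCL C (with invented names - the real calculation is different):

  /* Hypothetical sketch only.  The problem is the indirected read:
     thread i reads data[idx[i]] rather than data[i], so adjacent
     threads touch scattered addresses and the reads cannot be
     combined into a single memory transaction. */
  __kernel void inner(__global const float *data,
                      __global const int *idx,
                      __global float *out)
  {
      int i = get_global_id(0);
      out[i] = data[idx[i]];   /* data[i] would coalesce; this won't */
  }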


I did the optimisation in three stages.

First, I tried vectorizing the code within Matlab.  My understanding
was that while explicit sequences of Matlab commands run in a single
thread, individual commands may spawn work onto other cores (and we
have a nice multi-core Xeon box).  This did not work.  The issue may
be related to the exact structure of our problem (the indirection
described above), or perhaps we needed to enable something in Matlab?
Whatever the cause, it was a big disappointment - it seems to me
Matlab should be smarter than that, given the current rise of
parallelism.

Second, I rewrote the inner loop in C.  No amazingly cool
optimisations, just the usual care over data layout etc.  This gave a
speedup of a factor of 6.  At this point it's probably worth
mentioning that I did most of the development work with Octave -
http://www.gnu.org/software/octave/ - to avoid inconveniences related
to Matlab licensing, geography, etc.  To get full compatibility I
needed to build the latest version from their VCS, but otherwise it
worked out nicely.  Even the C interface is the same (Octave supports
the Matlab mex file conventions).  One warning though - don't expect
to get useful information on timings: you need to measure performance
changes in Matlab itself.
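
In case it's useful, the shape of a mex wrapper is roughly this (a
minimal sketch with invented names; build with "mex" in Matlab or
"mkoctfile --mex" in Octave):

  #include "mex.h"

  /* Minimal sketch of a mex wrapper, y = inner_loop(x) (names
     invented).  The real code replaces the loop body. */
  void mexFunction(int nlhs, mxArray *plhs[],
                   int nrhs, const mxArray *prhs[])
  {
      mwSize i, n;
      double *x, *y;
      if (nrhs != 1 || !mxIsDouble(prhs[0]))
          mexErrMsgTxt("expected one double array");
      n = mxGetNumberOfElements(prhs[0]);
      x = mxGetPr(prhs[0]);
      plhs[0] = mxCreateDoubleMatrix(n, 1, mxREAL);
      y = mxGetPr(plhs[0]);
      for (i = 0; i < n; ++i)
          y[i] = 2 * x[i];   /* stand-in for the real inner loop */
  }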

Third, I moved that code to OpenCL on the GPU.  This gave a speedup
of 77x (13x relative to C).  I don't have more details yet, but I
assume we have so many threads that the latency of the incoherent
reads is amortized quite nicely.  Again I did most of the development
on a different machine (with an 8400 GPU).  And again this worked
nicely - most of the effort was simply getting it to work at all (ie
not segfault); the numbers above are for code that has not been
profiled or optimised in any way....
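
For anyone new to OpenCL, the host side has a fixed shape; a sketch
(with an invented kernel, and no error checking):

  #include <stdio.h>
  #include <CL/cl.h>

  /* Sketch of the host-side flow: platform -> device -> context ->
     queue -> program -> kernel -> buffer -> enqueue -> read back.
     Real code checks every error code. */
  int main(void)
  {
      cl_platform_id platform;
      cl_device_id device;
      cl_int err;
      float data[1024];
      size_t global = 1024, local = 64;  /* see work group notes below */
      int i;

      for (i = 0; i < 1024; ++i) data[i] = (float)i;

      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
      cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
      cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

      const char *src =
          "__kernel void twice(__global float *x) {"
          "  int i = get_global_id(0); x[i] *= 2.0f; }";
      cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
      clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
      cl_kernel k = clCreateKernel(prog, "twice", &err);

      cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                  sizeof(data), data, &err);
      clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
      clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
      clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
      printf("%f\n", data[1]);   /* 2.0 */
      return 0;
  }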


What more can I say?  Various random details:

- OpenCL (with NVidia drivers) has been completely solid.  The
documentation is good too.  It's a bit of a pain to debug (all the joy
of C without print statements) and it took me some time to understand
the exact relationship between OpenCL and CUDA concepts, but the gain
has clearly outweighed the pain.

- I spent much more time worrying about memory accesses than I did
about anything else.  As I said above, I still haven't profiled this,
so I may have been prematurely optimising (or at least worrying
without cause, since the results of my worrying didn't significantly
change the code), but it seems to be a general rule that GPU
processing is limited by memory reads (and that this will get worse as
the number of cores increases).

- The point above is explained quite nicely in
http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingOverview.pdf
- basically, unlike a CPU, a GPU has no cache.  When it is stalled
waiting for memory it simply switches to another thread.  This is why
it helps to have big loops.

- If you don't specify the work group size in the OpenCL EnqueueND...
call, then it appears to be set to 1!  This is absolutely crazy, if I
understand correctly, because it means that only one thread in a warp
is running (basically you want this number to be at least 32).  Having
said that, the 9800 appeared to handle this much better than the
8400, so perhaps more recent hardware is more flexible?

- Calculating a work group size automatically is non-trivial, because
it has to be a divisor of the total number of work items.  As far as
I can see there's no better way than factorising the total number and
then trying each combination of factors to see which gets closest to
(but does not exceed) 512.  In practice it's best to expose this in
your API; as a fallback I also have an "auto" option that starts with
the total number and throws away small factors until the number is
low enough - not brilliant, but simple.  There may be a greedy
algorithm to do this, but proof of concept code in C isn't the time
or place... and it's not Euclid's GCD either!
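
For what it's worth, the simple search is something like this (a
sketch in plain C):

  #include <stddef.h>

  /* Sketch of the divisor search: the largest divisor of total that
     does not exceed max_local (512 here, though it's better to query
     CL_DEVICE_MAX_WORK_GROUP_SIZE).  The local size passed to
     clEnqueueNDRangeKernel must divide the global size exactly,
     hence the divisor requirement. */
  static size_t work_group_size(size_t total, size_t max_local)
  {
      size_t best = 1, d, e;
      for (d = 1; d * d <= total; ++d) {
          if (total % d != 0) continue;
          e = total / d;
          if (d <= max_local && d > best) best = d;
          if (e <= max_local && e > best) best = e;
      }
      return best;
  }

  /* ... then, instead of passing NULL for the local size:
       size_t local = work_group_size(global, 512);
       clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                              &global, &local, 0, NULL, NULL);   */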

- If you have large datasets, you need to worry about memory limits.
Current technology is 32 bit, which means there's a limit of 4GB.
And frustratingly the memory is in 4 separate banks (as far as I can
tell you cannot allocate more than 1/4 of the total memory on any
NVidia GPU in a single block) - you can explicitly jump between
blocks in the kernel (ie check the offset and select the array
accordingly), but it makes life messy.  Thankfully the next
generation (Fermi, out early 2010 I believe) has 64 bit addresses,
which should make this all go away (it also has some kind of L2-like
cache, which is interesting...).
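
The block-jumping looks roughly like this inside the kernel (a
sketch with invented names; this is the mess referred to above):

  /* Sketch (OpenCL C): one logical array split across four buffers
     because no single allocation can exceed 1/4 of the card's
     memory.  CHUNK is the number of floats per buffer (eg passed
     with -D CHUNK=... when building the program); the branches pick
     the buffer for a given index. */
  float read_split(__global const float *a, __global const float *b,
                   __global const float *c, __global const float *d,
                   size_t i)
  {
      if (i < CHUNK)          return a[i];
      else if (i < 2 * CHUNK) return b[i - CHUNK];
      else if (i < 3 * CHUNK) return c[i - 2 * CHUNK];
      else                    return d[i - 3 * CHUNK];
  }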

- I hope to profile what I have and perhaps implement a better kernel
(I think we could exploit local memory...), but the initial results
were so good that there's little incentive to spend more time
improving what we have!

- When you're interfacing OpenCL / C with Matlab it helps to pass a
pointer back into Matlab (so that you can have persistent state
across calls).  The standard hack for doing this really does seem to
be to store the pointer in a 1x1 matrix holding a 64 bit int on the
Matlab side.  I found some discussion in the Matlab support forums,
and it worked for me.
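
Concretely, the hack is something like this (a C sketch - the struct
contents are invented):

  #include <stdint.h>
  #include "mex.h"

  /* Sketch of the singleton-matrix hack.  The persistent state
     (context, queue, buffers...) lives in a heap-allocated struct;
     its address goes back to Matlab in a 1x1 uint64 matrix and is
     recovered on later calls. */
  typedef struct { int dummy; /* cl_context, cl_command_queue... */ } state_t;

  void mexFunction(int nlhs, mxArray *plhs[],
                   int nrhs, const mxArray *prhs[])
  {
      if (nrhs == 0) {                 /* "init" call */
          state_t *s = mxMalloc(sizeof(state_t));
          mexMakeMemoryPersistent(s);  /* survives across calls */
          plhs[0] = mxCreateNumericMatrix(1, 1, mxUINT64_CLASS, mxREAL);
          *(uint64_t *)mxGetData(plhs[0]) = (uint64_t)(uintptr_t)s;
      } else {                         /* later calls pass the handle */
          state_t *s = (state_t *)(uintptr_t)
                           *(uint64_t *)mxGetData(prhs[0]);
          /* ... use s ... */
      }
  }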

- OpenCL (at least on the early/cheap hardware) is 32 bit, while
Matlab uses doubles and longs.  So you need to convert your Matlab
data with single() and int32().  You also need to worry about the
loss of precision, etc.
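
On the C side it's worth checking for this explicitly, since a
double array reinterpreted as floats is silent garbage; a sketch:

  #include "mex.h"

  /* Helpers used inside mexFunction: fail fast if the caller
     forgot single()/int32(). */
  static float *get_single(const mxArray *a)
  {
      if (!mxIsSingle(a))
          mexErrMsgTxt("expected single() - convert in Matlab first");
      return (float *)mxGetData(a);
  }

  static int *get_int32(const mxArray *a)
  {
      if (!mxIsInt32(a))
          mexErrMsgTxt("expected int32() - convert in Matlab first");
      return (int *)mxGetData(a);
  }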

- See also http://www.acooke.org/cute/CallingOpe0.html

Andrew

Further Optimisation with OpenCL

From: andrew cooke <andrew@...>

Date: Fri, 1 Jan 2010 09:53:24 -0300

I had time to look at optimising the OpenCL code yesterday.  I tried
two approaches, and both gave significant gains, but both also have
limitations.

First, I tried changing an array used to store intermediate data from
global to local memory.  This means that instead of storing the data
in the GPU's main memory, it is stored in an on-chip cache (and so is
much faster to access).  The amount of space is limited, so I also
had to restructure the program to calculate the absolute minimum
amount of data (which meant swapping the order of two nested loops).

This worked - gave roughly a factor of 1.5 speedup on a simple test -
but the amount of memory required was too much for most use cases (the
cache is just 16kB, and even on Fermi will be only 48kB).
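
The change looks something like this (a sketch - the real
intermediate calculation is different):

  /* Sketch (OpenCL C): intermediate values in __local (on-chip)
     memory rather than __global.  Each work group gets its own
     scratch array - allocated from the host with
     clSetKernelArg(kernel, 2, bytes, NULL) - and the barrier makes
     the writes visible to the whole group.  With 16kB of local
     memory, scratch is limited to a few thousand floats. */
  __kernel void stage(__global const float *in,
                      __global float *out,
                      __local float *scratch)
  {
      int l = get_local_id(0);
      int g = get_global_id(0);
      scratch[l] = in[g] * in[g];      /* stand-in intermediate */
      barrier(CLK_LOCAL_MEM_FENCE);
      out[g] = scratch[l];             /* real code combines values
                                          across the work group */
  }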

Second, I tried changing an array to a 2D image.  As I've described
elsewhere, the main difficulty we have is that the inner operation of
our loops contains a level of indirection - we need to read data from
an array, but not in a completely uniform, methodical manner.  This
means that access to that data is fairly slow.

An image is the abstraction OpenCL uses for texture maps.  I am
unsure exactly how NVidia implements these, but the end result is
that there is some kind of local cache that improves access times
when arrays are read in a non-uniform way (an obvious example is the
kernel for a direct convolution; I guess from the name that
originally they held texture images mapped onto 3D scenes).

Obviously this is fairly risky - the image (in practice I actually
needed to use multiple images to work around size limitations) is
much bigger than any cache memory, so whether a speedup is seen will
depend on the access pattern (you need bunched, repeated accesses).
Luckily, this fits nicely with our problem, and gave a doubling of
speed on our worst performing dataset :o)
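
For reference, the two halves of that change look roughly like this
(a sketch with invented names and sizes):

  /* Host side (C): wrap the data in a 2D image so that reads go
     through the texture cache.  CL_R/CL_FLOAT means one float per
     pixel; clCreateImage2D is the OpenCL 1.0/1.1 call. */
  cl_image_format fmt = { CL_R, CL_FLOAT };
  cl_int err;
  cl_mem img = clCreateImage2D(ctx,
                               CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               &fmt, width, height, 0, host_data, &err);

  /* Kernel side: reads go through a sampler, with integer pixel
     coordinates and no filtering (WIDTH via -D WIDTH=...). */
  __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                             CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

  __kernel void gather(__read_only image2d_t img, __global float *out)
  {
      int i = get_global_id(0);
      int2 pos = (int2)(i % WIDTH, i / WIDTH);  /* indirected in real code */
      out[i] = read_imagef(img, smp, pos).x;
  }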

This work took a day, so the payoff is significant - twice the speed
for a day's work was definitely worth the investigation.

(I should admit here that I still haven't used NVidia's profiler, but
I am increasingly confident in my mental model of how the code/GPU are
working together.)

Andrew
