Experience Optimising Matlab Code with OpenCL (NVidia GPGPU)

From: andrew cooke <andrew@...>

Date: Thu, 17 Dec 2009 18:07:21 -0300

I just got some solid results on work I have been doing, here and
there, to optimize calculations originally written in Matlab.  While I
can't give full details, I hope I can give enough to be useful to
others in a similar position.

To motivate the rest of this, I'll say up front that it has been a big
success (IMHO).  The "core calculation" is now 77x faster than the
original Matlab code, using only a relatively cheap graphics card
(NVidia 9800 GT).  We bought that card for proof of concept but since
our processing time is now dominated by other factors it's not clear
we need anything better.

One of the details I cannot describe is the actual calculation, except
to say that at first sight it looks like it would be "trivially
parallelizable" because it's a bunch of nested loops that read from a
huge data structure.  However, on second inspection the inner loop has
some indirection that means it's unlikely that reads will coalesce (in
memory it can be read in an efficient stream; this is unlikely for
us).

I did the optimisation in three stages.

First, I tried vectorizing the code within Matlab.  My understanding
was that while explicit sequences on Matlab commands run in a single
thread, individual commands may spawn work to other cores (and we have
a nice multi core Xeon box).  This did not work.  The issue may be
related to the exact structure of our problem (the indirection
described above), or perhaps we need to enable something with Matlab?
Whatever the cause, it was a big disappointment - seems to me Matlab
should be smarter than that, given the current rise of parallelism.

Second, I rewrote the inner loop in C.  No amazingly cool
optimisations, just the usual care over data layout etc.  This gave a
speedup of a factor of 6.  At this point it's probably worth
mentioning that I did most of the development work with Octave -
http://www.gnu.org/software/octave/ - to avoid inconveniences related
to Matlab licensing, geography, etc.  To get full compatibility I
needed to build the latest version from their VCS, but otherwise it
worked out nicely.  Even the C interface is the same (Octave support
the Matlab mex file conventions).  One warning though - don't expect
to get useful information on timings: you need to measure performance
changes on Matlab.

Third, I then moved that to OpenCL on the GPU.  This gave a speedup of
77 (13 relative to C).  I don't have more details yet, but I assume we
have so many threads that the latency of the incoherent reads is
amortized quite nicely.  Again I did most of the development on a
different machine (with an 8400 GPU).  And again this worked nicely -
most of the effort was simply getting it to work at all (ie not
segfault); the numbers above are for code that has not been profiled
or optimised in any way....

What more can I say?  Various random details:

- OpenCL (with NVidia drivers) has been completely solid.  The
documentation is good too.  It's a bit of a pain to debug (all the joy
of C without print statements) and it took me some time to understand
the exact relationship between OpenCL and CUDA concepts, but the gain
has clearly outweighed the pain.

- I spent much more time worrying about memory accesses than I did
about anything else.  As I said above, I still haven't profiled this,
so I may have been prematurely optimising (or at least worrying
without cause, since the results of my worrying didn't significantly
change the code), but it seems to be a general rule that GPU
processing is limited by memory reads (and that this will get worse as
the number of cores increases).

- The point above is explained quite nicely in
- basically, unlike a CPU, a GPU has no cache.  When it is stalled
waiting for memory it simply switches to another thread.  This is why
it helps to have big loops.

- If you don't specify the work group size in the OpenCL EnqueueND...
call, then it appears to be set to 1!  This is absolutely crazy, if I
understand correctly, because it means that only one thread in a weft
is running (basically you want this number to be at least 32).  Having
said that, the 9800 appeared to handle this much better than then
8400, so perhaps more recent hardware is more flexible?

- Calculating a work group size automatically is non-trivial, because
it has to be a divisor of the total number of work items.  As far as I
can see there's no better way that factorising the total number and
then trying each different combination of factors to see which gets
closest to (but does not exceed) 512.  In practice, it's best to
expose this in your API (as a fallback I also have an "auto" option
that starts with the total number and throws away small factors until
the number is low enough - not brilliant, but simple) (there may be a
greedy algorithm to do this, but proof of concept code in C isn't the
time or place....) (and it's not Euclid's GCD either!)

- If you have large datasets, you need to worry about memory limits.
Current technology is 32bits, which means there's a limit of 4GB.  And
frustratingly it's in 4 separate banks (as far as I can tell you
cannot allocate more than 1/4 of the total memory on any NVidia GPU in
a single block) - you can explicitly jump between blocks in the kernel
(ie check offset and select array accordingly), but it makes life
messy.  Thankfully the next generation (Fermi, out early 2010 I
believe) has 64 bit addresses which should make this all go away (it
also has some kind of L2-like cache, which is interesting...).

- I hope to profile what I have and perhaps implement a better kernel
(I think we could exploit local memory...), but the initial results
were so good that there's little incentive to spend more time
improving what we have!

- When you're interfacing OpenCL / C with Matlab it helps to pass a
pointer back into Matlab (so that you can have persistent state across
calls).  The standard hack for doing this really does seem to be to
use a singleton matrix with a 64bit int on the Matlab side.  I found
some discussion in the Matlab support forums, and it worked for me.

- OpenCL (at least on the early/cheap hardware) is 32bit, while Matlab
uses doubles and longs.  So you need to convert your Matlab code with
single() and int32().  You also need to worry about lack of precision,
etc.

Andrew

nice work

From: "John Melonakos" <john.melonakos@...>

Date: Thu, 17 Dec 2009 20:36:12 -0500

Hi Andrew,

Looks like you've got some cool things cooking.  I'm with AccelerEyes
building Jacket.  There might be some possibilities for working together.
If you have some free time after the Holidays, it'd be nice to chat with you
by phone.

Best,

John

---

John Melonakos, PhD

CEO

AccelerEyes

Atlanta, GA USA

T: (770) 315-1099

W: http://www.accelereyes.com <http://www.accelereyes.com/>

_____

The information transmitted in this email is intended only for the person or
entity to which it is addressed and contains AccelerEyes LLC confidential
and proprietary information. It may also refer to such AccelerEyes LLC
confidential or proprietary information that is subject to the terms of a
confidentiality agreement.

Further Optimisation with OpenCL

From: andrew cooke <andrew@...>

Date: Fri, 1 Jan 2010 09:53:24 -0300

I had time to look at optimising the OpenCL code yesterday.  I tried
two approaches, and both gave significant gains, but both also have
limitations.

First, I tried changing an array use to store intermediate data from
global to local.  This means that instead of storing the data in the
main memory for the GPU, it is stored in an on-chip cache (and so is
much faster to access).  The amount of space is limited, so I also had
to restructure the program to calculate the absolute minimum amount of
data (which meant swapping the order of two loops nested inside each
other).

This worked - gave roughly a factor of 1.5 speedup on a simple test -
but the amount of memory required was too much for most use cases (the
cache is just 16kB, and even on Fermi will be only 48kB).

Second, I tried changing an array to a 2D image.  As I've described
elsewhere, the main difficulty we have is that the inner operation of
our loops contains a level of indirection - we need to read data from
an array, but not in a completely uniform, methodical manner.  This

An image is the abstraction used by OpenCL for texture maps.  I am
unsure exactly how NVidia implements these, but the end result is that
there is some kind of local cache that is used to improve access times
to arrays when they are used in a non-uniform way (an obvious example
is the kernel for direct convolution; originally I guess from the name
that they held texture images that are mapped to a 3D scene).

Obviously this is fairly risky - the image (in practice I actually
needed to use multiple images work around size limitation) is much
bigger than any cache memory, so whether a speedup is seen will depend
on the access pattern (you need bunched, repeated access).  Luckily,
this fits nicely with our problem, and gave a doubling of speed on our
worst performing dataset :o)

This work took a day, so the payoff is significant - twice the speed
for a day's work means that it is definitely worth investigating.

(I should admit here that I still haven't used NVidia's profiler, but
I am increasingly confident in my mental model of how the code/GPU are
working together.)

Andrew