C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Choochoo Training Diary

Last 100 entries

A Practical Introduction to OpenCL

From: andrew cooke <andrew@...>

Date: Fri, 19 Mar 2010 17:38:21 -0400

This is an informal introduction to OpenCL, aimed at people who need to write
code for a GPU and are confused about the different concepts, and how these
map to the hardware.

I will take a very practical approach, but one that will, therefore, not
reflect the true generality of what OpenCL offers.

Disclaimer: I have worked on only 3 GPU projects, and am still learning.



Work Items
----------

Imagine that we are translating some nested loops from C code that runs on a
CPU (if your code is in Fortran, translate the loops to C first, and get that
code working, before trying to move to OpenCL - it's much easier to fix
pointer errors on a CPU).  Our code will have the following structure:

  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      loop body
    }
  }

Here we have two external loops and we evaluate the loop body for each
different value of i and j.  The code in the loop body will become our KERNEL.

This code would map onto the GPU as n*m different WORK ITEMS.  Each work item
is a separate "run" of the loop body, together with a particular value for i
and j.

It's useful to imagine a table of these work items, arranged with i going
horizontally and j going vertically.  Each entry in the table is a separate
work item, and the table as a whole is called the INDEX SPACE.

There's nothing special about 2 dimensions - we just happened to have two
loops in the original code.  Everything I am going to say works for 1, 2 or 3
dimensions (2 dimensions is easiest to explain, so I use that here).

I'll say more about decoupling index space from the actual values of n and m
later; the above is enough to get started.


Hardware
--------

Before we can do anything with the work items we need to understand the
hardware.

The smallest hardware unit is a PROCESSING ELEMENT.  A processing element, at
any moment in time, can work on only a single thread.  It's something like an
ALU in a CPU.  A processing element has its own memory - something like
registers - called PRIVATE MEMORY.  You don't access private memory
explicitly, it's where variables in your kernel are stored.

Processing elements are grouped into COMPUTE UNITS.  Typically a compute unit
contains about 16 processing elements.  Each compute unit has its own memory,
which can be accessed from all the processing elements within that unit,
called LOCAL MEMORY.

Although processing elements can do basic calculations, they are pretty dumb.
The compute unit takes care of a lot of the scheduling work.  That means that
all processing elements have to step through the same code, line by line,
together.

So it's not that useful to think of a compute unit as a multiple core CPU,
because each core in a CPU has much more independence.  Instead, think of a
compute unit as a single core with many ALUs - at this level it's targeted at
SIMD computing.

In addition, a GPU card (a COMPUTE DEVICE) contains GLOBAL MEMORY.  Typically
global memory is of order GB, while local memory is of order KB.

The global memory is loaded with data before a kernel is invoked; once
processing finishes results are read back from the same memory (this usually
happens over PCI).


Allocating Work to Hardware
---------------------------

OpenCL provides control over how we map work items to compute units.  This is
important if we want to share data between work items, because they can use
the local memory to do so efficiently.

You might think that you are restricted to the number of processing elements
when allocating work items to a compute unit - but if so, you are wrong!  For
best efficiency you usually want to have many more work items than processing
elements.  This is because the compute unit will switch between items when
processing elements stall reading from global memory.

[This is an import difference between GPUs and CPUs.  A CPU tries to avoid
waiting for memory reads by always having data in the local cache.  A compute
unit, in contrast, switches to another thread, and hopes to stay busy while
waiting for the data.]

At the same time, for perfect use of resources, the number of work items given
to a compute unit should be an exact multiple of the number of processing
elements it contains, because otherwise some will spend time "empty".
However, this is not a requirement (and, as long as you have a large enough
number of items, is not a serious issue).

So we group work items into WORK GROUPS.  One work group is a collection of
work items associated with a particular compute unit.  In a broad sense, we
can think of all the work items within a work group running "together",
although at a less abstract level we know that if we have more work items than
processing elements then a single processing element will actually switch
between several different work items, as necessary.

Returning to the index space, work groups must "tile" that space.  So in 2
dimensions we might have 6 work groups, dividing our imaginary table of work
items by 2 horizontally and 3 vertically.  Note that all work groups must
contain the same number of work items (have the same shape and size), and they
must fit exactly within the index space.

The last requirement - that the index space must be an exact multiple of the
work group along each dimension - is not as worrying as it sounds.  I will
explain how the size of the index space can (usually) be decoupled from the
underlying problem space (n, m) later.


Writing the Kernel
------------------

Given the above, we can take our index space, divide it into regular pieces to
create work groups, and send the work groups to the compute units.  But what
we haven't yet addressed is how an individual work item, when it executes
within a processing element, knows its own location in the index space.

In other words, each work item is going to run the "loop body" in our original
example, so each work item needs a different value of i and j.  To do this,
the kernel code starts with:

  int i = get_global_id(0);
  int j = get_global_id(1);
  loop body

where get_global_id(dim) is a function provided by OpenCL that gives us the
value we need.

Also, if we want to communicate between work items in a common work group
(making use of local memory) we need a way to identify each work item within
that group.  This is done with get_local_id(dim).


Decoupling Index Space
----------------------

So far we have assumed that the index space covers the range 0..n in the first
dimension and 0..m in the second.  That fits naturally with the original code,
but can cause problems.  For example, if n is a prime number then there is no
way to divide the index space into separate work groups (except for dividing
that dimension into as many workgroups as there are values, in which case
there would be a separate work group for every distinct value of i - sometimes
this is useful, but often it makes the work groups too small).

Luckily there's a neat trick that lets us choose an arbitrary size for the
index space - we simply repeat *within* the kernel as necessary.  It's easier
to show the kernel code than explain this in detail:

  for (int i = get_global_id(0); i < n; i += get_global_size(0)) {
    for (int j = get_global_id(1); j < m; j += get_global_size(1)) {
      loop body
    }
  }

Where get_global_size(dim) is the size of the index space.  This takes any
index space and tiles *that* over the problem space!


Conditional Code
----------------

A quick aside to avoid confusion below.

Earlier I said that each processing element was "pretty dumb" and that "all
processing elements [within a particular compute unit] have to step through
the same code, line by line".

That raise an obvious question - what happens with code like:

  if (i % 2) then {
    foo = bar;
  } else {
    foo = baz;
  }

If "i" differs by 1 for each work unit then half of the processing elements
should assign bar to foo and half should assign baz.  This cannot happen if
they are all executing the same code "line by line".

In fact, processing elements are effectively turned on and off individually
for each line (more exactly, each instruction).  So the code above is
translated into something more like:

  if (i % 2) foo = bar
  if (! (i % 2)) foo = baz

Which allows the kernel to handle conditionals, but also wastes cycles (since
half the processing elements are idle on each line).


Efficiency
----------

Efficiency comes mainly from reducing the cost of reading global memory.  In
particular, to COALESCE reads.  To understand how this works you need to put
together two things.

First, you need to understand that there's a big difference between addressing
memory and reading it.  Addressing a particular point in global memory is
slow.  But once you have "selected" that point, reading data from successive
memory locations is actually pretty damn fast.  So the "trick" is to make sure
that when you need to read several bytes, they are neighbours.

Second, you need to understand exactly which requests can be joined in this
way.  We are *not* talking about successive reads from a single thread.
Instead, we need to consider all the processing elements in a compute unit.
Remember that they are all running through the kernel code in lockstep, line
by line.  So when one reads a value, they all will (modulo conditionals as
explained above).

The trick, then, is to make your code look like this:

  float *data, foo;
  [...]
  foo = *(data + i);

where we're assuming that i "counts" the work items.  Those reads will then be
coalesced into one fast "slurp" of data.

There are many more fine details, but that's the general idea.


More
----

Of course, there's much more to OpenCL than this.

I haven't described the kernel code in any detail.  Although the "core" may be
very similar to your original "inner loop" you still need to copy across all
the pointers and local variables that the code needs.

I haven't described synchronisation and access to local memory.  And I haven't
described how multiple kernels can be invoked, and how they can depend on each
other.  Or how to use multiple devices.

OpenCL supports all this and more, but hopefully the above gives a clearer
idea of how to map a simple problem to the hardware, and how that ties in with
memory access.


Andrew

Re: A Practical Introduction to OpenCL

From: andrew cooke <andrew@...>

Date: Tue, 23 Mar 2010 09:48:36 -0400

From: Dr David Philpot <drdave@...>

Hi Andrew,

Thanks for this post.  I found it very useful.  I've just started to look 
into OpenCL and ways to make use of stream processors for non graphics 
related projects.

Will you be expanding on your post at all?
Are you able to provide some specific examples?

Cheers
David.

OpenCL Examples

From: andrew cooke <andrew@...>

Date: Tue, 23 Mar 2010 10:35:14 -0400

Hi,

Thanks for positive comments.

My current OpenCL work is "commercially sensitive", so I can't post the
source.  Having said that, I am slowly starting to put together a library of
support functions that I might be able to persuade the people paying for the
work to publish - but that will probably be months away, if at all.

From Septmeber I am taking a sabbatical (well, my partner is really the one
taking the sabbatical - I am just going along and incidentally giving up work
due to US Visa restrictions...) and then I hope to do some experimentation of
my own with GPUs.  But again that is some time away.

Some unsolicited advice: if the post made sense, you probably need to just
jump in.  I started by adapting a simple example from NVidia's toolkit.  That
and the specification -
http://www.khronos.org/registry/cl/specs/opencl-1.0.48.pdf - is all you really
need.  From a distance it looks much more complex than it actually is.  Once
you start writing code and calling functions, it all becomes a lot simpler
(although, I have to admit, debugging is a pig - it's best to start with
working C code in a loop, as I described, and then move that to a kernel, so
that you are sure that the majority of the code you have is OK).

Andrew

A Practical Introduction to OpenCL

From: Dr David Philpot <drdave@...>

Date: Tue, 23 Mar 2010 21:19:11 +1000

Hi Andrew,

Thanks for this post.  I found it very useful.  I've just started to 
look into OpenCL and ways to make use of stream processors for non 
graphics related projects.

Will you be expanding on your post at all?
Are you able to provide some specific examples?

Cheers
David.

RE:

From: Ravi Basavaraj <ravi.bks@...>

Date: Tue, 20 Mar 2012 16:14:07 -0500

Thank you for the quick answer.

=20

Regarding my first question=2C below is another example which
uses get_global_size(int dim).

=20

int     x =3D get_global_id(0)=3B

int     y =3D get_global_id(1)=3B

int           num_pixels_per_workitem =3D 32=3B

int     i=2C idx=3B

  =20
for (i=3D0=2C idx=3Dx=3B i<num_pixels_per_workitem=3B i++=2C
idx+=3Dget_global_size(0)){

=20

    float4 clr =3D read_imagef(idx=2Cy)=3B //reads
the pixel value at (idx=2Cy) position

    =85.. //process the pixel.

  =20
}

=20

32 pixels are grouped as one work item so if get_global_size(0)
returns 64=2C how=20

will this kernel actually read 32 pixels of the work item by
incrementing idx by get_global_size(0)=3B

Incrementing idx by get_global_size(0) will make the
index(idx) point to the pixels values which are beyond this work time=2C ri=
ght ?
Thank you.Ravi 		 	   		   		 	   		  =

Re: Questions on OpenCL

From: andrew cooke <andrew@...>

Date: Tue, 20 Mar 2012 16:52:44 -0300

Hi,

Your email was multipart and so exceeded my ancient blog's capabilities, but
the version below will appear (hopefully) with my replies:

> On Tue, Mar 20, 2012 at 01:49:08PM -0500, Ravi Basavaraj wrote:
>  
> Hi Andrew,
> 
> This is a very neat explanation. Thank you so much. I have a couple of
> questions:
> 
> 1) In your "Decoupling Index Space" explanation,The ides behind using
> get_global_size(0)/(1) is to make sure the loop runs only once, right ? 
> Then why use the "for" loop at all ?

No!  get_global_size() returns the total number of work items.  That section
is describing what to do when the total number of things to be calculated (n,
m) is more than that (typically because they are not nice multiples of 2).

So some loops will repeat, to "fill up to the edges" the work that needs to be
done.

> 2) In your "Conditional Code" explanation, could you pleaseexplain a little
> bit more on "This cannot happen if they are all executing the same code
> "line by line".".What cannot happen if all the work items are executing the
> same code "line by line"?

You are used to code on a CPU where one thread can jump while another does not
- each thread has its own program counter.  This is not the case on a GPU.
There is a single, global "program counter" and so all work items must execute
the same code (see almso my answer below).

So we have a problem with:

  if (i % 2) then {
    foo = bar;
  } else {
    foo = baz;
  }

because every other work item wants to do a jump!  That is why the code is
translated to something more like:

  if (i % 2) foo = bar
  if (! (i % 2)) foo = baz

because the GPU *is* smart enough to be able to switch individual operations
on and off according to local data.

NOTE: I wrote this two years ago and no longer remember where I learnt this.
Perhaps I am wrong?  But it is a side detail that I don't think is important
for general understanding.

> Finally,when the processing elements execute the kernel code, will all the
> processing elements get a copy of the binary (machine code of the
> instructions) for every work item to execute or how is this managed ?

I do not know how the hardware works in detail, but it is more like there
being a single copy that all processing elements execute at the same time.
This is related to the point above: because all processing elements are doing
"the same thing" some cannot branch while others do not.

This is why it is called SIMD - single instruction, multiple data.  Because
there is a single instruction (a single copy of the machine code) for all the
compute units (all the data).

https://en.wikipedia.org/wiki/SIMD

> Thanks n Regards,Ravi

No problem,

Andrew

Comment on this post