OpenCL: Work items, Processing elements, NDRange

lambdarookie · Jan 19, 2014

My classmates and I are being confronted with OpenCL for the first time. As expected, we ran into some issues. Below I have summarized the issues we had and the answers we found. However, we're not sure that we got it all right, so it would be great if you could take a look at both our answers and the questions below them.

Why didn't we split that up into single questions?

  1. They partly relate to each other.
  2. We think these are typical beginner's questions. Those fellow students who we consulted all replied "Well, that I didn't understand either."

Work items vs. Processing elements

In most of the lectures on OpenCL that I have seen, the same illustration is used to introduce compute units and processing elements as well as work groups and work items. This has led my classmates and me to continuously confuse these concepts. Therefore we came up with a definition that emphasizes the fact that processing elements are very different from work items:

  • A work item is a kernel that is being executed, whereas a processing element is an abstract model that represents something that actually does computations. A work item is something that exists only temporarily in software, while a processing element abstracts something that physically exists in hardware. However, depending on the hardware and therefore depending on the OpenCL implementation, a work item might be mapped to and executed by some piece of hardware that is represented by a so-called processing element.

Question 1: Is this correct? Is there a better way to express this?

NDRange

This is how we perceive the concept of NDRange:

  • The number of work items that exist is represented by the NDRange size. Commonly, this is also referred to as the global size. However, the NDRange can be one-, two-, or three-dimensional ("ND"):
    • A one-dimensional problem would be some computation on a linear vector. If the vector's size is 64 and there are 64 work items to process that vector, then the NDRange size equals 64.
    • A two-dimensional problem would be some computation on an image. In the case of a 1024x768 image, the NDRange size Gx would be 1024 and the NDRange size Gy would be 768. This assumes that there are 1024x768 work items, one to process each pixel of that image. The NDRange size then equals 1024x768.
    • A three-dimensional example would be some computation on a 3D model. In that case there is additionally an NDRange size Gz.

Question 2: Once again, is this correct?

Question 3: These dimensions are simply there for convenience, right? One could simply store the color values of each pixel of an image in a linear vector of size width * height. The same is true for any 3D problem.
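The flattening described in Question 3 can be sketched in plain C as follows. This is only an illustration under the assumption of row-major storage; the function names are hypothetical and not part of OpenCL. In a 2D kernel, x and y would typically come from get_global_id(0) and get_global_id(1).

```c
#include <stddef.h>

/* Row-major flattening: maps pixel (x, y) of an image that is
 * `width` pixels wide to an index into a linear buffer of size
 * width * height. (Hypothetical helper, for illustration only.) */
size_t flatten_2d(size_t x, size_t y, size_t width) {
    return y * width + x;
}

/* Inverse mapping: recover the column from a linear index. */
size_t flat_to_x(size_t i, size_t width) {
    return i % width;
}

/* Inverse mapping: recover the row from a linear index. */
size_t flat_to_y(size_t i, size_t width) {
    return i / width;
}
```

With a 1024x768 image, a 1D NDRange of 1024 * 768 work items could use the two inverse helpers to recover (x, y) from a single global id, which is why the extra dimensions are mostly a convenience.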

Various

Question 4: We were told that the execution of kernels (in other words: work items) can be synchronized within a work group using barrier(CLK_LOCAL_MEM_FENCE);. Understood. We were also (repeatedly) told that work groups cannot be synchronized with each other. Alright. But then what's the use of barrier(CLK_GLOBAL_MEM_FENCE);?

Question 5: In our host program, we specify a context that consists of one or more device(s) from one of the available platforms. However, we can only enqueue kernels in a so-called command queue that is linked to exactly one device (that has to be in the context). Again: The command queue is not linked to the previously defined context, but to a single device. Right?

Answer

CaptainObvious · Jan 20, 2014

Question 1: Almost correct. A work-item is an instance of a kernel (see paragraph 2 of section 3.2 of the standard). See also the definition of processing element from the standard:

Processing Element: A virtual scalar processor. A work-item may execute on one or more processing elements.

See also the answer I provided to that question.

Question 2 & 3: Whether you use more than one dimension, and whether you launch exactly as many work-items as you have data elements to process, depends on your problem. It's up to you and whatever makes development easier. Note also that OpenCL 1.2 and below impose a constraint: the global size must be a multiple of the work-group size (this constraint was removed in OpenCL 2.0).
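One common way to live with that multiple-of constraint is to round the global size up to the next multiple of the work-group size on the host, and have the surplus work-items do nothing. The helper below is a minimal sketch of the rounding step in plain C; its name is hypothetical, and it assumes local_size is non-zero.

```c
#include <stddef.h>

/* Rounds n up to the next multiple of local_size, so the result can
 * be used as a global size with OpenCL 1.2 and below.
 * Assumes local_size > 0. (Hypothetical helper, not an OpenCL API.) */
size_t round_up_global_size(size_t n, size_t local_size) {
    size_t remainder = n % local_size;
    if (remainder == 0) {
        return n;                       /* already a multiple */
    }
    return n + (local_size - remainder);
}
```

Because the padded range contains more work-items than data elements, the kernel then needs a bounds check along the lines of `if (get_global_id(0) >= n) return;`, with n passed in as a kernel argument.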

Question 4: Yes, synchronization during the execution of a kernel is only possible within a work-group, by means of barriers. The flags you pass as a parameter differ in the type of memory they refer to. With CLK_LOCAL_MEM_FENCE, all work-items have to make sure that the data they write to local memory is visible to the other work-items in the work-group. With CLK_GLOBAL_MEM_FENCE it's the same, but for global memory.
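To make the visibility guarantee concrete, here is a sequential emulation of a single work-group in plain C. It is not OpenCL code and nothing runs in parallel; it only mimics the ordering that a barrier enforces: every work-item finishes its write before any work-item reads a neighbour's value. The function name, GROUP_SIZE and the doubling step are assumptions made up for this illustration.

```c
#include <stddef.h>

#define GROUP_SIZE 8   /* assumed work-group size for this sketch */

/* Emulates one work-group and returns the value that work-item `lid`
 * reads from its right neighbour's slot after the barrier.
 * (Hypothetical helper; a real kernel would use __local memory.) */
int neighbour_after_barrier(const int in[GROUP_SIZE], size_t lid) {
    int local_mem[GROUP_SIZE];   /* stands in for __local memory */

    /* Phase 1: every work-item writes its own slot. */
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        local_mem[i] = in[i] * 2;
    }

    /* In a kernel, barrier(CLK_LOCAL_MEM_FENCE) would go here: no
     * work-item proceeds until all writes above are visible. */

    /* Phase 2: work-item `lid` reads the slot written by its
     * right neighbour (wrapping around at the end of the group). */
    return local_mem[(lid + 1) % GROUP_SIZE];
}
```

Without the barrier, a real work-item could reach phase 2 before its neighbour has completed phase 1 and read a stale value. If the buffer lived in global memory instead, CLK_GLOBAL_MEM_FENCE would play the same role, still only among the work-items of one work-group.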

Question 5: Within a context you can have several devices, each of which can have several command queues. As you stated, a command-queue is linked to one device, but you can enqueue your kernels in different command-queues on different devices. Note that if two command-queues try to access the same memory object without synchronization, you get undefined behavior. You'd typically use two or more command queues when their respective jobs are not related.

However, you can synchronize command-queues through events, and as a matter of fact you can also create your own events (called user events). See section 5.9 of the standard for events and section 5.10 for user events.

I'd advise you to read at least the first chapters (1 to 5) of the standard. If you're in a hurry, read at least chapter 2, which is actually the glossary.