How many threads (or work-items) can run at the same time?

Laure Jonchery · Apr 15, 2011 · Viewed 15.2k times

I'm new to GPGPU programming and I'm working with NVIDIA's implementation of OpenCL.

My question is how to determine the limits of a GPU device (in number of threads).
From what I understand, there are a number of work-groups (the equivalent of blocks in CUDA), each containing a number of work-items (~ CUDA threads).

  • How do I get the number of work-groups present on my card (and that can run at the same time), and the number of work-items in one work-group?

  • What does CL_DEVICE_MAX_COMPUTE_UNITS correspond to?
    The Khronos specification speaks of cores ("The number of parallel compute cores on the OpenCL device."). What is the difference with the CUDA cores given in the specification of my graphics card? In my case, OpenCL reports 14, while my GeForce 8800 GT has 112 cores according to the NVIDIA website.

  • Does CL_DEVICE_MAX_WORK_GROUP_SIZE (512 in my case) correspond to the total number of work-items assigned to a specific work-group, or to the number of work-items that can run at the same time within a work-group?

Any suggestions would be greatly appreciated.

Answer

Eric Bainville · Apr 18, 2011

The OpenCL standard does not specify how the abstract execution model provided by OpenCL is mapped to the hardware. You can enqueue any number T of threads (work items), and provide a workgroup size (WG), with at least the following constraints (see OpenCL spec 5.7.3 and 5.8 for details):

  • WG must divide T
  • WG must be at most DEVICE_MAX_WORK_GROUP_SIZE
  • WG must be at most KERNEL_WORK_GROUP_SIZE, returned by clGetKernelWorkGroupInfo; it may be smaller than the device max workgroup size if the kernel consumes a lot of resources (see the query sketch after this list).
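
As a minimal sketch of how to query these limits (assuming you already have a valid cl_device_id and a built cl_kernel; error checking omitted for brevity):

    /* Query the device and kernel limits discussed above. */
    #include <stdio.h>
    #include <CL/cl.h>

    static void print_limits(cl_device_id device, cl_kernel kernel)
    {
        cl_uint compute_units;            /* e.g. 14 on a GeForce 8800 GT */
        size_t dev_max_wg, kernel_max_wg;

        /* Number of compute units ("multiprocessors") */
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);

        /* Device-wide upper bound on work-group size, e.g. 512 */
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(dev_max_wg), &dev_max_wg, NULL);

        /* Per-kernel bound; may be smaller than the device bound */
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(kernel_max_wg), &kernel_max_wg, NULL);

        printf("compute units:              %u\n", (unsigned)compute_units);
        printf("device max work-group size: %zu\n", dev_max_wg);
        printf("kernel max work-group size: %zu\n", kernel_max_wg);
    }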

The implementation manages the execution of the kernel on the hardware. All threads of a single workgroup must be scheduled on a single "multiprocessor", but a single multiprocessor can manage several workgroups at the same time.

Threads inside a workgroup are executed by groups of 32 (NVIDIA warp) or 64 (AMD wavefront). Each micro-architecture does this in a different way. You will find more details in NVIDIA and AMD forums, and in the various docs provided by each vendor.
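
As a hedged aside (this requires an OpenCL 1.1 or later runtime, which postdates the original question), you can query this granularity directly: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE typically reports 32 on NVIDIA and 64 on AMD.

    /* Assumes an OpenCL 1.1+ runtime and existing kernel/device handles. */
    size_t multiple;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    /* Choosing WG as a multiple of this value avoids partially
       filled warps/wavefronts. */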

To answer your question: there is no limit to the number of threads. In the real world, your problem is limited by the size of the inputs/outputs, i.e. the size of the device memory. To process a 4 GB buffer of floats, you can enqueue 1G threads, with WG=256 for example. The device will then have to schedule 4M workgroups on its small number (say between 2 and 40) of multiprocessors.
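
As a sketch of that example (assuming a cl_command_queue named queue and the kernel are already set up, with one work-item per float):

    /* 1G work-items in work-groups of 256 => 4M work-groups to schedule. */
    size_t global = 1u << 30;  /* one work-item per float in a 4 GB buffer */
    size_t local  = 256;       /* WG = 256; note it divides global exactly */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, NULL);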