Meaning of the CL_DEVICE… parameters

I've implemented a function that retrieves some information about my OpenCL devices. Specifically, I have this device:
1. Vendor NVIDIA Corporation
1. Device: GeForce GTX 1070
1.1 Hardware Version: OpenCL 1.2 CUDA
1.2 Software Version: 391.24
1.3 OpenCL C version: OpenCL C 1.2
1.4 Address bits: 64
1.5 Max Work Item Dimensions: 3
1.6 Work Item Sizes 1024 1024 64
1.7 Work group size: 1024
1.8 Parallel compute units 15
And I need to be sure I understand some of them (specifically work groups/items).
Given I have Work Item Sizes : 1024 1024 64 this means that when I instantiate the kernel I can use a total amount of 2^26 work items, is this correct? The Work group size : 1024 is, I guess, the maximum number of work items per work group (in case I need to use barriers etc. I guess this info is useful). I'm not sure about the Parallel compute units because, given the name, I would expect this to be covered somehow by the work items, so what's the meaning of Parallel compute units (CL_DEVICE_MAX_COMPUTE_UNITS)?
And one more question
Is there any relationship between Address bits and Work items?
Thank you
1 Answer
What's the meaning of Parallel compute units
On CPUs, this is the number of logical processors. On NVIDIA GPUs it is the number of "Streaming Multiprocessors"; on AMD GPUs they are actually called "Compute Units".
The point of having these in OpenCL is that on some devices you can "carve them up" by their compute units (device partitioning, clCreateSubDevices in OpenCL 1.2) and launch kernels independently on these units.
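To relate this to work items: a kernel launch is split into work-groups, and each work-group runs on a single compute unit, so the number of compute units does not cap the number of work items you can launch. Here is a minimal sketch of that arithmetic for a 1D launch. `work_groups_1d` is a hypothetical helper, not an OpenCL call, and the sizes in the note below are just the figures from the question used as an example.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the OpenCL API): number of work-groups
   needed along one dimension, rounding up when global is not a multiple of
   local. OpenCL 1.2 requires the global size to be a multiple of the local
   size, so a real launch would pad the global size to groups * local. */
static size_t work_groups_1d(size_t global, size_t local)
{
    return (global + local - 1) / local;
}
```

For example, `work_groups_1d((size_t)1 << 26, 1024)` gives 65536 work-groups, far more than the 15 compute units reported above; the runtime hands them to the compute units as those become free.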
Given I have Work Item Sizes : 1024 1024 64 this means that when I instantiate the kernel I can use a total amount of 2^26 work items, is this correct?
Incorrect. These are the per-dimension maximums for a single work-group, not for the whole launch. The Work group size limit is the maximum for the product of all dimensions. In other words, with a maximum "Work group size" of 1024 you could use a local size of e.g. [1024,1,1] or [128,8,1] or [4,16,4], but launching with [2000,1,1] or [100,100,1] will fail. Go ahead and try it.
The reason for this (usually) small limit is related to barriers, but also to local memory sizes (which are relatively tiny on most GPUs).
Also, it's actually explained in the documentation for clEnqueueNDRangeKernel:
local_work_size
Points to an array of work_dim unsigned values that describe the number of
work-items that make up a work-group (also referred to as the size of the
work-group) that will execute the kernel specified by kernel. The total
number of work-items in a work-group is computed as local_work_size[0]
*... * local_work_size[work_dim - 1]. The total number of work-items
in the work-group must be less than or equal to the CL_DEVICE_MAX_WORK_GROUP_SIZE value
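The two limits can be checked on the host before enqueueing. This is a minimal sketch, assuming you have already read CL_DEVICE_MAX_WORK_ITEM_SIZES and CL_DEVICE_MAX_WORK_GROUP_SIZE with clGetDeviceInfo; `local_size_fits` is a hypothetical helper, not part of the OpenCL API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of the OpenCL API). Returns true when
   local_work_size respects both device limits:
   - each dimension <= the matching CL_DEVICE_MAX_WORK_ITEM_SIZES entry
   - product of all dimensions <= CL_DEVICE_MAX_WORK_GROUP_SIZE */
static bool local_size_fits(size_t work_dim,
                            const size_t *local_work_size,
                            const size_t *max_work_item_sizes,
                            size_t max_work_group_size)
{
    size_t total = 1;
    for (size_t i = 0; i < work_dim; ++i) {
        size_t n = local_work_size[i];
        /* Per-dimension limit; a zero-sized dimension is never valid. */
        if (n == 0 || n > max_work_item_sizes[i])
            return false;
        /* Product limit, written as a division so total * n cannot overflow. */
        if (n > max_work_group_size / total)
            return false;
        total *= n;
    }
    return true;
}
```

With the limits from the question ({1024, 1024, 64} and 1024), the sizes {1024, 1, 1}, {128, 8, 1} and {4, 16, 4} pass, while {2000, 1, 1} fails the per-dimension check and {100, 100, 1} fails the product check.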