normalize a matrix colwise on the GPU




I am trying to normalize a matrix column-wise on the GPU with good performance. I wrote this kernel:


__global__ void test(float * x) { // data of Matrix3Xf x
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    if (id < colSize) { // stay within the limits of x
        int col = id * 3;
        float norm = 1 / norm3df(x[col], x[col + 1], x[col + 2]);

        x[col] = x[col] * norm;
        x[col + 1] = x[col + 1] * norm;
        x[col + 2] = x[col + 2] * norm;
    }
}



I found the data type float3, which is part of the Cg Toolkit, but that toolkit is no longer maintained (cf. https://developer.nvidia.com/cg-toolkit). Any ideas how to make this faster? My environment is Visual Studio 2017 and CUDA 9.2.



Thanks in advance.





Suggesting how to make something faster requires knowing how fast it is now, and you have not provided nearly enough information for that. Can you please edit a proper Minimal, Complete, and Verifiable example, along with your performance measurements, into your question? Otherwise I doubt you will get a useful answer.
– talonmies
Aug 8 at 14:11




2 Answers



Your kernel looks fine, so your code is probably not compute-bound.



Beyond that, I would suggest making the memory accesses to x coalesced. That is, store your 3-by-colSize matrix x row-major (all first components, then all second components, then all third components), and in the kernel do something like:




__global__ void test(float * x, int colSize) // data of Matrix3Xf x
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    if (id < colSize) // stay within the limits of x
    {
        float norm = 1 / norm3df(x[id], x[id + colSize], x[id + 2*colSize]);

        x[id] *= norm;
        x[id + colSize] *= norm;
        x[id + 2*colSize] *= norm;
    }
}
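
For reference, here is a minimal sketch of how such a kernel might be driven from the host when the data currently sits in the interleaved layout of the question (x0, y0, z0, x1, ...). The names normalizeRowMajor and normalizeColumns and the block size of 256 are assumptions made for illustration, not part of the original code. Error checking is omitted for brevity, and the reordering is done on the CPU only to make the index mapping explicit.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same idea as the kernel above: x holds all first components,
// then all second components, then all third components.
__global__ void normalizeRowMajor(float* x, int colSize)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < colSize) {
        float inv = 1.0f / norm3df(x[id], x[id + colSize], x[id + 2 * colSize]);
        x[id]               *= inv;
        x[id + colSize]     *= inv;
        x[id + 2 * colSize] *= inv;
    }
}

// Hypothetical host helper: takes interleaved data (x0,y0,z0,x1,...),
// normalizes every column on the GPU and returns interleaved data again.
std::vector<float> normalizeColumns(const std::vector<float>& interleaved)
{
    const int colSize = static_cast<int>(interleaved.size() / 3);
    if (colSize == 0) return {};            // nothing to do, avoid an empty launch

    // Reorder to the row-major layout the kernel expects.
    std::vector<float> planar(3 * colSize);
    for (int c = 0; c < colSize; ++c)
        for (int r = 0; r < 3; ++r)
            planar[r * colSize + c] = interleaved[3 * c + r];

    const size_t bytes = planar.size() * sizeof(float);
    float* d_x = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, planar.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                // assumed block size
    const int blocks  = (colSize + threads - 1) / threads;
    normalizeRowMajor<<<blocks, threads>>>(d_x, colSize);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(planar.data(), d_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);

    // Reorder back to the interleaved layout.
    std::vector<float> out(3 * colSize);
    for (int c = 0; c < colSize; ++c)
        for (int r = 0; r < 3; ++r)
            out[3 * c + r] = planar[r * colSize + c];
    return out;
}
```

With this layout, consecutive threads read consecutive floats in each of the three loads, which is what allows the accesses to coalesce. The kernel only gives correct results if the buffer really is stored this way; running it on interleaved data mixes components from different columns.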






Hi, this isn't working. I guess you meant this: x[id * 3] *= norm; x[id * 3 + 1] *= norm; x[id * 3 + 2] *= norm; Also, computing int col = id * 3; once up front is faster than computing id * 3 several times inside the brackets.
– helena
Aug 10 at 12:43



This works best so far:


__global__ void test( float * x, int colSize ) // data of Matrix3Xf x
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    if (id < colSize) // stay within the limits of x
    {
        int col = id * 3;
        float norm = x[col] * x[col] + x[col + 1] * x[col + 1] + x[col + 2] * x[col + 2];
        norm = rsqrtf(norm);

        x[col] = x[col] * norm;
        x[col + 1] = x[col + 1] * norm;
        x[col + 2] = x[col + 2] * norm;
    }
}
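
To compare variants like this with actual numbers, the kernel can be timed with CUDA events. The following is a minimal sketch under some assumptions: the names normalizeInterleaved and timeNormalize, the block size of 256, and the single warm-up launch are illustrative choices, colSize and repeats are assumed to be positive, and error checking is left out.

```cuda
#include <cuda_runtime.h>

// rsqrtf variant from above, on interleaved data (x0,y0,z0,x1,...).
__global__ void normalizeInterleaved(float* x, int colSize)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < colSize) {
        int col = id * 3;
        float a = x[col], b = x[col + 1], c = x[col + 2];
        float inv = rsqrtf(a * a + b * b + c * c);
        x[col]     = a * inv;
        x[col + 1] = b * inv;
        x[col + 2] = c * inv;
    }
}

// Hypothetical helper: average time per launch in milliseconds.
// d_x must point to 3 * colSize floats in device memory.
float timeNormalize(float* d_x, int colSize, int repeats)
{
    const int threads = 256;                 // assumed block size
    const int blocks  = (colSize + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup costs are not measured.
    normalizeInterleaved<<<blocks, threads>>>(d_x, colSize);

    cudaEventRecord(start);
    for (int i = 0; i < repeats; ++i)
        normalizeInterleaved<<<blocks, threads>>>(d_x, colSize);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until all launches finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / repeats;
}
```

Re-normalizing already normalized columns does not change the amount of memory traffic, so repeated launches on the same buffer are acceptable for timing purposes.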






