Matrix Multiplication : Performance decreases after coalescing global memory access in CUDA




I have recently started working with GPUs using CUDA. As a starter programme I tried to implement a simple matrix multiplication efficiently:



C = AB



Starting with the naive matrix multiplication (each thread loads all the elements of A and B needed for one element of C), the tiled implementation (threads collaboratively load a tile of elements from A and B into shared memory to reduce global memory traffic) provides a good speed-up.
However, in the tiled implementation too the access to the global memory is not in coalesced order. So, to increase performance it is better to transpose matrix B and then multiply. Below is my code:


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include <time.h>
#include <sys/time.h>

void querydeviceprop();
void allocate_matrix(float *h_a, float *h_b, int matDim);
void verify(float *h_c, float *h_c_check, int matDim);
void print_matrix(float *ha, int m, int n);
void transpose_matrix(float *ha, int matDim);
void mat_mul();

#define TILE_WIDTH 16 //should be equal to numThread for tiling implementation

__global__ void MatrixMult_tiling(float *d_a, float *d_b, float *d_c, int dim)
{
    __shared__ float ta[TILE_WIDTH][TILE_WIDTH]; //to load one tile of A
    __shared__ float tb[TILE_WIDTH][TILE_WIDTH]; //to load one tile of B
    int bx, by, tx, ty, i, j;
    float res;
    int row, col;

    bx = blockIdx.x;  by = blockIdx.y;
    tx = threadIdx.x; ty = threadIdx.y;

    row = by*TILE_WIDTH + ty;
    col = bx*TILE_WIDTH + tx;

    res = 0;
    for (i = 0; i < dim/TILE_WIDTH; i++)
    {
        //collaboratively load the elements. Each thread loads a single element.
        ta[ty][tx] = d_a[row*dim + TILE_WIDTH*i + tx];
        tb[ty][tx] = d_b[(ty + i*TILE_WIDTH)*dim + col];

        __syncthreads();
        for (j = 0; j < TILE_WIDTH; j++)
            res = res + ta[ty][j]*tb[j][tx];

        __syncthreads();
    }
    d_c[row*dim + col] = res;
}


__global__ void MatrixMult_tiling_coalesced(float *d_a, float *d_b, float *d_c, int dim)
{
    __shared__ float ta[TILE_WIDTH][TILE_WIDTH]; //to load one tile of A
    __shared__ float tb[TILE_WIDTH][TILE_WIDTH]; //to load one tile of B (B is transposed on the host)
    int bx, by, tx, ty, i, j;
    float res;
    int row, col;

    bx = blockIdx.x;  by = blockIdx.y;
    tx = threadIdx.x; ty = threadIdx.y;

    row = by*TILE_WIDTH + ty;
    col = bx*TILE_WIDTH + tx;

    res = 0;
    for (i = 0; i < dim/TILE_WIDTH; i++)
    {
        //collaboratively load the elements. Each thread loads a single element.
        ta[ty][tx] = d_a[row*dim + TILE_WIDTH*i + tx];
        tb[ty][tx] = d_b[bx*TILE_WIDTH*dim + TILE_WIDTH*i + ty*dim + tx];

        __syncthreads();
        for (j = 0; j < TILE_WIDTH; j++)
            res = res + ta[ty][j]*tb[tx][j];

        __syncthreads();
    }
    d_c[row*dim + col] = res;
}


__global__ void MatrixMult_naive(float *d_a, float *d_b, float *d_c, int dim)
{
    int row, col, i;

    col = blockIdx.y*blockDim.y + threadIdx.y;
    row = blockIdx.x*blockDim.x + threadIdx.x;

    float res;

    if (row < dim && col < dim)
    {
        res = 0;
        for (i = 0; i < dim; i++)
            res = res + (d_a[row*dim + i]*d_b[i*dim + col]);

        d_c[row*dim + col] = res;
    }
}





int main()
{
    mat_mul();
    return 0;
}

void mat_mul()
{
    // Host driver: allocation, kernel launches, timing and verification.
    // Its body is not shown here; see the sketch after this listing.
}
void allocate_matrix(float *h_a, float *h_b, int matDim)
{
    int i, j;
    // Initialize the host input matrices
    for (i = 0; i < matDim; i++)
    {
        for (j = 0; j < matDim; j++)
        {
            h_a[i*matDim + j] = rand()%10;
            h_b[i*matDim + j] = rand()%10;
        }
    }
}





void print_matrix(float *ha, int m, int n)
{
    int i, j;
    for (i = 0; i < m; i++)
    {
        for (j = 0; j < n; j++)
            printf(" %.1f ", ha[i*n + j]);

        printf("\n");
    }
}



void transpose_matrix(float *h_a, int matDim)
{
    int i, j;
    float temp; // float so the swapped values are not truncated
    for (i = 0; i < matDim; i++)
    {
        for (j = 0; j < i; j++)
        {
            temp = h_a[i*matDim + j];
            h_a[i*matDim + j] = h_a[j*matDim + i];
            h_a[j*matDim + i] = temp;
        }
    }
}





void verify(float *h_c, float *h_c_check, int matDim)
{
    int i, j;
    //check the code
    for (i = 0; i < matDim; i++)
    {
        for (j = 0; j < matDim; j++)
        {
            if (fabs(h_c[i*matDim + j] - h_c_check[i*matDim + j]) > 1e-5)
            {
                printf("cpu : %f , gpu : %f\t", h_c[i*matDim + j], h_c_check[i*matDim + j]);
                fprintf(stderr, "Result verification failed at element %d,%d !\n\n", i, j);
                exit(EXIT_FAILURE);
            }
        }
    }

    printf("Test PASSED\n");
}




MatrixMult_tiling_coalesced and MatrixMult_tiling are the functions with and without coalesced memory access to the elements of B, respectively.

Now, the problem is that the time taken by MatrixMult_tiling_coalesced is almost double the time taken by MatrixMult_tiling.
I understand that in MatrixMult_tiling the elements are loaded into the tiles in a coalesced manner (i.e. in row-major order) for each tile, but the tiles themselves are arranged along a column, whereas the tiles in MatrixMult_tiling_coalesced are arranged along a row, so MatrixMult_tiling_coalesced should be faster than the other one.
But in practice I can see the opposite is true.
I would appreciate it if someone could point out the reason.
Thanks in advance.




EDIT 1:
After the answer of Robert (see below) I understand that the problem was in the load operation during the element-wise multiplication. I changed


tb[ty][tx]=d_b[bx*TILE_WIDTH*dim + TILE_WIDTH*i+ty*dim+tx];



to


tb[tx][ty]=d_b[bx*TILE_WIDTH*dim + TILE_WIDTH*i+ty*dim+tx];



and


res=res+ta[ty][j]*tb[tx][j];



to


res=res+ta[ty][j]*tb[j][tx];



This increased the performance of the MatrixMult_tiling_coalesced function from 1500 ms to 1000 ms. However, the MatrixMult_tiling function takes only 879 ms. So, the coalesced routine is still slower. I don't understand where the problem is now.



EDIT 2:
I realized that in EDIT 1 I had just moved the bank-conflict problem from the element-wise multiplication to the element-loading section. The following changes to the code have no bank conflict,


tb[tx][ty]=d_b[bx*TILE_WIDTH*dim + TILE_WIDTH*i+ty*dim+tx];



to


tb[ty][tx]=d_b[bx*TILE_WIDTH*dim + TILE_WIDTH*i+ty*dim+tx];



And


res=res+ta[ty][j]*tb[j][tx];



to


res=res+ta[ty][j]*tb[ty][j];



But it is still slightly slower than the MatrixMult_tiling function. The MatrixMult_tiling_coalesced function takes 982 ms vs 870 ms for the MatrixMult_tiling function. It should be at least similar to MatrixMult_tiling, if not faster.



FINAL EDIT:



Edit 2 does not produce a correct result (with tb[ty][j] the sum no longer depends on tx, so every thread in a row of the block computes the same value), so the code with Edit 1 is the optimum. Transposing one of the multiplicand matrices is probably not a good idea. :-(



Thanks all for helping.




1 Answer



B certainly isn't the matrix I would transpose in C=AB. But that is neither here nor there.





I'm not sure why you think:



in the tiled implementation too the access to the global memory is not in coalesced order



I don't see any lines of code in your MatrixMult_tiling that result in uncoalesced access.





Just to make sure we don't trip over terminology, "coalesced" or "uncoalesced" are terms that we apply to access patterns to global memory (not shared memory). Your global memory access patterns are in these lines in your tiled kernel:


ta[ty][tx]=d_a[row*dim+TILE_WIDTH*i+tx];
tb[ty][tx]=d_b[(ty+i*TILE_WIDTH)*dim+col];
...
d_c[row*dim+col]=res;



and none of those patterns to global memory are uncoalesced. In each of the generated indices into d_a, d_b and d_c, if you perform the substitutions, you will find that the threadIdx.x variable is present in all of them and is not multiplied by any value, constant or otherwise. Therefore these patterns will all coalesce (nicely).
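
To make that concrete, here is a small standalone illustration (mine, not part of the answer) that expands the d_b index from MatrixMult_tiling for the threads of one warp. With a 16x16 block a warp covers two consecutive ty values; within each group of 16 lanes, ty and i are fixed while tx runs from 0 to 15, so the generated indices are consecutive, which is exactly the pattern the hardware coalesces. The values of dim, bx and i below are arbitrary assumptions.

// Illustration only: expand the index d_b[(ty + i*TILE_WIDTH)*dim + col] used in
// MatrixMult_tiling for the 32 lanes of one warp of a 16x16 block.
#include <stdio.h>

#define TILE_WIDTH 16

int main(void)
{
    int dim = 1024;                 // assumed (arbitrary) matrix dimension
    int bx  = 3;                    // arbitrary block index in x
    int i   = 5;                    // arbitrary tile iteration

    for (int ty = 0; ty < 2; ty++)              // two rows of the tile = one warp
        for (int tx = 0; tx < TILE_WIDTH; tx++) {
            int col = bx * TILE_WIDTH + tx;
            int idx = (ty + i * TILE_WIDTH) * dim + col;   // index into d_b
            printf("ty=%d tx=%2d -> d_b[%d]\n", ty, tx, idx);
        }
    return 0;   // each group of 16 lanes prints 16 consecutive indices
}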





I will appreciate if someone can point out the reason.



You have done something bad when it comes to shared memory.



In your tiled kernel, your multiplication operation looks like this:


res=res+ta[ty][j]*tb[j][tx];



For this case:


ta[ty][j]



we have a situation where all threads in the warp (which have linearly increasing tx values but the same ty value) are reading the same location in shared memory. This is an "optimal" access pattern - it does not present any bank conflicts, and will be serviced in the shortest possible time.





For this case:


tb[j][tx]



we have a situation where adjacent threads in the warp are reading adjacent locations in shared memory. This is also an "optimal", un-bank-conflicted pattern, and will be serviced in the shortest possible time.



However in your MatrixMult_tiling_coalesced kernel, the corresponding multiplication operation is:




res=res+ta[ty][j]*tb[tx][j];



Again, with this case:


ta[ty][j]



we have a shared memory "broadcast" pattern (all threads in a warp read from the same location) which is optimal and fast. But in this case:


tb[tx][j]



you have actually created columnar access into shared memory. This is the worst possible access pattern for shared memory, and it will result in a 32-way serialization (or possibly 16-way serialization, in the case of your 16x16 threadblocks) of the load process, and definitely worse performance. Why? Remember that for a given load, j is constant across the warp, and tx increases linearly across the warp. Therefore, let's say j is 1 on a particular loop iteration. The threads in warp 0 will read:




tb[0][1], tb[1][1], tb[2][1], tb[3][1], ...



and these locations all belong to a particular "column" of shared memory, i.e. they all belong to the same shared memory bank. This is the worst-case pattern for shared memory.
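
To see that serialization numerically, the following standalone illustration (mine, not part of the answer) counts how many distinct words of tb one warp requests from each shared memory bank for the two inner-loop read patterns, assuming the usual 32 banks of 4-byte words. Since tb is declared as float tb[TILE_WIDTH][TILE_WIDTH], element tb[r][c] lives at word offset r*TILE_WIDTH + c. With this particular 16x16 tile the columnar pattern piles eight distinct words onto a single bank (a 32-wide tile would give the full 32-way case described above), while the row pattern needs at most one word per bank.

// Illustration only: bank usage of tb[j][tx] (row access) vs tb[tx][j] (columnar
// access) for one warp, assuming 32 banks of 4-byte words.
#include <stdio.h>

#define TILE_WIDTH 16
#define NUM_BANKS  32

static void count_bank_pressure(const char *name, int columnar, int j)
{
    int distinct_words_per_bank[NUM_BANKS] = {0};
    int seen[TILE_WIDTH * TILE_WIDTH] = {0};

    for (int lane = 0; lane < 32; lane++) {
        int tx = lane % TILE_WIDTH;                 // 16x16 block: tx repeats per half warp
        int word = columnar ? tx * TILE_WIDTH + j   // tb[tx][j]
                            : j * TILE_WIDTH + tx;  // tb[j][tx]
        if (!seen[word]) {                          // same word twice is a broadcast, not a conflict
            seen[word] = 1;
            distinct_words_per_bank[word % NUM_BANKS]++;
        }
    }
    int worst = 0;
    for (int b = 0; b < NUM_BANKS; b++)
        if (distinct_words_per_bank[b] > worst) worst = distinct_words_per_bank[b];
    printf("%-10s : worst bank serves %d distinct word(s)\n", name, worst);
}

int main(void)
{
    int j = 1;                              // the example iteration used in the answer
    count_bank_pressure("tb[j][tx]", 0, j); // prints 1 -> conflict-free
    count_bank_pressure("tb[tx][j]", 1, j); // prints 8 -> heavily serialized
    return 0;
}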



For completeness, I claim that all of your global memory access patterns in your MatrixMult_tiling_coalesced kernel are also coalesced:




ta[ty][tx]=d_a[row*dim+TILE_WIDTH*i+tx];
tb[ty][tx]=d_b[bx*TILE_WIDTH*dim + TILE_WIDTH*i+ty*dim+tx];
...
d_c[row*dim+col]=res;



so there should be no major difference in the global memory access pattern/activity/efficiency, between your two kernel implementations.



As a side note, I assume this is all a learning exercise. If you are interested in high-performance matrix multiply on the GPU, I would encourage you to consider using CUBLAS.
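
For example, a single cublasSgemm call replaces both kernels. A minimal sketch (mine, not from the answer) for the row-major square matrices used here could look like this; it relies on the standard trick of swapping the operands so that cuBLAS's column-major SGEMM produces a row-major C = AB, and it assumes the device pointers are already populated. Link with -lcublas.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: C = A*B with cuBLAS for row-major dim x dim matrices already on the device.
// cuBLAS works in column-major order; computing (column-major) C^T = B^T * A^T by
// passing d_b first and d_a second yields the row-major product C = A*B in d_c.
void matmul_cublas(const float *d_a, const float *d_b, float *d_c, int dim)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                dim, dim, dim,
                &alpha,
                d_b, dim,
                d_a, dim,
                &beta,
                d_c, dim);

    cublasDestroy(handle);
}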





Thank you very much for your answer. The shared memory access pattern was the problem. I have changed it so that it avoids the memory bank conflicts (please see the edit above). However, MatrixMult_tiling_coalesced is still slower than the MatrixMult_tiling routine. Is there still some other problem?
– Rick
Aug 6 at 14:55





From the edit to your question, this: tb[tx][ty]=d_b[bx*TILE_WIDTH*dim + TILE_WIDTH*i+ty*dim+tx]; is now also a bank-conflicted load pattern (into tb), for exactly the same reason as described in my answer. I wouldn't suggest doing that either. Perhaps what you're discovering here is that the transpose operation you think might be a good idea perhaps is not, for the shared memory ("tiled") implementation. The gyrations you are going through are not helping.
– Robert Crovella
Aug 6 at 15:14








Hi @Robert, sorry for the inconvenience. Yes, I realized that just after posting the comment. I had just moved the bank-conflict problem from the element-wise multiplication to the element-loading section. However, I have just posted another edit which has neither bank conflict, yet the code is still a little bit slower than the MatrixMult_tiling function. Again, I thank you for your patience.
– Rick
Aug 6 at 15:21







To be clear, every load or store operation in the original tiled shared memory implementation is optimal. All of the global loads and stores are coalesced, and for shared memory, they are either adjacent or broadcast accesses. I'm not sure how you think you are going to improve on that by transposing.
– Robert Crovella
Aug 6 at 15:22





Yes, I understood that. It's my inexperience that made me overlook this. Anyway, thank you again for walking me through the fundamentals of shared memory. However, even when I remove all the bank conflicts (EDIT 2) the code is still slower. I would be grateful if you could look into that. Again, thank you for your help. :-)
– Rick
Aug 6 at 15:37






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
