CUDA, low performance in storing data in shared memroy

Question

In order to speed up my project, I want to store a value which was generated inside a kernel into shared memory. However, I found it takes such a long time to store that value. If I remove THIS LINE (see code below), it is very fast to store that value (100 times speed-up!).

extern __shared__ int sh_try[];

__global__ void xxxKernel (...)
{
  float v, e0, e1;
  float t;
  int count(0);
  for (...)
  {
     v = fetchTexture();
     e0 = fetchTexture();
     e1 = fetchTexture();
     t = someDeviceFunction(v, e0, e1);
     if (t>0.0 && t < 1.0)  <========== <THIS LINE>
       count++;
  }
  sh_try[threadIdx.x] = count;
}

main()
{
  sth..
  START TIMING:

  xxxKernel<<<gridDim.x, BlockDim.x, BlockDim.x*sizeof(int)>>> (...);
  
  cudaDeviceSynchronize();

  END TIMING.
  sth...
}

In order to figure out this problem, I simplified my code that just stores the data into shared memory and stop. As I know shared memory is the most efficient memory besides registers, I wonder if this high latency is normal or if I've done something wrong. Please give me some advice! Thank you guys in advance!

trudi

Update: When I replace shared memory with global memory, it takes almost the same amount of time, 33ms without THIS LINE, 297ms with it. Is it normal that storing data to global memory takes the same amount of time as storing to shared memory? Is that also a 'compiler optimization'?

I have also checked other, similar problems on StackOverflow, i.e., there is a huge time gap between storing data into shared memory or not, which may be caused by compiler optimization, since it is pointless to calculate data but not store it, so the compiler just 'removed' that pointless code.

I am not sure if I share the same reason, since the line changes the game is a hypothesis - THIS LINE, when i comment it out, the variable count increases in every iteration, when I uncomment it, it increases when t is meaningful.

Any ideas? Please...

Most likely when you see a large speed difference like this when changing a single line of code, it is because the compiler was able to optimize out a big chunk of code. Since your kernel is only storing data in shared memory, it's not doing anything useful. The compiler can detect this and essentially replace it with an empty kernel. You can see the difference by looking at the code output using nvcc -ptx mycode.cu for the 2 cases. — Robert Crovella, Commented Apr 5, 2013 at 7:48
Use "@name" to notify commenters. ptx-file is somehow readable. The main thing to check is the body of your function. It should start as .entry _Z6xxxKernelILi2EEvPj() {. Afterwards, body in assembler-like code follows. — stuhlo, Commented Apr 5, 2013 at 9:40
@stuhlo, thanks for replying. Maybe i should first figure out how to compile with nvcc -ptx. I got a error 'cannot find cutil_inline.h' any idea? — trudiQ, Commented Apr 5, 2013 at 13:13
@Robert Crovella, thanks for your reply. Sorry i am new in CUDA, may i ask how to check code output using nvcc-ptx mycode.cu? — trudiQ, Commented Apr 5, 2013 at 13:15
Yes, even with the global operation, adding or removing THIS LINE allows the compiler to get rid of pieces of code. For example, without the line of code, there is no need to call someDeviceFunction, because the computed value t has no effect on the behavior of your code. So the compiler can optimize out that call. To compare the ptx, just generate the ptx using the code with and without the line, and compare the difference in the length of the file or total number of instructions. — Robert Crovella, Commented Apr 5, 2013 at 13:46

Robert Crovella · Accepted Answer · 2013-11-16 13:57:02Z

Frequently, when large performance changes are seen as a result of relatively small code changes (such as adding or deleting a line of code in a kernel), the performance changes are not due to the actual performance impact of that line of code, but are due to the compiler making different optimization decisions, which can result in wholesale additions or deletions of machine code in your kernels.

A relatively easy way to help confirm this is to look at the generated machine code. For example, if the size of the generated machine code changes substantially due to the addition or deletion of a single line of source code, it may be the case that the compiler made an optimization decision that drastically affected the code.

Although it's not machine code, for these purposes a reasonable proxy is to look at the generated PTX code, which is an intermediate code that the compiler creates.

You can generated ptx by simply adding the -ptx switch to your compile command:

nvcc -ptx mycode.cu

This will generate a file called mycode.ptx which you can inspect. Naturally if your regular compile command requires extra switches (e.g -I/path/to/include/files) then this command may require those same switches. The nvcc manual provides more information on code generation options, and there is a PTX manual to help you learn about PTX, but you may be able to get a rough idea just based on the size of the generated PTX (e.g. number of lines in the .ptx file).

Collectives™ on Stack Overflow

CUDA, low performance in storing data in shared memroy

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged
cuda
gpu-shared-memory
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged cudagpu-shared-memory or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
cuda
gpu-shared-memory
or ask your own question.