To speed up my project, I want to store a value that is generated inside a kernel into shared memory. However, I found that storing that value takes a very long time. If I remove THIS LINE (see code below), storing the value is very fast (a 100 times speed-up!).
    extern __shared__ int sh_try[];

    __global__ void xxxKernel (...)
    {
        float v, e0, e1;
        float t;
        int count(0);
        for (...)
        {
            v = fetchTexture();
            e0 = fetchTexture();
            e1 = fetchTexture();
            t = someDeviceFunction(v, e0, e1);
            if (t > 0.0 && t < 1.0)   // <========== THIS LINE
                count++;
        }
        sh_try[threadIdx.x] = count;
    }

    main()
    {
        // sth...
        // START TIMING
        xxxKernel<<<gridDim.x, BlockDim.x, BlockDim.x*sizeof(int)>>> (...);
        cudaDeviceSynchronize();
        // END TIMING
        // sth...
    }

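The START TIMING / END TIMING placeholders above could be filled in with CUDA events. The following is only a minimal sketch under stated assumptions: `dummyKernel` and `timeKernelMs` are hypothetical names, and the kernel is a trivial stand-in for `xxxKernel`. It copies the shared-memory value out to global memory so that the store has an observable effect.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for xxxKernel: writes to dynamically sized shared
// memory, then copies the value to global memory so the store is observable.
__global__ void dummyKernel(int *out)
{
    extern __shared__ int sh_try[];
    sh_try[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = sh_try[threadIdx.x];
}

// Times one launch of dummyKernel with CUDA events and returns milliseconds.
float timeKernelMs(int threads)
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, threads * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                                    // START TIMING
    dummyKernel<<<1, threads, threads * sizeof(int)>>>(d_out);
    cudaEventRecord(stop);                                     // END TIMING
    cudaEventSynchronize(stop);                                // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}

int main()
{
    printf("kernel time: %.3f ms\n", timeKernelMs(128));
    return 0;
}
```

The first launch in a process usually includes one-time setup cost, so it may be worth timing a second launch as well before comparing numbers.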
To isolate the problem, I simplified my code so that it just stores the data into shared memory and stops. As far as I know, shared memory is the fastest memory besides registers, so I wonder whether this high latency is normal or whether I've done something wrong. Please give me some advice! Thank you guys in advance!
trudi
Update:
When I replace shared memory with global memory, it takes almost the same amount of time: 33 ms without THIS LINE, 297 ms with it. Is it normal that storing data to global memory takes the same amount of time as storing to shared memory? Is that also a 'compiler optimization'?
I have also checked other, similar questions on StackOverflow, where there is a huge time gap between storing data into shared memory and not storing it. That gap may be caused by compiler optimization: since it is pointless to calculate data but not store it, the compiler just 'removes' the pointless code.
I am not sure whether the same reason applies to my case, since the line that changes the game is a condition (THIS LINE). When I comment it out, the variable count increases in every iteration; when I uncomment it, count increases only when t is meaningful.
Any ideas? Please...
Generate the PTX with nvcc -ptx mycode.cu for the 2 cases. In the output, the kernel begins at a line like .entry _Z6xxxKernelILi2EEvPj() {. Afterwards, the body in assembler-like code follows.
Without THIS LINE, the compiler does not need to call someDeviceFunction, because the computed value t has no effect on the behavior of your code. So the compiler can optimize out that call. To compare the PTX, just generate it from the code with and without the line, and compare the difference in the length of the file or the total number of instructions.
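To make the comparison concrete, here is a minimal sketch that puts the two cases side by side in one file. Everything in it is an assumption for illustration: the body of `someDeviceFunction` is a made-up stand-in, plain global loads replace the texture fetches, and the names `withCondition`, `withoutCondition` and `countOnThread0` are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Made-up stand-in for someDeviceFunction from the question.
__device__ float someDeviceFunction(float v, float e0, float e1)
{
    return v * e0 + e1;
}

// Case 1: t feeds the branch, so the loads and the call must be kept.
__global__ void withCondition(const float *in, int n, int *out)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        float t = someDeviceFunction(in[i], 1.0f, 0.0f);
        if (t > 0.0f && t < 1.0f)   // THIS LINE
            count++;
    }
    out[threadIdx.x] = count;
}

// Case 2: t is never used, so the compiler is free to drop the call and
// the loads; the loop then only counts iterations.
__global__ void withoutCondition(const float *in, int n, int *out)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        float t = someDeviceFunction(in[i], 1.0f, 0.0f);
        (void)t;                    // result intentionally unused
        count++;
    }
    out[threadIdx.x] = count;
}

// Runs one of the two kernels on inputs 0, 0.5, 1.0, ... and returns the
// count computed by thread 0.
int countOnThread0(bool useCondition, int n)
{
    const int threads = 32;
    std::vector<float> h_in(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = 0.5f * i;

    float *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, threads * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    if (useCondition)
        withCondition<<<1, threads>>>(d_in, n, d_out);
    else
        withoutCondition<<<1, threads>>>(d_in, n, d_out);
    cudaDeviceSynchronize();

    int h_count = 0;
    cudaMemcpy(&h_count, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_count;
}

int main()
{
    printf("with condition:    %d\n", countOnThread0(true, 4));
    printf("without condition: %d\n", countOnThread0(false, 4));
    return 0;
}
```

With n = 4 the inputs are 0, 0.5, 1.0 and 1.5, so only one of them lies strictly between 0 and 1, while the second kernel simply counts all four iterations. Compiling this file with nvcc -ptx and comparing the two .entry bodies should show whether the loads and the arithmetic survive in each kernel; the exact instructions depend on the compiler version and target architecture.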