I ran Nsight Compute (from CUDA 11.2) on my CUDA kernel.
It reports that SOL SM is at 79.44%, which I interpret as being pretty close to the maximum. SOL L1 is at 48.38%.
When I examine the Roofline chart, I see that my measured result is very far away from peak performance.
Achieved: 4.7 GFlop/s.
Peak at roofline: 93 GFlop/s or so.
I also see ALU pipe utilization above 80%.
So, if the ALU pipe is fully utilized, why is the achieved performance so much lower, according to the roofline chart?
Note that this is on an RTX 3070, which peaks at 17.6 TFlop/s for single precision.
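To put those figures side by side, here is a rough sketch of the usual roofline formula, attainable = min(peak compute, arithmetic intensity × memory bandwidth). It assumes the 93 GFlop/s figure is the roofline value at the kernel's arithmetic intensity, and it assumes a memory bandwidth of 448 GB/s (the nominal RTX 3070 spec, not a value taken from the profiler report). The function and constant names are made up for illustration.

```python
def attainable_gflops(ai_flop_per_byte, peak_gflops, bandwidth_gb_s):
    """Classic roofline: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, ai_flop_per_byte * bandwidth_gb_s)

# Numbers quoted in the question.
PEAK_FP32_GFLOPS = 17_600.0        # 17.6 TFlop/s single precision
ROOFLINE_AT_KERNEL_GFLOPS = 93.0   # "Peak at roofline" read off the chart
ACHIEVED_GFLOPS = 4.7

# ASSUMPTION: nominal RTX 3070 memory bandwidth, not a measured value.
ASSUMED_BANDWIDTH_GB_S = 448.0

# Arithmetic intensity implied by a 93 GFlop/s ceiling under that assumption.
implied_ai = ROOFLINE_AT_KERNEL_GFLOPS / ASSUMED_BANDWIDTH_GB_S

# Intensity at which the memory roof meets the compute roof (the ridge point).
ridge_ai = PEAK_FP32_GFLOPS / ASSUMED_BANDWIDTH_GB_S

# How much of the local roofline ceiling the kernel reaches.
fraction_of_roofline = ACHIEVED_GFLOPS / ROOFLINE_AT_KERNEL_GFLOPS

print(f"implied arithmetic intensity: {implied_ai:.3f} FLOP/byte")
print(f"ridge point:                  {ridge_ai:.1f} FLOP/byte")
print(f"achieved / roofline:          {fraction_of_roofline:.1%}")
```

Under these assumptions the kernel sits far to the left of the ridge point, so the chart treats it as memory bound and reports it at only a few percent of even that lower ceiling.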
UPDATE
I think I figured out what is going on here. @robert-crovella put me on the right track by pointing out that the ALU pipe handles integer ops, which are not included in the roofline. And those are not the only operations that are left out!
The roofline charts in this version of Nsight Compute only count fp32 and fp64 operations, not fp16 operations.
My code works with half-precision floats, so I suspect the roofline chart simply does not apply to it.
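A small sketch of why this would make the achieved figure look so low, assuming the chart derives FLOPs from fp32 add, multiply and fused multiply-add instruction counts (with an FMA counted as two operations). The instruction counts below are entirely made up for illustration and the dictionary keys are hypothetical labels, not profiler metric names.

```python
# HYPOTHETICAL instruction mix for a half-precision kernel (made-up numbers).
counts = {
    "fadd": 500, "fmul": 500, "ffma": 1_000,           # fp32 instructions
    "hadd": 10_000, "hmul": 10_000, "hfma": 40_000,    # fp16 instructions
    "int": 50_000,                                     # integer / ALU work
}

def flops(counts, add, mul, fma):
    """Floating-point operations, counting each fused multiply-add as two."""
    return counts.get(add, 0) + counts.get(mul, 0) + 2 * counts.get(fma, 0)

fp32_flops = flops(counts, "fadd", "fmul", "ffma")   # what an fp32 roofline sees
fp16_flops = flops(counts, "hadd", "hmul", "hfma")   # invisible to that roofline

# Share of the kernel's floating-point work that an fp32-only chart counts.
counted_share = fp32_flops / (fp32_flops + fp16_flops)

print(f"fp32 FLOPs counted: {fp32_flops}")
print(f"fp16 FLOPs ignored: {fp16_flops}")
print(f"share of work shown on the chart: {counted_share:.1%}")
```

With a mix like this, the pipes can be busy while the chart reports only a small fraction of the work, which is consistent with high SM utilization next to a very low achieved GFlop/s.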