NSIGHT compute: SOL SM versus Roofline

Question

I ran Nsight Compute (CUDA 11.2) on my CUDA kernel.

It reports that SOL SM is at 79.44%, which I interpret as being pretty close to the maximum. SOL L1 is at 48.38%.

When I examine the Roofline chart, I see that my measured result is very far away from peak performance.

Achieved: 4.7 GFlop/s.

Peak at roofline: 93 GFlop/s or so.

I also see ALU pipe utilization above 80%.

So, if the ALU pipe is fully utilized, why is the achieved performance so much lower, according to the roofline chart?

Note that this is on an RTX 3070, which peaks at 17.6 TFlop/s for single precision.
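To put numbers on the gap, here is a minimal Python sketch of the roofline arithmetic using the figures quoted above. The memory bandwidth of 448 GB/s is an assumption (the RTX 3070 spec-sheet value, not something reported in this question), and Nsight Compute may use a different bandwidth for its chart, so treat the implied arithmetic intensity as a rough estimate only.

```python
# Figures quoted in the question, all in GFlop/s.
achieved = 4.7            # measured point on the roofline chart
roofline_ceiling = 93.0   # ceiling at the kernel's arithmetic intensity ("93 or so")
peak_fp32 = 17.6e3        # 17.6 TFlop/s single-precision peak, as stated

# ASSUMPTION: spec-sheet DRAM bandwidth in GB/s, not taken from the profiler.
bandwidth = 448.0


def roofline(intensity, peak, bw):
    """Classic roofline model: attainable = min(peak, intensity * bandwidth).

    intensity is in FLOP/byte, peak in GFlop/s, bw in GB/s.
    """
    return min(peak, intensity * bw)


# How far the measurement sits below the ceiling and below the absolute peak.
gap_to_ceiling = roofline_ceiling / achieved
gap_to_peak = peak_fp32 / achieved

# Arithmetic intensity at which the sloped part of the roof reaches 93 GFlop/s,
# under the assumed bandwidth.
implied_intensity = roofline_ceiling / bandwidth

print(f"gap to ceiling at this intensity: {gap_to_ceiling:.1f}x")
print(f"gap to absolute FP32 peak:        {gap_to_peak:.0f}x")
print(f"implied arithmetic intensity:     {implied_intensity:.3f} FLOP/byte")
print(f"roofline at that intensity:       {roofline(implied_intensity, peak_fp32, bandwidth):.1f} GFlop/s")
```

Because the ceiling at the kernel's intensity (93 GFlop/s) is far below the compute peak, the chart places the kernel on the sloped, bandwidth-limited part of the roof, and the measured point is still roughly a factor of 20 below that.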

UPDATE

I think I figured out what is going on here. @robert-crovella put me on the right track by pointing out that the ALU pipe executes integer ops, which are therefore not included in the roofline. And those are not the only operations that are left out!

Roofline charts only show fp32 and fp64 operations and not fp16 operations.

My code works with half-precision floats, making the Roofline chart not applicable for my code, I suspect.
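As a rough illustration of why such a chart can look so pessimistic, the sketch below counts FLOPs the way an FP32-only roofline would, and compares that with a count that also includes half-precision work. The instruction mix is entirely hypothetical (made up for illustration, not measured from my kernel), and the per-instruction weights assume the usual convention that an FMA counts as two operations and that a packed half2 instruction operates on two lanes.

```python
# HYPOTHETICAL instruction counts for one kernel launch (not profiler output).
instruction_counts = {
    "fadd": 1_000_000,    # FP32 add
    "fmul": 500_000,      # FP32 multiply
    "ffma": 2_000_000,    # FP32 fused multiply-add
    "hfma2": 40_000_000,  # packed half2 fused multiply-add
    "imad": 30_000_000,   # integer multiply-add (not floating point)
    "lop3": 20_000_000,   # logic op (not floating point)
}

# ASSUMED weights: FLOPs contributed per executed instruction.
FP32_WEIGHTS = {"fadd": 1, "fmul": 1, "ffma": 2}
FP16_WEIGHTS = {"hadd2": 2, "hmul2": 2, "hfma2": 4}  # two lanes per instruction


def count_flops(counts, weights):
    """Sum FLOPs over instructions that have a weight; all others count as zero."""
    return sum(n * weights.get(op, 0) for op, n in counts.items())


fp32_flops = count_flops(instruction_counts, FP32_WEIGHTS)
fp16_flops = count_flops(instruction_counts, FP16_WEIGHTS)
total_instructions = sum(instruction_counts.values())

print(f"FLOPs seen by an FP32-only count: {fp32_flops:,}")
print(f"FLOPs including half precision:   {fp32_flops + fp16_flops:,}")
print(f"instructions executed in total:   {total_instructions:,}")
```

With a mix like this, the pipes can be busy almost all the time while an FP32-only chart registers only a small fraction of the arithmetic that was actually done, since the integer, logic, and half-precision instructions contribute nothing to its FLOP count.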

I think you need to check the magnitude of the numbers in your question because they don't look even remotely realistic — talonmies, Commented Jan 9, 2021 at 1:45
@talonmies Ah, thank you. Indeed: I read the values wrong, off by a factor 1000. It was GFlops, not TFlops. Still my observation stands: The SoL says: close to max. The RoofLine says: not even remotely close to max. — Bram, Commented Jan 9, 2021 at 2:43

Robert Crovella · Accepted Answer · 2021-01-09 05:15:35Z

"So, if the ALU pipe is fully utilized, why is the achieved performance so much lower, according to the roofline chart?"

Because the ALU pipe has nothing to do with floating point and the roofline chart is essentially only about floating point.

As indicated in the answer I linked, the ALU pipe handles:

"most integer instructions, bit manipulation instructions, and logic instructions"

It's quite possible that the utilization of this pipe is what limits the performance of your kernel. As a result, you are running at a lower floating-point throughput (FLOP/s) than what might otherwise be achievable, i.e. the roofline.

The items that are (FP32/FP64) floating-point related are fma, fmaheavy, fp32, and potentially Tensor. These are all at around 40% or below active, so you're not maxing out any of those pipes.

Thanks. Good points. However, 40% of peak is still far away from what the roofline is showing, where there is a much larger delta: a factor 20 for the current arithmetic density. And a much much larger factor if we look at the absolute peak performance possible. — Bram, Commented Jan 9, 2021 at 6:10
