0 votes
0 answers
18 views

The curious gap in time cost for QKV computation in LLM inference

I use Nsight Systems to profile the LLM inference process in the HuggingFace Transformers framework. I observe that the time for q_proj, k_proj and v_proj varies significantly. As far as I know, the Q, K ...
CarryPls's user avatar
0 votes
1 answer
61 views

What do the Instruction Statistics fields in Nsight Compute mean? How do they relate to elapsed cycles?

In my example, what is the meaning of 'Executed Instructions'? Taken literally, it would mean how many instructions have been executed. But how does it relate to the total run time (...
sorfkc's user avatar
  • 13
0 votes
1 answer
96 views

How can I create a container in which to use the Nvidia Nsight Systems graphical interface?

I am looking to create a container in which I can work with the graphical interface of the Nvidia Nsight Systems tool, so that I can obtain application reports with CUDA and Python. I have found ...
Gota_12's user avatar
  • 23
1 vote
1 answer
304 views

How to check my tensor core occupancy and utilization with Nsight Compute?

In my CUDA program, I use many tensor core operations like m8n8k4 and even use cusparseSpMV. However, when checking the ncu report, it shows this: there are no active tensors in my program. The ...
Severus Snape's user avatar
1 vote
0 answers
149 views

How to generate a roofline analysis with Nsight?

When I was trying to analyze the performance of a kernel, I used the ncu command to generate a report. However, it didn't display the roofline analysis under the section "GPU Speed of Light Throughput"...
Severus Snape's user avatar
0 votes
0 answers
26 views

Nsight Compute + Roofline chart

I am new to using Nsight Compute and have a question about the roofline chart. When I profile different kernels on Nsight Compute and view their roofline charts, nothing is shown for some kernels, ...
Sahar M's user avatar
0 votes
0 answers
76 views

Incompatible Qt library when running Nsight Compute ncu-ui

I am using Ubuntu 22.04 (x86_64 architecture) and my goal is to run the NVIDIA Nsight Compute ncu-ui command to visualize some GPU performance profiling outcomes. When I run ncu-ui, the following message ...
chchien's user avatar
1 vote
1 answer
714 views

How to get CUDA syntax highlighting in the Nsight VSCode extension when the CUDA toolkit is installed by Conda?

I'm using Fedora 39 and installed cudatoolkit using conda install in a conda env (not base). When inside the conda env, I can do nvcc foo.cu && ./a.out and it works fine. (when I do which nvcc,...
xdavidliu's user avatar
  • 3,017
3 votes
0 answers
116 views

Compute and Data transfer not happening concurrently in cuda Streams on Iteration 2

I have written a basic program where a chunk of data is loaded in CPU memory (Pinned), and then I transfer it in chunks to GPUs (Asynchronously), and then do computation on each chunk. So for each ...
Lokesh's user avatar
  • 31
1 vote
1 answer
366 views

Problems when profiling LLM training using "huggingface/accelerate" with Nsight Systems

I am training the Llama model in a multi-node environment using huggingface/accelerate, and if I run it as follows to profile it, the program will die due to a problem with the ssh connection to ...
상현박's user avatar
1 vote
0 answers
131 views

How to debug an OpenGL shader per pixel in Nsight Graphics, like in RenderDoc

In RenderDoc it's easy, but this feature doesn't support OpenGL, so I tried to use Nsight. But I don't know how to reproduce this operation in Nsight. Its interface is too complex, and I also ...
bad apple's user avatar
0 votes
0 answers
102 views

CUDA profiling of aten::mul: does it include only calculation time, or also the time to access memory?

The following results were obtained by PyTorch CUDA profiling. ---------------------------------------------------------------------------- Name Self CPU% ... Self CUDA ...
kmkm's user avatar
  • 1
0 votes
0 answers
183 views

How do simple warps cause low warp occupancy and high register usage?

During the warp occupancy investigation of my gbuffer pass, I found that even if I simplify the scene and the shader, Nsight still reports a very low warp occupancy, even much lower than the ...
painkiller's user avatar
1 vote
0 answers
277 views

Can NVIDIA Nsight still be used to debug shaders?

Numerous online resources claim it is possible to debug OpenGL shaders using NVIDIA Nsight Visual Studio Edition. Here is an old video of it being done. However, the Nsight VSE page mentions "the ...
Leon Frickenschmidt's user avatar
1 vote
0 answers
46 views

How to capture a baker program running without a rendering window using Nsight

I'm writing a DXR baker program. Since it's just a baker generating ray traced results to a buffer, I didn't write any rendering window for it. It just keeps calling DXR's dispatchRay API in a ...
Wood's user avatar
  • 975
1 vote
1 answer
271 views

CUDA math function register usage

I am trying to understand the significant register usage incurred when using a few of the built-in CUDA math ops like atan2() or division and how the register usage might be reduced/eliminated. I'm ...
Chris Uchytil's user avatar
2 votes
1 answer
660 views

Roofline Model with CUDA Manual vs. Nsight Compute

I have a very simple vector addition kernel written for CUDA. I want to calculate the arithmetic intensity as well as GFLOP/s for this Kernel. The values I calculate differ visibly from the values ...
Cherry Toska's user avatar
1 vote
1 answer
440 views

Power Usage Profiling in Nsight?

New to Nsight and GPU programming. I need a way to evaluate the effect my code has on power usage in the GPU. This article from 2013 shows that the feature was part of Nsight's toolset at some point, ...
Lauren Vk's user avatar
0 votes
2 answers
10k views

Nsys CLI profiling guidance

I am just entering the CUDA development world and am now trying to profile my code. I expected to run the nvprof tool for profiling, but got the following error: ======== Warning: This version of ...
dru10's user avatar
  • 33
0 votes
1 answer
902 views

How to use the ncu command to profile average time/usage/etc for a kernel repeating 10 times?

For example, I have a test program for 5 kernels: int main() { for (int i = 0; i < 10; i++){ kernel_1<<<...>>>(...); // warm up } for (int i = 0; i < 10; i++...
thanksarose's user avatar
1 vote
1 answer
2k views

Trouble using Nsight Compute on Google Colab: 'command not found' error with ncu and installation script error with Nsight Compute

I am trying to use ncu on Colab; however, when I type ncu I get: /bin/bash: ncu: command not found. A few days ago this command was working fine. I am unsure if I am making some mistakes in the code or if it ...
Alessandro Bossi's user avatar
1 vote
2 answers
1k views

How to get average execution time of CUDA kernel using NSight Systems or NSight Compute

Suppose I have a simple CLI test app named "Foo". This app executes a kernel "Bar" 100 times in a loop. How may I obtain an average kernel execution time for Bar, using Nsight ...
Tyson Hilmer's user avatar
0 votes
1 answer
760 views

Error in profiling shared memory atomic kernel in Nsight Compute

I am trying the global atomics vs shared atomics code from the NVIDIA blog https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/ But when I try to profile with ...
yolo_ML's user avatar
  • 15
0 votes
1 answer
3k views

Failed to open dynamic library RTSSVkLayer64.dll when using NVIDIA Nsight Graphics to debug app

I am writing a tiny game app in Rust, using Vulkan as the graphics API. Debugging my app in RenderDoc works perfectly, but something went wrong when I tried to debug it in NVIDIA Nsight. ...
CrystaLamb's user avatar
1 vote
2 answers
567 views

How do I profile OpenMP offloading code compiled by clang

I am currently working with OpenMP offloading using LLVM/clang-16 (built from the github repository). Using the built-in profiling tools in clang (using environment variables such as ...
Dogyman's user avatar
  • 31
0 votes
1 answer
376 views

How to use CUPTI to get metrics related to Launch Metrics, Source Metrics and Instructions Per Opcode Metrics

I am able to use ncu to get the metrics related to Launch Metrics, Source Metrics and Instructions Per Opcode Metrics (found here). However I am unable to use CUPTI to get the values after modifying ...
BoringSession's user avatar
0 votes
1 answer
313 views

Command to run callback_profiling sample from CUPTI

I am running the sample code available for Nvidia CUDA CUPTI in /usr/local/cuda-11.8/extras/CUPTI/samples/callback_profiling. There is a Makefile, but I want to run it using a single command (without ...
BoringSession's user avatar
0 votes
1 answer
282 views

Difference in SASS using cuobjdump and Nsight Compute

I have a simple kernel as __global__ void hello_cuda() { int a = 10; printf("hello from GPU\n"); } When I use Nsight Compute to see the Source and SASS section, I see: # Address ...
BoringSession's user avatar
0 votes
1 answer
1k views

NSight Compute not showing achieved occupancy in the metrics

I want to calculate the achieved occupancy and compare it with the value that is being displayed in Nsight Compute. ncu says: Theoretical Occupancy [%] 100, and Achieved Occupancy [%] 93,04. What ...
BoringSession's user avatar
0 votes
2 answers
105 views

The number of times to run a profiling experiment

I am trying to profile a CUDA application. I have a basic question about performance analysis and workload characterization of HPC programs. Let us say I want to analyse the wall clock time (the end-to-end ...
punter147's user avatar
  • 312
1 vote
1 answer
818 views

Nsight Compute profiling of a __device__ function in a kernel

I am trying to use Nsight Compute to profile kernels in my CUDA code. But how do I profile functions inside a kernel? Say for example, I have 2 functions (device functions) in a kernel (global). ...
BoringSession's user avatar
2 votes
1 answer
323 views

Can Nsight Systems use debug info URLs?

So I am on Arch Linux and the libraries from the official repositories do not ship with debug symbols. To work around this in most debugging tools, one can use DEBUGINFOD_URLS=https://debuginfod....
TIL's user avatar
  • 35
0 votes
2 answers
679 views

CUDA kernel launched from Nsight Compute gives inconsistent results

I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by: Validating with test data over 100 runs (just in case) Using cuda-memcheck (...
forever__newbie's user avatar
0 votes
1 answer
3k views

Nsys does not show the CUDA kernel profiling output

My system is V100 with the following information: | NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.6 | NVIDIA Nsight Systems version 2021.5.2.53-28d0e6e sudo sh -c “echo 2 >/proc/...
Hossam Amer's user avatar
0 votes
1 answer
2k views

How to see NVTX markers in Nvidia Nsight Systems? With host and guest being the same Windows machine

I am trying to profile CPU/GPU applications using the Nsight suite. Currently trying to understand a stuttering problem, I added a range around the simulation step (taking place on the CPU): #include <...
Ad N's user avatar
  • 8,366
0 votes
1 answer
532 views

Cuda-gdb in vscode, Cannot find user-level thread for LWP 4077: generic error

I am trying to set up CUDA programming in VS Code and ran into this problem where cuda-gdb just returns an error. I tried running it with regular gdb and that works. I am using WSL. Running the "...
William Hofsøy's user avatar
1 vote
1 answer
215 views

OpenGL - Is there a way to track actually used memory allocated by glBufferData / glBufferSubData?

There is a big codebase which allocates empty, fixed-size blocks of GPU memory using the glBufferData function, and partially fills/updates this allocated space using glBufferSubData. Since not all of the ...
user1559792's user avatar
1 vote
2 answers
785 views

VSCode fails to debug Cython wrapped CUDA code (but CLI cuda-gdb can)

Background: Running VSCode on Ubuntu 20.04. The following have been accomplished: (a) Compiled and built the Cython wrapper for CUDA code (packaged as shared library .so); (b) Python script importing ...
CorneliusJack's user avatar
1 vote
0 answers
1k views

How to get detailed Nvidia GPU usage?

Nvidia-smi only provides a few metrics to measure GPU utilization. Most importantly, utilization.gpu represents the percent of time over the past sample period during which one or more kernels was ...
gebbissimo's user avatar
  • 2,579
0 votes
1 answer
929 views

Nsight Graphics and RenderDoc cannot trace application

I am stuck writing a Vulkan renderer. The final output I see on the screen is only the clear color, animated over time, but no geometries. Even with all possible validation turned on I don't get any ...
Samwise's user avatar
  • 96
1 vote
1 answer
116 views

What is the `ipa` pipeline about in the CUDA architecture?

When looking into ncu --query-metrics, it turns out that several counters are about this ipa pipeline that isn't even cited in the Nsight docs, smsp__inst_executed_pipe_ipa for example. While for all of ...
nazavode's user avatar
1 vote
0 answers
316 views

Nvidia Nsight crashes when creating BLAS. What could be the cause?

EDIT: I found the mistake in the code. I mistakenly set up the "max_primive_count" to 3, but it should be 1, since I only wanted to display one single triangle. Also the maxVertex should be ...
Ruslan's user avatar
  • 31
0 votes
1 answer
2k views

nsys profile multiple processes

I'd like to experiment with MPS on Nvidia GPUs, therefore I'd like to be able to profile two processes running in parallel. With the now-deprecated nvprof, there used to be an option "--profile-...
Blaizz's user avatar
  • 57
1 vote
0 answers
340 views

Nsys profile with MPMD (multiple program, multiple data) simulation

I am trying to profile an MPI+OpenACC program with nsys. I am using OpenMPI (3.1.6) from the Nvidia HPC SDK (20.7) with UCX enabled. There are three executables, exec1, exec2, exec3. I want to profile for ...
HEMANT GIRI's user avatar
-2 votes
1 answer
557 views

NSight Compute - expecting bank conflicts but not detecting any

I was trying to detect shared memory bank conflicts for matrix transposition kernels. The first kernel performs matrix transposition without padding, and hence should have bank conflicts, while the ...
loonatick's user avatar
  • 1,107
0 votes
1 answer
2k views

Tracing custom CUDA kernels with Nsight Systems

I work on a library which is implemented in C++20 and CUDA 11. This library is called from Python via ctypes through a C API that just exchanges JSON strings. We compile it using Clang 11. In order to ...
Martin Ueding's user avatar
1 vote
1 answer
1k views

"Start Performance Analysis" button missing on Nsight + Visual Studio

I usually debug my kernel and check timing with the "Start Performance Analysis" button. It showed up when I used CUDA 10.2 with an RTX Titan V, but that button is not shown since I upgraded the CUDA version ...
powermew's user avatar
  • 133
1 vote
1 answer
2k views

NVIDIA Nsight Systems CLI not getting memory statistics

I'm using the NVIDIA Nsight Systems CLI (nsys) to profile a simple CUDA program (vector addition). I've already checked the documentation but I think I'm missing something. I'm running the nsys profile ...
l.g.karolos's user avatar
  • 1,142
-3 votes
1 answer
528 views

CUDA debugging using a single GPU with Visual Studio

I am working on Windows 7, Visual Studio 2010. Can we debug CUDA code using a single GPU which is also providing display to the monitor in the same PC? What tools are available? NSIGHT seems to be working ...
gpuguy's user avatar
  • 4,595
0 votes
1 answer
426 views

VS2019 Nsight extension installed, not showing up in Manage Extension and impossible to disable

I have the Nsight extension installed on VS2019 and it shows up in the menu: Unfortunately, it makes Intellisense unbearably slow, so I would like to disable that extension, however, it doesn't show ...
Damien's user avatar
  • 1,542
