cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication#

NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a sparse matrix:

\[ D = \mathrm{Activation}\bigl(\alpha \, \mathrm{op}(A) \cdot \mathrm{op}(B) + \beta \, \mathrm{op}(C) + \mathrm{bias}\bigr) \]

where op(A) and op(B) denote in-place operations such as transpose or non-transpose, \alpha and \beta are scalars or vectors, Activation is an activation function, and bias is a bias vector.
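The formula above can be read as the following CPU reference. This is a minimal sketch for illustration, not the library's implementation or API: the `Matrix` type and the `reference_matmul` function are hypothetical names introduced here, and the sketch assumes scalar alpha and beta, an identity op(C), ReLU as the activation, and a bias vector with one entry per output row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Dense row-major matrix; a hypothetical helper type for this sketch only.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Read element (i, j) of op(M), where op is either transpose or identity.
inline float op_at(const Matrix& m, bool transpose, std::size_t i, std::size_t j) {
    return transpose ? m.at(j, i) : m.at(i, j);
}

// D = ReLU(alpha * op(A) * op(B) + beta * C + bias)
// Assumptions of this sketch: alpha and beta are scalars, op(C) is the
// identity, bias holds one value per output row, and Activation is ReLU.
Matrix reference_matmul(float alpha, const Matrix& A, bool transA,
                        const Matrix& B, bool transB,
                        float beta, const Matrix& C,
                        const std::vector<float>& bias) {
    const std::size_t m = transA ? A.cols : A.rows;
    const std::size_t k = transA ? A.rows : A.cols;
    const std::size_t n = transB ? B.rows : B.cols;
    assert((transB ? B.cols : B.rows) == k);  // inner dimensions must match
    assert(C.rows == m && C.cols == n);
    assert(bias.size() == m);

    Matrix D(m, n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += op_at(A, transA, i, p) * op_at(B, transB, p, j);
            }
            const float v = alpha * acc + beta * C.at(i, j) + bias[i];
            D.at(i, j) = std::max(v, 0.0f);  // ReLU
        }
    }
    return D;
}
```

A reference of this kind is mainly useful for checking the results of an accelerated run on small inputs; it makes no attempt to exploit sparsity in A or B.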

The cuSPARSELt APIs allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.

Download: developer.nvidia.com/cusparselt/downloads

Provide Feedback: Math-Libs-Feedback@nvidia.com

Examples: cuSPARSELt Example 1, cuSPARSELt Example 2

Key Features#

  • NVIDIA Sparse MMA tensor core support

  • Mixed-precision computation support:

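    Input A/B   Input C   Output D   Compute   Support arch
    ---------   -------   --------   -------   ---------------------
    FP32        FP32      FP32       FP32      SM 8.0, 8.6, 8.7, 9.0
    BF16        BF16      BF16       FP32      SM 8.0, 8.6, 8.7, 9.0
    FP16        FP16      FP16       FP32      SM 8.0, 8.6, 8.7, 9.0
    FP16        FP16      FP16       FP16      SM 9.0
    INT8        INT8      INT8       INT32     SM 8.0, 8.6, 8.7, 9.0
    INT8        INT32     INT32      INT32     SM 8.0, 8.6, 8.7, 9.0
    INT8        FP16      FP16       INT32     SM 8.0, 8.6, 8.7, 9.0
    INT8        BF16      BF16       INT32     SM 8.0, 8.6, 8.7, 9.0
    E4M3        FP16      E4M3       FP32      SM 9.0
    E4M3        BF16      E4M3       FP32      SM 9.0
    E4M3        FP16      FP16       FP32      SM 9.0
    E4M3        BF16      BF16       FP32      SM 9.0
    E4M3        FP32      FP32       FP32      SM 9.0
    E5M2        FP16      E5M2       FP32      SM 9.0
    E5M2        BF16      E5M2       FP32      SM 9.0
    E5M2        FP16      FP16       FP32      SM 9.0
    E5M2        BF16      BF16       FP32      SM 9.0
    E5M2        FP32      FP32       FP32      SM 9.0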

  • Matrix pruning and compression functionalities

  • Activation functions, bias vector, and output scaling

  • Batched computation (multiple matrices in a single run)

  • GEMM Split-K mode

  • Auto-tuning functionality (see cusparseLtMatmulSearch())

  • NVTX ranging and logging functionalities
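To make the pruning and compression feature in the list above more concrete, here is a minimal CPU sketch. It assumes the 2:4 structured-sparsity pattern (at most two nonzeros in every group of four consecutive elements), which is the pattern usually associated with sparse tensor cores for 16-bit and 8-bit types. The names `prune_2_4`, `Compressed24`, `compress_2_4`, and `decompress_2_4` are hypothetical, and the storage layout shown is illustrative only; the compressed format actually used by cuSPARSELt is internal to the library and is not described here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Prune one row in place to a 2:4 pattern: in every group of four consecutive
// elements, zero the two with the smallest magnitude. On ties, the earliest
// element is dropped. Assumes the row length is a multiple of four.
void prune_2_4(std::vector<float>& row) {
    assert(row.size() % 4 == 0);
    for (std::size_t g = 0; g < row.size(); g += 4) {
        bool kept[4] = {true, true, true, true};
        for (int drop = 0; drop < 2; ++drop) {
            std::size_t victim = 4;  // 4 means "none chosen yet"
            for (std::size_t i = 0; i < 4; ++i) {
                if (!kept[i]) continue;
                if (victim == 4 ||
                    std::fabs(row[g + i]) < std::fabs(row[g + victim])) {
                    victim = i;
                }
            }
            kept[victim] = false;
            row[g + victim] = 0.0f;
        }
    }
}

// Illustrative compressed form: two values per group of four, plus the
// position (0..3) of each stored value inside its group.
struct Compressed24 {
    std::vector<float> values;
    std::vector<std::uint8_t> indices;
};

// Compress a row that already satisfies the 2:4 pattern.
Compressed24 compress_2_4(const std::vector<float>& pruned) {
    assert(pruned.size() % 4 == 0);
    Compressed24 out;
    for (std::size_t g = 0; g < pruned.size(); g += 4) {
        std::uint8_t slots[2] = {0, 0};
        int used = 0;
        for (std::uint8_t i = 0; i < 4; ++i) {
            if (pruned[g + i] != 0.0f) {
                assert(used < 2 && "group violates the 2:4 pattern");
                slots[used++] = i;
            }
        }
        // Groups with fewer than two nonzeros are padded with zero positions.
        for (std::uint8_t i = 0; i < 4 && used < 2; ++i) {
            if (pruned[g + i] == 0.0f) slots[used++] = i;
        }
        if (slots[0] > slots[1]) std::swap(slots[0], slots[1]);
        for (std::uint8_t s : slots) {
            out.values.push_back(pruned[g + s]);
            out.indices.push_back(s);
        }
    }
    return out;
}

// Expand the compressed form back to a dense row (for checking round trips).
std::vector<float> decompress_2_4(const Compressed24& c) {
    std::vector<float> row(c.values.size() * 2, 0.0f);
    for (std::size_t s = 0; s < c.values.size(); ++s) {
        row[(s / 2) * 4 + c.indices[s]] = c.values[s];
    }
    return row;
}
```

The compressed form stores half as many values as the dense row plus a small index per stored value, which is the source of the memory and bandwidth savings. In practice, pruning and compression should be done with the library's own routines so that the data matches the layout the sparse kernels expect.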

Support#

  • Supported SM Architectures: SM 8.0, SM 8.6, SM 8.7, SM 8.9, SM 9.0

  • Supported CPU architectures and operating systems:

    OS        CPU archs
    -------   -------------
    Windows   x86_64
    Linux     x86_64, Arm64

Index#