
Tutorial: Polyglot Bindings to Kernel and Host Functions

GPU kernels and host functions can be executed as function calls. The corresponding functions are callable objects that are bound to the respective kernel or host function. grCUDA provides different ways to define these bindings:

  • bind(sharedLibraryFile, functionNameAndSignature) returns a callable object for the specified host function defined in the shared library (.so file).
  • bindkernel(fileName, kernelNameAndSignature) returns a callable object for the specified kernel function defined in a PTX or cubin file.
  • bindall(targetNamespace, fileName, nidlFileName) registers all functions listed in the NIDL (Native Interface Definition Language) file for the specified binary into the target namespace of grCUDA.

The first two approaches are useful for one-off bindings to a native function, be it a kernel or a host function. bindall() is used to bind multiple functions from the same binary or PTX file. This tutorial shows how to call existing native host functions and kernels from GraalVM languages through grCUDA.

Binding and Invoking prebuilt Host Functions

Host functions can be bound from existing shared libraries by bind() or bindall(). The former returns a single native function as a callable object, whereas the latter binds multiple functions into a specified namespace within grCUDA.

This simple example shows how to call two host functions from a shared library. One function is defined in a C++ namespace. The other function is defined as extern "C", i.e., it is exported via the C ABI. The two functions take a device pointer as argument and launch the same kernel.

The source code for the shared library is as follows.

__global__ void inc_kernel(int * inout_arr, int num_elements) {
  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < num_elements;
       i += gridDim.x * blockDim.x) {
    inout_arr[i] += 1;
  }
}

static cudaError_t check_status_and_sync() {
  auto status = cudaGetLastError();
  if (status == cudaSuccess) {
    status = cudaDeviceSynchronize();
  }
  return status;
}

namespace aa {
namespace bb {
  // function defined in C++ namespace
  int cxx_inc_host(int * dev_inout_arr, int num_elements, int num_blocks, int block_size) {
    inc_kernel<<<num_blocks, block_size>>>(dev_inout_arr, num_elements);
    return check_status_and_sync();
  }
} // end of namespace bb
} // end of namespace aa

// function defined with C linkage
extern "C"
int c_inc_host(int * dev_inout_arr, int num_elements, int num_blocks, int block_size) {
  inc_kernel<<<num_blocks, block_size>>>(dev_inout_arr, num_elements);
  return check_status_and_sync();
}

Build the shared library (Linux).

nvcc -o libincrement.so -std=c++11 --shared -Xcompiler -fPIC increment.cu

bind() can be used to "import" a single function into grCUDA as shown in the following NodeJS/JavaScript example.

const cu = Polyglot.eval('grcuda', 'CU')

// bind to C++ function in namespace aa::bb
const cxx_inc_host = cu.bind('libincrement.so',
  `cxx aa::bb::cxx_inc_host(inout_arr: inout pointer sint32, n_elements: sint32,
                            n_blocks: sint32, block_size: sint32): sint32`)

// bind to extern "C" function
const c_inc_host = cu.bind('libincrement.so',
  `c_inc_host(inout_arr: inout pointer sint32, n_elements: sint32,
              n_blocks: sint32, block_size: sint32): sint32`)

const n = 100
let deviceArray = cu.DeviceArray('int', n)
for (let i = 0; i < n; i++) {
  deviceArray[i] = i
}

const res1 = cxx_inc_host(deviceArray, n, 32, 256)
const res2 = c_inc_host(deviceArray, n, 32, 256)
console.log(`res1: `, res1)
console.log(`res2: `, res2)

for (const el of deviceArray) {
  console.log(el)
}

bind() takes the name (or path) of the shared library as its first argument. The second argument specifies the signature in NIDL format. Add the keyword cxx for C++-style functions; the C++ namespace can be specified using ::. Without cxx, grCUDA assumes C linkage for the function and does not apply any name mangling. bind() returns the function object as a callable, i.e., a TruffleObject instance for which isExecutable() is true.
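
For comparison, the same two bindings can be created from Python. This is a minimal sketch following the JavaScript example above, assuming libincrement.so is in the working directory:

import polyglot

cu = polyglot.eval(language='grcuda', string='CU')

# bind to the C++ function in namespace aa::bb (note the cxx keyword)
cxx_inc_host = cu.bind('libincrement.so',
    """cxx aa::bb::cxx_inc_host(inout_arr: inout pointer sint32, n_elements: sint32,
                                n_blocks: sint32, block_size: sint32): sint32""")

# bind to the extern "C" function (no name mangling)
c_inc_host = cu.bind('libincrement.so',
    """c_inc_host(inout_arr: inout pointer sint32, n_elements: sint32,
                  n_blocks: sint32, block_size: sint32): sint32""")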

The two functions can be bound in a single step using bindall() as shown below:

...
// bind host functions into namespace 'increment'
cu.bindall('increment', 'libincrement.so', 'increment_host.nidl')

const res1 = cu.increment.cxx_inc_host(deviceArray, n, 32, 256)
const res2 = cu.increment.c_inc_host(deviceArray, n, 32, 256)
...

The bindings are specified in a file increment_host.nidl:

hostfuncs aa::bb {
  cxx_inc_host(inout_arr: inout pointer sint32, n_elements: sint32,
               n_blocks: sint32, block_size: sint32): sint32
}

chostfuncs {
  c_inc_host(inout_arr: inout pointer sint32, n_elements: sint32,
             n_blocks: sint32, block_size: sint32): sint32
}

The C++ functions are enclosed in a hostfuncs scope. The C++ namespace (e.g., aa::bb in the example above) is specified with the scope. Functions with C linkage have to be placed inside a chostfuncs scope.

All functions are registered under their names as they appear in the NIDL file, within the namespace that is provided as the first argument of bindall(), e.g., increment.
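
The registered namespace is then accessible from any Graal language. A minimal Python sketch, assuming libincrement.so and increment_host.nidl are in the working directory:

import polyglot

cu = polyglot.eval(language='grcuda', string='CU')
cu.bindall('increment', 'libincrement.so', 'increment_host.nidl')

n = 100
device_array = cu.DeviceArray('int', n)
for i in range(n):
    device_array[i] = i

res1 = cu.increment.cxx_inc_host(device_array, n, 32, 256)
res2 = cu.increment.c_inc_host(device_array, n, 32, 256)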

Binding and Launching prebuilt GPU Kernels

Existing CUDA kernels that are compiled into cubin binaries or PTX assembly files can be bound to callable objects using the bindkernel() function and subsequently invoked from the various host languages in the Graal/Truffle ecosystem. This is analogous to bind() for host functions.

As in the previous example for external host functions, we again use two kernels. One kernel is defined in a C++ namespace, the other is defined with C linkage. The kernels are compiled into a cubin binary.

The source code for the cubin is as follows.

namespace aa {
namespace bb {
// kernel in C++ namespace
__global__ void inc_kernel(int * inout_arr, int num_elements) {
  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < num_elements;
       i += gridDim.x * blockDim.x) {
    inout_arr[i] += 1;
  }
}

} // end of namespace bb
} // end of namespace aa

// kernel with C linkage
extern "C"
__global__ void c_inc_kernel(int * inout_arr, int num_elements) {
  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < num_elements;
       i += gridDim.x * blockDim.x) {
    inout_arr[i] += 1;
  }
}

The kernels are compiled into a cubin binary by nvcc. Note that the architecture and code version in gencode must match your GPU generation. The options below are for a GeForce RTX (Turing, compute capability 7.5) device.

nvcc -cubin -gencode=arch=compute_75,code=sm_75 \
   -o increment.cubin increment_kernels.cu

bindkernel() "imports" a single kernel function into grCUDA and returns the kernel as a callable object that can be invoked like a function. The parameters are the kernel grid size and, optionally, the amount of dynamic shared memory. This is analogous to the kernel launch configuration that is specified between <<< >>> in CUDA.

The following example shows how the kernels are bound and called from NodeJS/JavaScript.

const cu = Polyglot.eval('grcuda', 'CU')

// bind to kernel function in namespace aa::bb
const inc_kernel = cu.bindkernel('increment.cubin',
  `cxx aa::bb::inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)`)

// bind to kernel defined with C linkage
const c_inc_kernel = cu.bindkernel('increment.cubin',
  `c_inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)`)

const n = 100
let deviceArray = cu.DeviceArray('int', n)
for (let i = 0; i < n; i++) {
  deviceArray[i] = i
}

const numBlocks = 32
const blockSize = 256

inc_kernel(numBlocks, blockSize)(deviceArray, n)
c_inc_kernel(numBlocks, blockSize)(deviceArray, n)

for (const el of deviceArray) {
  console.log(el)
}

In the host language, kernel launches are function evaluations with two argument lists, the first specifying the kernel grid, the second the arguments that are passed to the kernel. inc_kernel(numBlocks, blockSize)(deviceArray, n) corresponds to inc_kernel<<<numBlocks, blockSize>>>(deviceArrayPointer, n) in CUDA C/C++.

Just as for host functions, multiple kernel functions can be bound at the same time using bindall() from a cubin or PTX file.

const cu = Polyglot.eval('grcuda', 'CU')

// bind kernel functions into namespace 'increment'
cu.bindall('increment', 'increment.cubin', 'increment_kernels.nidl')

const n = 100
let deviceArray = cu.DeviceArray('int', n)
for (let i = 0; i < n; i++) {
  deviceArray[i] = i
}

const numBlocks = 32
const blockSize = 256

cu.increment.inc_kernel(numBlocks, blockSize)(deviceArray, n)
cu.increment.c_inc_kernel(numBlocks, blockSize)(deviceArray, n)

for (const el of deviceArray) {
  console.log(el)
}

The bindings are specified in a file increment_kernels.nidl:

kernels aa::bb {
  inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)
}

ckernels {
  c_inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)
}

If a kernel function is not declared extern "C", nvcc generates a mangled C++ symbol for it. Such kernels can be enclosed in a kernels scope in the NIDL file and subsequently bound in one step. As in hostfuncs for C++ host functions, a C++ namespace can also be specified in kernels. grCUDA then searches all functions within the scope in this namespace.

Kernel functions defined with extern "C" can be bound in a ckernels scope. This scope specifies that no name mangling is used for the kernels within it.

All kernel functions are registered under their names as they appear in the NIDL file, within the namespace that is provided as the first argument of bindall(), e.g., increment.
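
As with host functions, the registered kernels are then callable from any Graal language. A minimal Python sketch mirroring the JavaScript example above, assuming increment.cubin and increment_kernels.nidl are in the working directory:

import polyglot

cu = polyglot.eval(language='grcuda', string='CU')
cu.bindall('increment', 'increment.cubin', 'increment_kernels.nidl')

n = 100
device_array = cu.DeviceArray('int', n)
for i in range(n):
    device_array[i] = i

# launch the C++ kernel, then the C-linkage kernel, from the namespace
cu.increment.inc_kernel(32, 256)(device_array, n)
cu.increment.c_inc_kernel(32, 256)(device_array, n)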

Runtime Compilation of GPU Kernels from CUDA C/C++

grCUDA can also compile GPU kernels directly from CUDA C/C++ source code passed as a host-language string to buildkernel(..). The signature of the function is:

buildkernel(cudaSourceString, signatureString)

The cudaSourceString argument contains the kernel source code in CUDA C/C++ as a host-language string.

The second argument, signatureString, specifies the signature of the kernel function, i.e., the name of the function and its parameters. The syntax of the signature string is identical to that of the corresponding argument in the bindkernel(..) function.

Here is an example in Python that defines the kernel function inc_kernel as a CUDA C/C++ string and uses buildkernel(), which in turn leverages the NVIDIA Runtime Compiler (NVRTC), to create a callable kernel object.

import polyglot

cu = polyglot.eval(language='grcuda', string='CU')

kernel_source = """
__global__
void inc_kernel(int* arr_inout, size_t n_elements) {
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n_elements;
       i += blockDim.x * gridDim.x) {
    arr_inout[i] += 1;
  }
}"""
inc_kernel = cu.buildkernel(kernel_source,
    'inc_kernel(arr_inout: inout pointer sint32, n_elements: uint64)')

n = 100
device_array = cu.DeviceArray('int', n)
for i in range(n):
    device_array[i] = i

num_blocks = 32
block_size = 256
inc_kernel(num_blocks, block_size)(device_array, n)

print(device_array)

Launching Kernels

Once a kernel function is bound to a callable host object or registered as a function within grCUDA, it can be launched like a function with two argument lists (for the exceptions in Ruby and Java, see the examples below).

kernel(num_blocks, block_size)(arg1, ..., argN)

This form is based on the triple angle brackets syntax of CUDA C/C++.

kernel<<<num_blocks, block_size>>>(arg1, ..., argN)

The first argument list corresponds to the launch configuration, i.e., the kernel grid (number of blocks) and the block size (number of threads per block).

grCUDA currently only supports synchronous kernel launches, i.e., there is an implicit cudaDeviceSynchronize() after every launch.

Examples:

num_blocks = 160
block_size = 1024
# concatenated invocation using both argument lists
kernel(num_blocks, block_size)(out_arr, in_arr, num_elements)

# applying the launch config yields a kernel with a configured grid
configured_kernel = kernel(num_blocks, block_size)
configured_kernel(out_arr, in_arr, num_elements)

grCUDA also supports 2D and 3D kernel grids, which are specified with dim3 in CUDA C/C++. In grCUDA, num_blocks and block_size can be integers for 1-dimensional kernels, or host-language sequences of length 1, 2, or 3 (lists or tuples in Python, arrays in JavaScript and Ruby, and vectors in R).

matmult((num_blocks_x, num_blocks_y),
        (block_size_x, block_size_y))(matrix_A, matrix_B, n, m, k)
fft3d((nx, ny, nz), (bx, by, bz))(matrix_A, matrix_B, n, m, k)
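
A concrete 2D launch looks as follows in Python. This is a sketch that compiles a hypothetical element-wise kernel add2d_kernel with buildkernel (introduced above); the kernel name and signature are assumptions for illustration:

import polyglot

cu = polyglot.eval(language='grcuda', string='CU')

kernel_source = """
__global__ void add2d_kernel(int* a, int* b, int nx, int ny) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < nx && y < ny) {
    a[y * nx + x] += b[y * nx + x];  // element-wise add on a flattened 2D array
  }
}"""
add2d_kernel = cu.buildkernel(kernel_source,
    'add2d_kernel(a: inout pointer sint32, b: inout pointer sint32, nx: sint32, ny: sint32)')

nx, ny = 64, 64
a = cu.DeviceArray('int', nx * ny)
b = cu.DeviceArray('int', nx * ny)

# both the grid and the block are 2-dimensional: tuples of length 2
# (4 x 4 blocks of 16 x 16 threads cover the 64 x 64 elements)
add2d_kernel((4, 4), (16, 16))(a, b, nx, ny)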

Additionally, the number of bytes to allocate for shared memory per thread block is passed as the third argument in the kernel configuration call, as in CUDA's <<< >>> notation.

num_shmem_bytes = 1024
kernel_using_shmem(num_blocks, block_size, num_shmem_bytes)(
                   inputA, inputB, results, N)
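
As a self-contained sketch of how dynamic shared memory fits together with buildkernel, here is a hypothetical Python example; block_sum_kernel and its signature are assumptions for illustration:

import polyglot

cu = polyglot.eval(language='grcuda', string='CU')

kernel_source = """
__global__ void block_sum_kernel(int* in_arr, int* block_sums, int n) {
  extern __shared__ int smem[];  // sized by the third launch-config argument
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  smem[tid] = (i < n) ? in_arr[i] : 0;
  __syncthreads();
  // tree reduction within the block (block size must be a power of two)
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) smem[tid] += smem[tid + s];
    __syncthreads();
  }
  if (tid == 0) block_sums[blockIdx.x] = smem[0];
}"""
block_sum_kernel = cu.buildkernel(kernel_source,
    'block_sum_kernel(in_arr: inout pointer sint32, block_sums: inout pointer sint32, n: sint32)')

n = 1024
num_blocks, block_size = 4, 256
in_arr = cu.DeviceArray('int', n)
block_sums = cu.DeviceArray('int', num_blocks)
for i in range(n):
    in_arr[i] = 1

# third launch-config argument: bytes of dynamic shared memory per block
block_sum_kernel(num_blocks, block_size, block_size * 4)(in_arr, block_sums, n)
print(block_sums)  # each of the 4 entries should be 256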

The syntax of most Graal languages allows a concatenated invocation with both argument lists. The exceptions are Ruby and Java:

In Ruby, callables need to be invoked using the call(..) method, resulting in the following chained invocation:

kernel.call(num_blocks, block_size).call(out_arr, in_arr, num_elements)

For Java, the execute() method needs to be used to invoke a Truffle callable. Thus, a kernel needs to be launched as follows from Java:

kernel.execute(numBlocks, blockSize).execute(outArr, inArr, numElements)

Host Language Examples

This section shows how to bind an existing kernel function and launch the kernel from different host languages.

Python

import polyglot

cu = polyglot.eval(language='grcuda', string='CU')

inc_kernel = cu.bindkernel('increment.cubin',
    'c_inc_kernel(arr_inout: inout pointer sint32, n_elements: sint32)')

n = 100
device_array = cu.DeviceArray('int', n)
for i in range(n):
    device_array[i] = i

num_blocks = 32
block_size = 256
inc_kernel(num_blocks, block_size)(device_array, n)

print(device_array)

NodeJS/JavaScript

const cu = Polyglot.eval('grcuda', 'CU')

const inc_kernel = cu.bindkernel('increment.cubin',
  'c_inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)')

const n = 100
let deviceArray = cu.DeviceArray('int', n)
for (let i = 0; i < n; i++) {
  deviceArray[i] = i
}

const numBlocks = 32
const blockSize = 256
inc_kernel(numBlocks, blockSize)(deviceArray, n)

for (const el of deviceArray) {
  console.log(el)
}

Ruby

cu = Polyglot.eval("grcuda", "CU")

inc_kernel = cu.bindkernel("increment.cubin",
  "c_inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)")

num_elements = 100
device_array = cu.DeviceArray("int", num_elements)
for i in 0...num_elements do
  device_array[i] = i
end

num_blocks = 32
block_size = 256
inc_kernel.call(num_blocks, block_size).call(device_array, num_elements)

puts (0...num_elements).reduce("") {
  |agg, i| agg + device_array[i].to_s + " "
}

R

cu <- eval.polyglot('grcuda', 'CU')

inc_kernel <- cu$bindkernel('increment.cubin',
  'c_inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)')

num_elements <- 100
device_array <- cu$DeviceArray('int', num_elements)
for (i in 1:num_elements) {
  device_array[i] <- i
}

num_blocks <- 32
block_size <- 256
inc_kernel(num_blocks, block_size)(device_array, num_elements)

print(device_array)

Java

import java.util.stream.IntStream;
import java.util.stream.Collectors;
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

class Increment {
  public static void main(String[] args) {
    Context polyglot = Context.newBuilder().allowAllAccess(true).build();
    Value cu = polyglot.eval("grcuda", "CU");

    Value incKernel = cu.invokeMember("bindkernel", "increment.cubin",
        "c_inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)");

    int numElements = 1000;
    Value deviceArray = cu.invokeMember("DeviceArray", "int", numElements);
    for (int i = 0; i < numElements; ++i) {
      deviceArray.setArrayElement(i, i);
    }

    int gridSize = 32;
    int blockSize = 256;
    incKernel.execute(gridSize, blockSize).execute(deviceArray, numElements);

    System.out.println(
      IntStream.range(0, numElements)
        .mapToObj(deviceArray::getArrayElement)
        .map(Object::toString)
        .collect(Collectors.joining(", "))
    );
  }
}

C through Sulong

#include <stdio.h>
#include <polyglot.h>

int main() {
  void * cu = polyglot_eval("grcuda", "CU");

  const char * charset = "ISO-8859-1";
  typedef void (*kernel_type) (int*, int);
  const kernel_type (*inc_kernel)(int, int) = polyglot_invoke(
    cu,
    "bindkernel",
    polyglot_from_string("increment.cubin", charset),
    polyglot_from_string(
      "c_inc_kernel(inout_arr: inout pointer sint32, n_elements: sint32)",
      charset)
  );

  const int num_elements = 1000;
  int32_t * device_array = polyglot_as_i32_array(
    polyglot_invoke(cu, "DeviceArray",
      polyglot_from_string("int", charset), num_elements));
  for (int i = 0; i < num_elements; ++i) {
    device_array[i] = i;
  }

  const int num_blocks = 32;
  const int block_size = 256;
  inc_kernel(num_blocks, block_size)(device_array, num_elements);

  for (int i = 0; i < num_elements; ++i) {
    printf("%d\n", device_array[i]);
  }
}

Compile and run the example using Sulong's LLVM interpreter:

clang -g -O1 -c -emit-llvm -I${GRAALVM_DIR}/jre/languages/llvm/include increment.c
lli --polyglot --jvm --vm.Dtruffle.class.path.append=${GRCUDA_JAR} increment.bc