
I have followed the guidance offered by this question and this link, which deal with passing arrays of pointers to a device and back, but I seem to be struggling with my specific case, where the pointers point to objects. See the example code below, where I have removed error checking for brevity.

// Kernel
__global__ void myKernel(Obj** d_array_of_objs)
{
    // Change the scalar of each object to 5
    // by dereferencing device array to get 
    // appropriate object pointer.
    *d_array_of_objs->changeToFive();    <--------- SEE QUESTION 4
}

// Entry point
int main()
{

    /********************************/
    /* INITIALISE OBJ ARRAY ON HOST */
    /********************************/

    // Array of 3 pointers to Objs
    Obj* h_obj[3];
    for (int i = 0; i < 3; i++) {
        h_obj[i] = new Obj();       // Create
        h_obj[i]->scalar = i * 10;  // Initialise
    }

    // Write out
    for (int i = 0; i < 3; i++) {
        std::cout << h_obj[i]->scalar << std::endl;
    }


    /**************************************************/
    /* CREATE DEVICE VERSIONS AND STORE IN HOST ARRAY */
    /**************************************************/

    // Create host pointer to array-like storage of device pointers
    Obj** h_d_obj = (Obj**)malloc(sizeof(Obj*) * 3);    <--------- SEE QUESTION 1
    for (int i = 0; i < 3; i++) {
        // Allocate space for an Obj and assign
        cudaMalloc((void**)&h_d_obj[i], sizeof(Obj));
        // Copy the object to the device (only has single scalar field to keep it simple)
        cudaMemcpy(h_d_obj[i], &(h_obj[i]), sizeof(Obj), cudaMemcpyHostToDevice);
    }

    /**************************************************/
    /* CREATE DEVICE ARRAY TO PASS POINTERS TO KERNEL */
    /**************************************************/

    // Create a pointer which will point to device memory
    Obj** d_d_obj = nullptr;
    // Allocate space for 3 pointers on device at above location
    cudaMalloc((void**)&d_d_obj, sizeof(Obj*) * 3);
    // Copy the pointers from the host memory to the device array
    cudaMemcpy(d_d_obj, h_d_obj, sizeof(Obj*) * 3, cudaMemcpyHostToDevice);


    /**********
     * After the above, VS2013 shows the memory pointed to by d_d_obj 
     * to be NULL <------- SEE QUESTION 2.
     **********/


    // Launch Kernel
    myKernel <<<1, 3>>>(d_d_obj);

    // Synchronise and pass back to host
    cudaDeviceSynchronize();
    for (int i = 0; i < 3; i++) {
        cudaMemcpy(&(h_obj[i]), h_d_obj[i], sizeof(Obj), cudaMemcpyDeviceToHost);     <--------- SEE QUESTION 3
    }

    // Write out
    for (int i = 0; i < 3; i++) {
        std::cout << h_obj[i]->scalar << std::endl;
    }

    return 0;
}

So the questions are:

  1. If the line indicated by SEE QUESTION 1 above allocates host memory for the pointers, and, once I have used cudaMalloc in the subsequent loop to allocate device memory, the pointers pointed to by h_d_obj get overwritten with device addresses, does that mean I have allocated host memory for 3 Obj* that now has no pointer pointing to it?

  2. Why does the cudaMemcpy succeed when I test the returned status, yet apparently not copy the addresses correctly? I was expecting the "arrays" of memory addresses of both h_d_obj and d_d_obj to be the same, since they should point to the same Obj in the device address space.

  3. At the line SEE QUESTION 3 (assuming I'm correct in question 2), I also expect to be able to use either h_d_obj or d_d_obj to retrieve the Obj objects from the device, since the only difference would be whether I dereference a host pointer or a device pointer to reach the device pointer to an Obj, both of which I should be able to do in a cudaMemcpy call, right? If I use what is written, the copy succeeds, but the pointer at h_obj[0] is corrupted and I cannot write out the data.

  4. At the line SEE QUESTION 4, why can I not dereference an Obj** to get an Obj* and then use the -> operator to call a device method? The compiler moans that it is not a pointer to class type, which the fact that it is an Obj* tells me it is.

1 Answer

First of all, it's convenient if you provide a complete code, including a definition for the Obj class. I've provided one based on inspection of your code and some guesswork.

Second, much of your confusion here seems to stem from a less-than-crisp facility with pointers in C (or C++). Using the CUDA API with double-pointer constructions (**) between host and device requires a crisp understanding of, and the ability to visualize, what is happening.

If the line indicated by SEE QUESTION 1 above allocates host memory for the pointers, and, once I have used cudaMalloc in the subsequent loop to allocate device memory, the pointers pointed to by h_d_obj get overwritten with device addresses, does that mean I have allocated host memory for 3 Obj* that now has no pointer pointing to it?

No. h_d_obj is established (i.e. given a meaningful value) by the malloc operation. Nothing you have done subsequent to that modifies the value of h_d_obj. The cudaMalloc calls in the loop overwrite the individual Obj* slots stored in that host allocation (h_d_obj[i]), not h_d_obj itself, so the malloc'd host memory is still pointed to by h_d_obj.
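To make that concrete, here is a minimal sketch (compile as a .cu file; stripped-down Obj, no error checking, and the variable before is just illustrative) showing that the loop only overwrites the Obj* slots inside the malloc'd host block, never h_d_obj itself:

#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

struct Obj { int scalar; };

int main()
{
    Obj** h_d_obj = (Obj**)malloc(sizeof(Obj*) * 3); // h_d_obj -> host block holding 3 Obj* slots
    Obj** before  = h_d_obj;                         // remember the value of h_d_obj itself

    for (int i = 0; i < 3; i++)
        cudaMalloc((void**)&h_d_obj[i], sizeof(Obj)); // writes a device address into slot i of the host block

    assert(h_d_obj == before); // h_d_obj never changed; only the slot contents did
    free(h_d_obj);
    return 0;
}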

Why does the cudaMemcpy succeed when I test the returned status, yet apparently not copy the addresses correctly? I was expecting the "arrays" of memory addresses of both h_d_obj and d_d_obj to be the same, since they should point to the same Obj in the device address space.

I don't see anything wrong with the pointer handling in your code up to this point. The value of h_d_obj was established (previously) by malloc, and its numerical value is an address in host memory. The value of d_d_obj was established by cudaMalloc, and its numerical value is an address in device memory. Numerically, I would expect them to be different. The contents of the two arrays (the three Obj* device addresses), however, should be identical after the cudaMemcpy. Note also that a host debugger generally cannot display the contents of device memory, so what VS2013 shows for the memory pointed to by d_d_obj is not meaningful.
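If you want to see that the contents of the two arrays do match, you can copy the device-side array back into a scratch host array and compare; a sketch you could drop in right after the cudaMemcpy in question (error checking omitted, check is just an illustrative name):

// Fetch the 3 Obj* values stored at d_d_obj (device memory) back into a scratch host array
Obj* check[3];
cudaMemcpy(check, d_d_obj, sizeof(Obj*) * 3, cudaMemcpyDeviceToHost);

for (int i = 0; i < 3; i++)
    std::cout << (check[i] == h_d_obj[i]) << std::endl;   // prints 1 1 1: both arrays hold the same device addresses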

At the line SEE QUESTION 3 (assuming I'm correct in question 2), I also expect to be able to use either h_d_obj or d_d_obj to retrieve the Obj objects from the device, since the only difference would be whether I dereference a host pointer or a device pointer to reach the device pointer to an Obj, both of which I should be able to do in a cudaMemcpy call, right? If I use what is written, the copy succeeds, but the pointer at h_obj[0] is corrupted and I cannot write out the data.

NO. You cannot dereference a device pointer in host code, even if it is a parameter in cudaMemcpy. This is legal as a source or destination in a cudaMemcpy operation:

h_d_obj[i]

This is not legal:

d_d_obj[i]

The reason is that in order to get the actual target address, I must dereference a host pointer (i.e. access a memory location on the host) in the first case, but a device pointer in the second case. From host code, I can retrieve the contents of h_d_obj[i]. I am not allowed to try to retrieve the contents of d_d_obj[i] in host code (and the parameter manipulation for cudaMemcpy is host code). The value of d_d_obj can be used as a destination from host code. d_d_obj[i] cannot.
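If you really did want to go through the device-side array, you would first have to copy the stored pointer value itself back to the host, and only then use it as the source of the object copy. A sketch of what the loop body at SEE QUESTION 3 could look like in that case (error checking omitted, d_ptr is just an illustrative name):

// Not allowed: cudaMemcpy(h_obj[i], d_d_obj[i], ...) -- evaluating d_d_obj[i] dereferences a device pointer in host code.

// Allowed: first fetch the i-th Obj* out of the device array into host memory.
// d_d_obj + i is just pointer arithmetic, not a dereference, so it is fine in host code.
Obj* d_ptr = nullptr;
cudaMemcpy(&d_ptr, d_d_obj + i, sizeof(Obj*), cudaMemcpyDeviceToHost);

// Now d_ptr is a device address held in a host variable, so it can be the source of the real copy.
cudaMemcpy(h_obj[i], d_ptr, sizeof(Obj), cudaMemcpyDeviceToHost);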

At the line SEE QUESTION 4, why can I not dereference an Obj** to get an Obj* and then use the -> operator to call a device method? The compiler moans that it is not a pointer to class type, which the fact that it is an Obj* tells me it is.

The compiler is barking at you because of the order of operations (precedence) of the operators (*, ->) you are using: -> binds more tightly than the unary *, so your statement parses as *(d_array_of_objs->changeToFive()), i.e. -> is applied to an Obj**, which is indeed not a pointer to class type. If you add parentheses to get the order you intended:

(*d_array_of_objs)->changeToFive(); 

Then the compiler won't object to that (although I would do it slightly differently as below).

Here's a modified version of your code with the Obj definition added, a slight change to the kernel so that independent threads work on independent objects, and a few other fixes. Your code was mostly correct:

$ cat t1231.cu
#include <iostream>

class Obj{

  public:
  int scalar;
  __host__ __device__
  void changeToFive() {scalar = 5;}
};

// Kernel
__global__ void myKernel(Obj** d_array_of_objs)
{
    // Change the scalar of each object to 5
    // by dereferencing device array to get
    // appropriate object pointer.
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    // (*d_array_of_objs)->changeToFive();  //  <--------- SEE QUESTION 4 (add parentheses)
    d_array_of_objs[idx]->changeToFive();
}

// Entry point
int main()
{

    /********************************/
    /* INITIALISE OBJ ARRAY ON HOST */
    /********************************/

    // Array of 3 pointers to Objs
    Obj* h_obj[3];
    for (int i = 0; i < 3; i++) {
        h_obj[i] = new Obj();       // Create
        h_obj[i]->scalar = i * 10;  // Initialise
    }

    // Write out
    for (int i = 0; i < 3; i++) {
        std::cout << h_obj[i]->scalar << std::endl;
    }


    /**************************************************/
    /* CREATE DEVICE VERSIONS AND STORE IN HOST ARRAY */
    /**************************************************/

    // Create host pointer to array-like storage of device pointers
    Obj** h_d_obj = (Obj**)malloc(sizeof(Obj*) * 3); //    <--------- SEE QUESTION 1
    for (int i = 0; i < 3; i++) {
        // Allocate space for an Obj and assign
        cudaMalloc((void**)&h_d_obj[i], sizeof(Obj));
        // Copy the object to the device (only has single scalar field to keep it simple)
        cudaMemcpy(h_d_obj[i], h_obj[i], sizeof(Obj), cudaMemcpyHostToDevice);  // copy from h_obj[i] (the object), not &(h_obj[i]) (the pointer)
    }

    /**************************************************/
    /* CREATE DEVICE ARRAY TO PASS POINTERS TO KERNEL */
    /**************************************************/

    // Create a pointer which will point to device memory
    Obj** d_d_obj = NULL;
    // Allocate space for 3 pointers on device at above location
    cudaMalloc((void**)&d_d_obj, sizeof(Obj*) * 3);
    // Copy the pointers from the host memory to the device array
    cudaMemcpy(d_d_obj, h_d_obj, sizeof(Obj*) * 3, cudaMemcpyHostToDevice);


    /**********
     * After the above, VS2013 shows the memory pointed to by d_d_obj
     * to be NULL <------- SEE QUESTION 2.
     **********/


    // Launch Kernel
    myKernel <<<1, 3>>>(d_d_obj);

    // Synchronise and pass back to host
    cudaDeviceSynchronize();
    for (int i = 0; i < 3; i++) {
        cudaMemcpy(h_obj[i], h_d_obj[i], sizeof(Obj), cudaMemcpyDeviceToHost);  //   <--------- SEE QUESTION 3: copy to h_obj[i], not &(h_obj[i])
    }

    // Write out
    for (int i = 0; i < 3; i++) {
        std::cout << h_obj[i]->scalar << std::endl;
    }

    return 0;
}
$ nvcc -o t1231 t1231.cu
$ cuda-memcheck ./t1231
========= CUDA-MEMCHECK
0
10
20
5
5
5
========= ERROR SUMMARY: 0 errors
$

A diagram of h_d_obj and d_d_obj might help:

HOST                               |    DEVICE
h_d_obj-->(Obj *)-------------------------->Obj0<---(Obj *)<----|
          (Obj *)-------------------------->Obj1<---(Obj *)     |
          (Obj *)-------------------------->Obj2<---(Obj *)     |
                                   |                            |
d_d_obj---------------------------------------------------------|
HOST                               |    DEVICE

You're allowed to access any quantity (location) on the left hand side (HOST) of the above diagram, in host code, or in a cudaMemcpy operation. You're not allowed to access any quantity (location) on the right hand side, in host code.

  • Thank you @Robert Crovella for your answer. The diagram is very helpful. Also thank you for being so polite in describing my confusion as "less-than-crisp", very kind of you :) Commented Aug 17, 2016 at 7:27
  • I think I understand my misunderstanding in question 1. Obj** h_d_obj = (Obj**)malloc(sizeof(Obj*) * 3); allocates enough host memory to hold 3 pointers and returns a pointer to that memory in h_d_obj. Then cudaMalloc((void**)&h_d_obj[i], sizeof(Obj)); dereferences h_d_obj[i] to get the i-th pointer in host memory (although this pointer has never been set), then takes the address of this pointer with &, into which cudaMalloc writes the address of an Obj in device memory. So h_d_obj points to host memory holding device pointers, and although h_d_obj[i] has changed, h_d_obj itself hasn't. Commented Aug 17, 2016 at 7:32
