
I'm using GCC 9.3 on Ubuntu 20.04 and want to offload the classic SAXPY example to the GPU using OpenMP. I installed GCC's offloading support with sudo apt install gcc-9-offload-nvptx, then compiled the following code with g++ -fopenmp main.cpp:

#include <cstddef>  // for size_t

int main()
{
    const size_t kNumel = 999999;

    float x[kNumel];
    float y[kNumel];

    for (size_t i = 0; i < kNumel; i++)
    {
        x[i] = i;
        y[i] = i;
    }
    const float kCoef = 1.23f;

    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < kNumel; i++)
    {
        y[i] = kCoef*x[i] + y[i];
    }

    return 0;
}

But it fails to compile with this error:

lto1: error: ‘-fcf-protection=full’ is not supported for this target
mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-9 returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /usr/lib/gcc/x86_64-linux-gnu/9//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

I added -fno-stack-protector, but the same error persists.

  • Ubuntu and its default flags... The error message is pretty clear though, I don't know why you think that -fno-stack-protector is the way to disable -fcf-protection=full... Commented Feb 11, 2021 at 17:28
  • It was mentioned in other forums that -fno-stack-protector is required for Ubuntu 18.04, so I tried it. -fcf-protection accepts these values: [full, branch, return, check, none]. I tested all of them with no success. :( When it is set to "none" or "check", the compiler throws this: unresolved symbol __stack_chk_fail Commented Feb 12, 2021 at 5:46
  • Managing to change the error message is progress, "none" seems good. Now to try and solve the next issue, the unresolved symbol ;-) Commented Feb 12, 2021 at 9:30
  • Thanks @MarcGlisse. I put both -fcf-protection=none and -fno-stack-protector. It compiles now, but the runtime is orders of magnitude slower than the CPU version. I have to find good OpenMP clauses. Commented Feb 12, 2021 at 10:50
  • The GPU would chew through the data in a matter of microseconds. It is the time it takes to allocate data buffers on the GPU, copy the data from the host to the GPU, launch the kernel, wait for it to finish, and copy the data back from the GPU to the host, that makes the code orders of magnitude slower. You need to put MUCH more work in that loop. Commented Feb 12, 2021 at 22:49
