387 questions
0
votes
0
answers
18
views
Code Optimization for For Loops in Task Parameterized Gaussian Mixture Models
I'm currently developing TPGMM from Calinon's work (https://calinon.ch/papers/Calinon-JIST2015.pdf). I'm dealing with large matrix operations using CuPy as shown in the code below. However, I'm having ...
0
votes
1
answer
20
views
cupy.nanargmax throwing exception
I have a 2D array allocated on the GPU and I need to use CuPy's nanargmax() function to find the index of the maximum value in each row. Some of the values could be NaN. Since the 2D array is quite large (...
0
votes
1
answer
73
views
Named symbol not found when using CuPy to invoke a CUDA kernel
This is my CUDA kernel: https://pastebin.com/ti95Qy2p, and I want to invoke the compute_linear_recurrence method in this kernel.
But when I use this code:
import cupy as cp
# code_str is code in https://...
0
votes
1
answer
37
views
CompileException occurs when compiling a .cu file with CuPy
I have a .cu file with these headers:
#include </usr/include/features.h>
#include </usr/include/assert.h>
#include </usr/include/stdio.h>
When I use nvcc command to compile this file, ...
1
vote
1
answer
172
views
Batched matrix multiplication with JAX on GPU faster with larger matrices
I'm trying to perform batched matrix multiplication with JAX on GPU, and noticed that it is ~3x faster to multiply shapes (1000, 1000, 3, 35) @ (1000, 1000, 35, 1) than it is to multiply (1000, 1000, ...
1
vote
1
answer
88
views
Understanding the permutation test
I'm attempting to optimize the performance of the permutation test implemented in scipy.stats. My dataset consists of 500,000 observations, each associated with 2,000 binary covariates. I've applied ...
0
votes
1
answer
66
views
Raw kernel with dynamically allocated shared memory
Consider the following CUDA kernel that is used in Python via CuPy from the CuPy docs
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float*...
3
votes
1
answer
128
views
Why is (x / y)[i] faster than x[i] / y[i]?
I'm new to CuPy and CUDA/GPU computing. Can someone explain why (x / y)[i] is faster than x[i] / y[i]?
When taking advantage of GPU accelerated computations, are there any guidelines that would allow me ...
0
votes
0
answers
16
views
AttributeError: module 'cupy' has no attribute 'ctypeslib'
I'm trying to do the equivalent of from numpy.ctypeslib import ndpointer, but importing from cupy instead.
Any ideas when this will be implemented or if there is a workaround for this?
1
vote
0
answers
106
views
Solving sparse linear system on GPU is much slower than CPU
Below is the code which solves a sparse linear system:
import cupyx
import cupyx.scipy.sparse.linalg
import time
import scipy
import scipy.sparse.linalg
import pathlib
file_dir = str(pathlib.Path(...
1
vote
0
answers
29
views
Manual indexing with multidimensional cupy ndarray in user defined kernels
In the cupy docs on user defined kernels (https://docs.cupy.dev/en/stable/user_guide/kernel.html), there is a section defining certain variables that are predefined, like _ind.size() and i for things ...
0
votes
0
answers
15
views
How do I redirect the Cuda kernel IO in Cupy?
import sys
class Logger:
    def __init__(self, filename):
        self.console = sys.stdout
        self.file = open(filename, 'w')
    def write(self, message):
        self.console.write(message)
...
0
votes
0
answers
34
views
Any support for cuTENSORMg in python?
I have recently been looking into cuTENSOR (+cupy) for a speedy tensor contraction GPU library, and have been wanting to extend my single GPU code to multi GPU distributed code via cuTENSORMg;
however,...
0
votes
1
answer
97
views
cuSPARSELt not found by CuPy
I have a hard time getting CuPy to detect and use, where applicable, the cuSPARSELt library in Windows. I tried installing versions 0.2.0 (as mentioned by CuPy's installation guide) and 0.6.2 (the ...
0
votes
0
answers
26
views
send/recv block in CuPy
I want to send a cupy array from one node to the other.
The sender has the following code:
import cupy
import cupyx.distributed
import torch.multiprocessing as mp
def send():
    cupy.cuda.Device(0)....
1
vote
1
answer
64
views
Fast square of absolute value of complex numbers with cupy or otherwise
When one is comparing the magnitudes of complex numbers (essentially sqrt(real² + imag²)) to find the largest absolute values, it would suffice to compare the square of the absolute values, thereby ...
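The idea in this excerpt can be sketched as follows. This is a minimal illustration in NumPy, not the asker's code: the helper name `abs2` and the sample values are assumptions. Because CuPy largely mirrors the NumPy array API, the same expression should carry over to CuPy arrays, though that is not verified here.

```python
import numpy as np

def abs2(z):
    """Squared magnitude real**2 + imag**2, computed without a square root.

    `abs2` is a hypothetical helper name used only for this sketch.
    """
    z = np.asarray(z)
    return z.real**2 + z.imag**2

# Illustrative sample data (not from the original question).
z = np.array([3 + 4j, 1 - 1j, -6 + 8j, 0.5j])

# Squaring is monotonic for non-negative values, so the ordering of the
# squared magnitudes matches the ordering of the magnitudes themselves.
idx_by_abs = int(np.argmax(np.abs(z)))
idx_by_abs2 = int(np.argmax(abs2(z)))
```

Both indices point at the same element, so the square root can be skipped when only the ranking of magnitudes matters.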
2
votes
0
answers
39
views
cupy.linalg.solve for positive definite matrix?
It seems like cupy.linalg.solve doesn't have an option to solve the linear system Ax=b assuming A is positive definite?
I am looking for something like scipy.linalg.solve where one can actually tell ...
0
votes
1
answer
193
views
Cannot use GPU, CuPy is not installed
I have a GPU enabled machine.
O.S: Ubuntu 20.04.6 LTS
nvcc version: 12.2
Nvidia Driver Version: 535.183.01
Pytorch version 2.3.1+cu121
spaCy version 3.7.5
Python version 3.8.10
Pipelines : ...
0
votes
0
answers
48
views
Python app using cupy and cupyx fails with cl.exe not found: how to package to work with no cl.exe on target machine
The app is built in Python on Windows 10 and makes heavy use of cupy and cupyx.scipy.ndimage, and a few other cupyx libraries. It is distributable and it works. It now needs to go to a more secure ...
0
votes
0
answers
29
views
Why does the compute sanitizer not detect leaks in CuPy kernels?
kernel = r"""
extern "C" __global__ void entry0() {
    if (threadIdx.x == 0)
        malloc(16);
    return ;
}
"""
import cupy as cp
raw_module = cp....
0
votes
1
answer
188
views
Access CUDAarray in CuPy using pointer from C++
I'm trying to allocate a CUDAarray (as in, texture memory) in C++ and pass the pointer up to CuPy. From there, I would like to treat it as an ndarray.
Many examples show how to cudaMalloc() linear ...
0
votes
1
answer
195
views
Cupy copy numpy array to existing device array
I would like to copy a numpy array into an existing, pre-allocated GPU array.
I've seen that cupy offers the functions copy and copyto, however the former does not allow specifying the destination ...
1
vote
1
answer
241
views
Cannot use GPU for custom spaCy NER model
I'm trying to make a custom NER model using spaCy. When I try to leverage gpu it throws an error stating that Cupy is not installed even though it is. Attaching relevant info below.
> ubuntu@:~$ ...
0
votes
1
answer
252
views
CuPy takes more time to preprocess the image?
import cv2
import numpy as np
import cupy as cp
import time
def op_image(image):
    start_time = time.time()
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (640, ...
0
votes
0
answers
17
views
Assertion error when inserting large data into a spawned process using multiprocessing with a queue
I'm trying to spawn a new process and run a certain function with it that uses cupy to perform heavy computations on the GPU.
Each process has the following worker instance (code snippet not the full ...
0
votes
1
answer
326
views
Open3D can't call function 'read_point_cloud'/Module 'Open3D' has no attribute 'read_point_cloud'
So I cloned a git repository which registers point clouds using probabilistic methods such as GMMs (Gaussian Mixture Models) but also incorporates Open3D. Because the registration is running on ...
1
vote
0
answers
69
views
Ensure same seed generates same random numbers when using numpy and cupy
I import numpy or cupy as follows:
import numpy as np
# import cupy as np
Then I generate X as follows:
np.random.seed(0)
X = np.random.rand(4, 3)
I get two very different matrices depending on ...
0
votes
0
answers
30
views
When compiling Cuda modules with Cupy, how do I get the diagnostic results from the `-ptxas=v` option?
I am not sure if this is even the right question to ask, but when I compile the Cuda Visual Studio example project, there is a --ptxas=v option to turn on diagnostics which prints out the local memory ...
0
votes
0
answers
96
views
How to use cuSOLVER Multi-GPU to get eigenvalues with CuPy
I have a large, dense, symmetric matrix (~50000x50000) whose eigenvalues I want to calculate.
I tried cupy.linalg.eigvalsh(), but ran out of memory on a single GPU. I know the cuSOLVER library ...
4
votes
1
answer
236
views
Diagonalising matrices that are too large for gpu memory
I want to diagonalise matrices which are too large for the amount of memory available on the gpu. I am interested in any approaches that would allow me to gain some speed up over just diagonalising ...
0
votes
0
answers
48
views
How can an application using CuPy be deployed? (VC++ dependency)
I've had no problems using the given documentation to install CuPy and develop with it on my own machine. But I'm seeing a roadblock in deploying applications using CuPy in a commercial setting to ...
0
votes
0
answers
50
views
Achieving parallelism with multiprocessing on batches of data
I have data that I want to process in batches using two GPUs in parallel, so I wrote the following worker class:
class Worker(multiprocessing.Process):
    def __init__(self, gpu_id: int, ...
1
vote
1
answer
67
views
Runtime Error occurs when using torchsummary
My code is like below.
import numpy as np
import torch
import torch.nn as nn
import cupy as cp
from torchviz import make_dot
from torchinfo import summary
from torchsummary import summary as summary_
...
0
votes
0
answers
79
views
cupy.transpose() not working the same as numpy.transpose()
I am trying to get the indices of the non-zero values of an array using cupy. I first used NumPy like so: np.transpose(np.nonzero(a)), which works just fine, but when changing it to cupy cp....
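For reference, the NumPy idiom mentioned in this excerpt can be sketched as below. The array `a` is an illustrative assumption, not the asker's data. NumPy also provides `np.argwhere`, which returns the same index rows for this input; the sketch makes no claim about how the CuPy call in the truncated excerpt behaves.

```python
import numpy as np

# Illustrative input (assumed for this sketch, not from the question).
a = np.array([[0, 2, 0],
              [3, 0, 4]])

# np.nonzero returns one index array per dimension; transposing them
# yields one (row, col) pair per non-zero element.
idx_transpose = np.transpose(np.nonzero(a))

# np.argwhere groups the indices by element directly.
idx_argwhere = np.argwhere(a)
```

Each row of the result holds the coordinates of one non-zero element, in row-major order.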
1
vote
1
answer
299
views
How to get all available devices for CuPy?
How can I get all available devices for CuPy? I'm looking to write a version of this for CuPy:
if is_torch(xp):
    devices = ['cpu']
    import torch # type: ignore[import]
    num_cuda = torch.cuda....
0
votes
0
answers
44
views
Finite difference code for 2D and 3D arrays using CuPy ElementwiseKernel
I am trying to write finite difference Python code for 2D and 3D arrays using the CuPy library. I want to use cupy.ElementwiseKernel() to speed up my code. Currently, I am having trouble writing the ...
0
votes
0
answers
13
views
How do I get the PTX in text form when compiling with CuPy for both NVRTC and NVCC backends?
The warp matrix multiply functions are producing vomit in the SASS output, so I want to study the PTX itself to see whether that is the library's or the PTX's fault.
0
votes
0
answers
20
views
When launching a kernel with cooperative threads, and it reduces the number of blocks per grid, how do I make that an error instead of a warning?
This is for a CuPy kernel. I get the following warning when I try to launch it.
UserWarning: The grid size will be reduced from 48 to 24, as the specified grid size exceeds the limit.
I am doing some ...
2
votes
2
answers
102
views
Making masks based on euclidean distance with pyopencl, arrayfire or another python opencl library
I am doing 2D or 3D binary masks around given coordinates and then identifying them as labels with scipy.ndimage.label.
Now, I have a cupy solution, a numpy solution. Cupy is fast, numpy is very slow, ...
0
votes
0
answers
145
views
cupy_backends.cuda.libs.curand.CURANDError: CURAND_STATUS_INITIALIZATION_FAILED
When I run cupy.random.seed(123), the error below occurred.
>>> cupy.random.seed(123)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "...
0
votes
1
answer
235
views
jitify file not found
I am completely new to Python programming and VS Code. I want to do GPU programming, so I installed the CUDA toolkit, pip installed cupy and tried running GPU code, but I get this runtime error
../...
1
vote
1
answer
235
views
Using CuPy on Maxwell GPU
Anyone here trying to use cupy on a Maxwell GPU? I am trying to do a simple array.mean() operation and getting the message below. Is there a way I can get around this? Do I need to install a different ...
1
vote
0
answers
149
views
Saving a Cupy array directly to JPEG without converting to NumPy
I'm currently facing a challenge in my project where I need to save large Cupy arrays directly to JPEG files without the intermediate step of converting them to NumPy arrays due to performance ...
-2
votes
2
answers
122
views
How do I parallelize a set of matrix multiplications
Consider the following operation, where I take 20 x 20 slices of a larger matrix and dot product them with another 20 x 20 matrix:
import numpy as np
a = np.random.rand(10, 20)
b = np.random.rand(20, ...
3
votes
0
answers
352
views
cupy cooperative_groups.h: [jitify] File not found
from cupyx.scipy.signal import convolve2d as convolve2d_gpu
convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu)
%timeit -n 7 -r 1 convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, ...
0
votes
3
answers
124
views
More efficient way of looping over a multidimensional numpy array other than numpy.where
I have a nested array of shape: [200, 500, 1000].
Each index represents a coordinate of an image, eg array[1, 2, 3] would give me the value of the array at x=1, y=2, and z=3 in coordinate space. I ...
-1
votes
2
answers
127
views
Fast tensor-dot on sparse arrays with GPU in any programming language?
I'm now working on two multi-dimensional arrays arr and cost. arr is dense with size (width, height, m, n) and cost is sparse with size (width, height, width, height).
Values: width and height are ...
3
votes
2
answers
345
views
cupy runtime compilation failed
I'm new to cupy and am trying to learn it.
The following code produces an error with CUDA 11
import numpy
import cupy
def monte_carlo_gpu(n:int, m:int)-> float:
    accum = 0
    for i in range(m):...
0
votes
0
answers
45
views
How to convert chainer.Variable to PyTorch Tensor?
When I try to run a piece of code from neural_renderer, it reports the following error.
The code is based on CUDA 9.2, and I have to upgrade to CUDA 11.1 in order to support the latest GPU,
chainer upgrade ...
0
votes
1
answer
145
views
How do I pass `--gpu-architecture=compute_89` to an NVRTC kernel with CuPy?
cp.RawModule(code=kernel, backend='nvrtc', options=('--gpu-architecture=compute_89',))
When I try to do it like this, I get an error that the option has already been passed in. Do I have to build the ...