All Questions
209 questions
0 votes · 0 answers · 23 views
Processing the same array, dask.array is much slower than numpy.array
import BWStest as bws
import numpy as np
from skimage.measure import label
import dask.array
from tqdm import tqdm
CalWin = [7,25]
stack = []
threshold = 0.05
for i in range(5):
image = np.random....
0 votes · 0 answers · 40 views
Why is joblib's Parallel/delayed faster than dask's map_blocks and compute()?
This question is possibly related to this one. I have a 4D numpy array and would like to apply a function to each 2D slice across the first two dimensions. I have implemented the analysis for both dask ...
0 votes · 0 answers · 65 views
How to apply a function to each 2D slice of a 4D Numpy array in parallel with Dask without running out of RAM?
I want to apply a function to each 2D slice of a 4D Numpy array using Dask. The output should be a 2D matrix (the function applied to each 2D slice returns a single value). I would like to do this in ...
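One common pattern for this (a minimal sketch, not the asker's code; shapes and the per-slice function are assumed) is to chunk the array so each block holds exactly one 2D slice, then reduce every block with map_blocks:
import dask.array as da
def summarize(block):
    # block has shape (1, 1, M, N); reduce the trailing 2D slice to one value
    return block.mean(axis=(2, 3))  # stand-in for the real per-slice function
arr = da.random.random((10, 20, 512, 512), chunks=(1, 1, 512, 512))
out = arr.map_blocks(summarize, drop_axis=(2, 3), dtype=float)
result = out.compute()  # 2D matrix of shape (10, 20)
Because each task only ever touches one slice, peak memory stays near one block regardless of the array's total size.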
0 votes · 1 answer · 88 views
Optimizing Pandas GroupBy and Aggregation on Large Datasets with Multiple Custom Functions
I'm working with a large Pandas dataframe (about 30.5 million rows) where I need to group by multiple columns and apply different custom aggregation functions. However, the performance is currently a ...
1 vote · 0 answers · 70 views
Working with larger-than-memory data in numpy
I am working on a project that involves larger-than-memory 3-dimensional numpy arrays. The project will be deployed with AWS Lambda. I am faced with two design choices:
a) Re-write large parts of the ...
-1 votes · 1 answer · 47 views
Is there an efficient way to update / replace a specific value of a dask array in python?
So I have a dask array of integers (1 x 8192) and I want to find an efficient way to replace a specific value.
This is the code I am currently using, which is very slow because dask arrays are immutable, so ...
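Since dask arrays are immutable, the usual lazy alternative to item assignment is an elementwise da.where (a sketch; the target value and chunking are assumed):
import dask.array as da
arr = da.random.randint(0, 100, size=(1, 8192), chunks=(1, 1024))
# lazily replace every occurrence of 42 with 0, without mutating anything
arr = da.where(arr == 42, 0, arr)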
2 votes · 2 answers · 240 views
How to handle large xarray/dask datasets to minimize computation time and avoid running out of memory when storing as yearly files (ERA5 dataset)
Currently, I am using ERA5-Land data to calculate wind-related variables. While I am able to calculate what I want, I struggle with an efficient implementation to lift this heavy data in a feasible ...
0 votes · 0 answers · 28 views
Read a parquet file in dask and convert it to the correct numpy shape
I am reading a parquet file in dask and trying to reshape it to how I want it, but it seems rather impossible (I am quite new to dask too).
So, I have a parquet file which has some 8M x 384d numpy ...
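Assuming the 384 dimensions are stored as 384 numeric columns (a guess; the file name is hypothetical), a dask dataframe can be turned into a dask array with known chunk sizes via to_dask_array(lengths=True):
import dask.dataframe as dd
ddf = dd.read_parquet("embeddings.parquet")  # hypothetical path
arr = ddf.to_dask_array(lengths=True)        # lengths=True computes partition sizes
print(arr.shape)                             # e.g. (8_000_000, 384), now reshapable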
0 votes · 0 answers · 83 views
How to speed up interpolation in dask
I have a piece of data code that performs interpolation on a large number of arrays.
This is extremely quick with numpy, but:
The data the code will work with in reality will often not fit in memory
...
0 votes · 0 answers · 54 views
How do I use numpy's interpolation with xarray?
I have some code that does a linear interpolation on data.
In the case where the data is numpy arrays it works as I would expect.
amplitude = np.interp(reflected_times, times, trace)
gives (zoomed in ...
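One way to reuse np.interp on DataArrays is to wrap it with xr.apply_ufunc (a sketch with made-up 1D inputs standing in for the asker's reflected_times, times and trace):
import numpy as np
import xarray as xr
times = xr.DataArray(np.linspace(0, 1, 100), dims="s")
trace = np.sin(2 * np.pi * times)                               # DataArray along "s"
reflected_times = xr.DataArray(np.linspace(0.1, 0.9, 50), dims="t")
amplitude = xr.apply_ufunc(
    np.interp,
    reflected_times, times, trace,
    input_core_dims=[["t"], ["s"], ["s"]],
    output_core_dims=[["t"]],
)
# for dask-backed inputs, add dask="parallelized", output_dtypes=[float]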
0 votes · 0 answers · 28 views
Dask query on dates columns
I am trying to filter a huge dask.DataFrame (~800k rows and 30 cols). I want to use the query function of dask.DataFrame.
start_date = np.datetime64(start_date, 'ns')
end_date = np.datetime64(end_date, '...
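For reference, dask's DataFrame.query forwards keyword arguments to pandas, so the date bounds can be passed via local_dict rather than relying on @-resolution of the calling frame (a sketch; the column name 'date' is assumed):
import numpy as np
import pandas as pd
import dask.dataframe as dd
pdf = pd.DataFrame({"date": pd.date_range("2022-01-01", periods=10), "v": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)
start_date = np.datetime64("2022-01-03", "ns")
end_date = np.datetime64("2022-01-07", "ns")
out = ddf.query(
    "date >= @start and date <= @end",
    local_dict={"start": start_date, "end": end_date},
).compute()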
1 vote · 1 answer · 52 views
How to sum radially in a dask array?
I am trying to radially sum the values of a dask array, where I retain the chunked data and sum them for each radius. It may be useful to also normalize the sum to the total number of "pixels" ...
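One chunk-friendly approach is to give every pixel an integer radius label and reduce with da.bincount, which also yields the pixel counts for normalisation (a sketch with an assumed image size and centre):
import dask.array as da
import numpy as np
arr = da.random.random((1024, 1024), chunks=256)
# integer radius of every pixel from the image centre, built in numpy then chunked like arr
yy, xx = np.indices(arr.shape)
r = np.hypot(yy - arr.shape[0] // 2, xx - arr.shape[1] // 2).astype(np.int64)
nbins = int(r.max()) + 1
r = da.from_array(r, chunks=arr.chunks)
sums = da.bincount(r.ravel(), weights=arr.ravel(), minlength=nbins)
counts = da.bincount(r.ravel(), minlength=nbins)
radial_mean = (sums / counts).compute()  # per-radius sum normalised by pixel count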
1 vote · 1 answer · 52 views
Why does dask show a smaller size than the actual size of the data (numpy array)?
Dask shows a slightly smaller size than the actual size of a numpy array. Here is an example of a numpy array that is exactly 32 MB:
import dask as da
import dask.array
import numpy as np
shape = (1000,...
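A likely explanation (worth checking against the actual shape) is units: dask's repr reports binary mebibytes (MiB), while "exactly 32 MB" is usually computed in decimal megabytes:
import numpy as np
x = np.ones((2000, 2000), dtype=np.float64)  # 2000 * 2000 * 8 = 32,000,000 bytes
print(x.nbytes / 1e6)    # 32.0  -> decimal megabytes (MB)
print(x.nbytes / 2**20)  # 30.52 -> binary mebibytes (MiB), what dask displays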
1 vote · 0 answers · 111 views
Load, process and save larger-than-memory array using dask
I have a very large covariance matrix (480,000 x 480,000) stored on disk in a binary format. I want to compute a corresponding whitening matrix, for which I need to compute the SVD of the covariance ...
2 votes · 0 answers · 53 views
Maximum value composite in xarray with dask
I'm trying to perform temporal compositing on multivariate data cubes. The idea, illustrated by the reprex below is for each temporal aggregation (input data have a frequency of 5 days, to be ...
0 votes · 0 answers · 337 views
Typing when passing xarray DataArray objects to numpy ufuncs
I have a function with type annotations that looks like the following:
import xarray as xr
import numpy as np
def compute_relative_azimuth(sat_azi: xr.DataArray, sun_azi: xr.DataArray) -> xr....
0 votes · 1 answer · 108 views
Appending and Inserting with Dask Arrays giving mismatch between chunks and shape
I have a Dask array with shape (1001,256,1,256) (data over 1001 timesteps with len(x)=len(z)=256 and len(y)=1).
I need to pad the x-dimension with arrays of shape (1001,2,1,256), which I'm attempting ...
0 votes · 1 answer · 40 views
How to use Dask to parallelize iterating and updating numpy array
I have an extremely large distance matrix that I need to iterate through each value and update the distance if a condition is true.
Here is my Pandas/Numpy code chunk:
dist_mat = pd.read_csv()
...
3 votes · 1 answer · 70 views
TypeError when running compute that includes map_blocks and reduce
I am having difficulty diagnosing the cause of the error. My code involves running a convolution (with map_blocks) over some arrays if they belong to the same group of variables, otherwise just record ...
0 votes · 1 answer · 36 views
How to label in a dask dataframe with multiple conditions?
In pandas, I can tag a new column with multiple conditions on different columns using np.where(), for example:
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'],
'Sex':...
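A dask equivalent is to run the same np.where/np.select logic per partition (a sketch; the 'Age' column and the labels are invented for illustration):
import numpy as np
import pandas as pd
import dask.dataframe as dd
pdf = pd.DataFrame({"Name": ["A", "B", "C"], "Sex": ["M", "F", "M"], "Age": [10, 25, 40]})
ddf = dd.from_pandas(pdf, npartitions=2)
def tag(part):
    conds = [(part["Sex"] == "M") & (part["Age"] < 18), part["Sex"] == "F"]
    return part.assign(label=np.select(conds, ["boy", "woman"], default="man"))
ddf = ddf.map_partitions(tag)
print(ddf.compute())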
0 votes · 1 answer · 84 views
Subset/Slice dask array using boolean array of lower dimension
Using numpy I can subset a 3D array using a 2D 'mask'. The same returns an IndexError with dask arrays. Is there any way to reproduce that numpy behaviour below using dask?
import numpy as np
import ...
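One workaround (a sketch under the assumption that the mask covers the first two axes) is to flatten those axes and use da.compress, which reproduces numpy's a[mask]:
import numpy as np
import dask.array as da
a = da.random.random((4, 5, 6), chunks=(2, 5, 6))
mask = np.random.rand(4, 5) > 0.5   # 2D boolean mask, held in numpy
flat = a.reshape(-1, a.shape[-1])   # merge the two masked axes
out = da.compress(mask.ravel(), flat, axis=0)
print(out.compute().shape)          # (mask.sum(), 6), same as numpy's a[mask]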
1 vote · 0 answers · 62 views
Recombine arrays obtained from subsetting on some of the dimensions of original array
I have a 3-dim array, which I subset based on 2 of the 3 dimensions
import dask.array as da
import numpy as np
np.random.seed(40)
test_arr = np.random.normal(size=(2,3,4))
array([[[-0.6075477 , -0....
1 vote · 0 answers · 50 views
Write huge xarray dataset physically reorganized by `MultiIndex` to disk
When collapsing xarray dimensions into a MultiIndex, only the index is changed, leaving the underlying data as is.
This new data organisation can then be reflected in the underlying memory the data ...
1 vote · 1 answer · 112 views
Populating large matrix with values
I have a 100K by 12 by 100K matrix that I need to populate with computation results. I tried creating it using numpy.empty but got a memory error.
So I turned to dask instead. I'm able to create the ...
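A sketch of one way to populate such an array without ever materialising it: fill each block from its global offsets with map_blocks(block_info=...) and stream the result to zarr (the fill rule, chunk sizes and output path are all assumptions):
import dask.array as da
def fill_block(block, block_info=None):
    (i0, i1), (j0, j1), (k0, k1) = block_info[None]["array-location"]
    # stand-in computation; a real version would use the global offsets i0, j0, k0
    return block + i0 + j0 + k0
out = da.zeros((100_000, 12, 100_000), chunks=(1_000, 12, 1_000), dtype="f4")
out = out.map_blocks(fill_block, dtype="f4")
out.to_zarr("result.zarr")  # writes block-by-block to disk (needs the zarr package)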
0 votes · 0 answers · 156 views
Error adding a new column to a dask_cudf dataframe from a 2-D numpy.ndarray
I would like to assign a new column to a dask_cudf dataframe from a Jupyter notebook.
The new column is a 2-dimensional numpy.ndarray.
My code:
import cudf
import dask_cudf
import numpy as np
from random ...
1 vote · 1 answer · 56 views
numpy array.all() solution for multidimensional array where array.all(axis=1).all(axis=1) gives desired result
I have a multidimensional NumPy-like array (I'm using Dask, but this applies to NumPy as Dask mimics that API) that derives from an array of 1592 images:
a:
array([[[ True, True, True, ..., True, ...
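Both numpy and dask accept a tuple of axes, so the chained call collapses to one reduction:
import numpy as np
import dask.array as da
a = da.from_array(np.random.rand(1592, 4, 4) > 0.1, chunks=(100, 4, 4))
per_image = a.all(axis=(1, 2))    # same result as a.all(axis=1).all(axis=1)
print(per_image.compute().shape)  # (1592,)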
0 votes · 1 answer · 610 views
When is xarray's `xr.apply_ufunc(...dask='parallelized')` fast?
I open data from the ERA5 Google Cloud Zarr archive. I do some refactoring (change time resolution, select Northern Hemisphere only, etc.), where the operations are applied on dask data.
This is how ...
1 vote · 1 answer · 276 views
What's the best approach to extend memmap'ed Numpy or Dask arrays (bigger than available ram)?
I have a Numpy array on disk, bigger than my available ram.
I can load it as a memory-map and use it without problem:
a = np.memmap(filename, mode='r', shape=shape, dtype=dtype)
Further on, I can ...
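One option is to wrap each on-disk file as a lazy dask source and concatenate, so the combined array never has to fit in RAM (a sketch; the file names, shapes and dtype are assumptions):
import numpy as np
import dask.array as da
a = np.memmap("part1.dat", mode="r", shape=(100_000, 128), dtype="f8")
b = np.memmap("part2.dat", mode="r", shape=(50_000, 128), dtype="f8")
stacked = da.concatenate(
    [da.from_array(a, chunks=(10_000, 128)), da.from_array(b, chunks=(10_000, 128))],
    axis=0,
)  # behaves like one (150_000, 128) array, read chunk-by-chunk on demand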
6 votes · 1 answer · 178 views
Computing a norm in a loop slows down the computation with Dask
I was trying to implement a conjugate gradient algorithm using Dask (for didactic purposes) when I realized that the performance was far worse than a simple numpy implementation.
After a few ...
1 vote · 0 answers · 35 views
How do I use Numpy views when doing scalable statistics bootstrapping
I have a large dataset that I process using xarray+dask for scalability. These libraries work great for all of my calculations, except for one. The final step is to perform some statistics ...
0 votes · 0 answers · 88 views
Runtime warning when plotting
When I tried to plot some data using the following line of code: ndvi.mean(['x', 'y']).plot.line('b-^', figsize=(11,4))
I got a lot of warnings, like: RuntimeWarning
Here is my config:
Windows
Python ...
1 vote · 2 answers · 913 views
Dask/pandas apply function and return multiple rows
I'm trying to return a dataframe from the dask map_partitions function. The example code I provided returns a 2-row dataframe in the function. However, only 1 row is shown in the end result, which is ...
1 vote · 0 answers · 262 views
Dask explode function similar to pandas
I have a dataframe that looks like this
import numpy as np
import pandas as pd
df = pd.DataFrame({"ph_number" : ['1234','2345','1234','1234','2345','1234','2345'],
"...
0 votes · 1 answer · 80 views
fastest way to fit a quadratic polynomial to rolling window values using dask?
I have a large dataset of 36k x 3k (rows, columns). I want to fit a quadratic polynomial to the values of a 1D rolling window (size=n) centered at each value along every column. I know this is a very ...
2 votes · 0 answers · 923 views
How to generate a correlation matrix of a dataset with a large file size in Python?
I'm trying to generate a correlation matrix based on gene expression levels. I have a dataset that has Gene name on the columns and individual experiments on the rows with expression levels in the ...
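If the file can be read as a dask dataframe, corr() computes the pairwise Pearson matrix out of core (a sketch; the file pattern and the layout with genes as columns are assumptions):
import dask.dataframe as dd
ddf = dd.read_csv("expression_*.csv")  # hypothetical file(s)
corr = ddf.corr().compute()            # genes x genes correlation matrix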
0 votes · 0 answers · 209 views
Cannot start dask client
When I try to initiate a dask distributed cluster with:
from dask.distributed import Client, progress
client = Client(threads_per_worker=1, n_workers=2)
client
I get the following error:
...
-1 votes · 3 answers · 38 views
Find out which rows of one 2D numpy array are represented in another 2D numpy array
I have two arrays:
a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     [10, 11, 12],
     [13, 14, 15]]
b = [[1, 2, 3],
     [4, 5, 6],
     [13, 14, 15]]
And I want to find out which rows of ...
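A pure-numpy way is to broadcast-compare every row of a against every row of b:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
b = np.array([[1, 2, 3], [4, 5, 6], [13, 14, 15]])
mask = (a[:, None, :] == b[None, :, :]).all(axis=2).any(axis=1)
print(mask)     # [ True  True False False  True]
print(a[mask])  # the rows of a that also appear in b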
1 vote · 0 answers · 291 views
Converting a dask array using np.asarray is very slow
I have a folder of about 150 large images (40000x30000x3) ~400MB each, and I want to validate ROIs from an imaging analysis. I was looking to store the file information in a dask array and then index ...
0 votes · 2 answers · 896 views
Dask slower than numpy with one chunk
I am a new dask user and I'm trying to run the function dot inside my program. I noticed that the function dot of dask is slower than its numpy version even when I use only one chunk in the whole ...
0 votes · 1 answer · 732 views
How can I handle large data in memory using python?
I have a data set that is larger than my memory. In general, I have to loop through 350 points and each point is a data set of about 80 GB in size. Usually I get around this by just dealing with one file ...
3 votes · 4 answers · 686 views
How to efficiently read the array columns in a TSV file into a single npz file for each column?
I've a data file that looks like this:
58f0965a62d62099f5c0771d35dbc218 0.868632614612579 [0.028979932889342308, 0.004080114420503378, 0.03757167607545853] [-0.006008833646774292, -...
1 vote · 0 answers · 339 views
Get size of dask array without eager loading it
With certain processes, dask does not know the output array size/shape a priori. Is there a way to access this value without computing the entire array in memory?
See for instance:
import dask.array ...
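When shapes come out as NaN (e.g. after boolean indexing), compute_chunk_sizes() resolves them; it does execute the graph once, but only the per-chunk lengths are kept, so the full array never sits in memory at once:
import dask.array as da
x = da.random.random((10_000,), chunks=1_000)
y = x[x > 0.5]           # data-dependent result: shape becomes (nan,)
print(y.shape)
y.compute_chunk_sizes()  # fills in the chunk sizes in place
print(y.shape)           # now a concrete length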
2 votes · 2 answers · 688 views
Big data dataframe from an on-disk mem-mapped binary struct format with python, pandas, dask, numpy
I have timeseries data in sequential (packed C-struct) format in very large files. Each structure contains K fields of different types in some order. The file is essentially an array of these ...
2 votes · 2 answers · 2k views
How to use Dask.Array.From_Zarr to open a zarr file on Dask?
I'm having quite a problem when converting a zarr file to a dask array. This is what I get when I type arr = da.from_zarr('gros.zarr/time'):
but when I try it on one coordinate, such as time, it works:
...
2 votes · 1 answer · 319 views
Use dask for an out-of-core conversion of itertools.product into a numpy/dask array (create a matrix of every permutation with repetition)
I am looking to create a matrix (numpy array of numpy arrays) of every permutation with repetition (I want to use it for matrix multiplication later on). Currently the way I am doing it, I first ...
3 votes · 2 answers · 1k views
Applying a function to each timestep in an xarray.Dataset, and return lazy Dask array outputs
I have an xarray.Dataset with two 1D variables sun_azimuth and sun_elevation with multiple timesteps along the time dimension:
import xarray as xr
import numpy as np
ds = xr.Dataset(
data_vars={
...
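For an elementwise per-timestep function, xr.apply_ufunc with dask='parallelized' keeps the output lazy (a sketch; per_timestep is a stand-in for the real computation):
import numpy as np
import xarray as xr
ds = xr.Dataset(
    data_vars={
        "sun_azimuth": ("time", np.linspace(0, 360, 5)),
        "sun_elevation": ("time", np.linspace(10, 60, 5)),
    },
    coords={"time": np.arange(5)},
).chunk({"time": 1})
def per_timestep(azimuth, elevation):
    # stand-in: one output value per (azimuth, elevation) pair
    return np.cos(np.deg2rad(azimuth)) * np.sin(np.deg2rad(elevation))
result = xr.apply_ufunc(
    per_timestep,
    ds["sun_azimuth"], ds["sun_elevation"],
    dask="parallelized",
    output_dtypes=[float],
)  # still a lazy dask-backed DataArray until .compute()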
2
votes
1
answer
469
views
Looping through Dask array made of npy memmap files increases RAM without ever freeing it
Context
I am trying to load multiple .npy files containing 2D arrays into one big 2D array to process it by chunk later. All of this data is bigger than my RAM, so I am using the memmap storage/loading ...
0 votes · 1 answer · 200 views
Dask map_blocks is running early with a bad result for overlap and nested procedures
I'm using Dask to create a simple data-manipulation pipeline. I'm basically using 3 functions. The first two use a simple map_blocks, and the third one also uses map_blocks but for an overlapped ...
1 vote · 0 answers · 249 views
OSError while computing a large amount of data with dask
I have a large amount of data (*.grib) that I load using xarray and dask.
To make it simple, my data is a record of world temperature during the month of January 2022. Data are collected multiple time ...
1 vote · 1 answer · 673 views
Creating a new column in dask (arrays, list)
What would be the equivalent of transforming this to a dask format
df['x'] = np.where(df['y'].isin(a_list), 'yes', 'no')
The df will be a dask dataframe with n partitions and a_list is just a list ...
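One dask translation applies the same np.where per partition (a sketch with invented sample data):
import numpy as np
import pandas as pd
import dask.dataframe as dd
a_list = ["a", "b"]
df = dd.from_pandas(pd.DataFrame({"y": ["a", "c", "b", "d"]}), npartitions=2)
df["x"] = df["y"].map_partitions(
    lambda s: pd.Series(np.where(s.isin(a_list), "yes", "no"), index=s.index)
)
print(df.compute())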