
All Questions

0 votes
0 answers
23 views

Processing the same array, dask.array is much slower than numpy.array

import BWStest as bws import numpy as np from skimage.measure import label import dask.array from tqdm import tqdm CalWin = [7,25] stack = [] threshold = 0.05 for i in range(5): image = np.random....
user19298695
0 votes
0 answers
40 views

Why is joblib's Parallel/delayed faster than dask's map_blocks and compute()?

This question is possibly related to this one. I have a 4D numpy array and would like to apply a function to each 2D slice across the first two dimensions. I have implemented the analysis for both dask ...
Johannes Wiesner
0 votes
0 answers
65 views

How to apply a function to each 2D slice of a 4D Numpy array in parallel with Dask without running out of RAM?

I want to apply a function to each 2D slice of a 4D Numpy array using Dask. The output should be a 2D matrix (the function applied to each 2D slice returns a single value). I would like to do this in ...
Johannes Wiesner
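A sketch of one common pattern for this (the shapes and the reduction are illustrative, not from the question): chunk the array so each block is exactly one 2D slice, then let map_blocks drop the reduced axes.

    import dask.array as da

    # illustrative 4D array, chunked so each block is one full 2D slice
    x = da.random.random((10, 8, 256, 256), chunks=(1, 1, 256, 256))

    def reduce_slice(block):
        # block has shape (1, 1, 256, 256); reduce the last two axes
        return block.mean(axis=(2, 3))  # shape (1, 1)

    out = x.map_blocks(reduce_slice, drop_axis=(2, 3), dtype=float)
    result = out.compute()  # 2D result of shape (10, 8)

Because each block is computed independently, peak memory stays near the size of a single slice.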
0 votes
1 answer
88 views

Optimizing Pandas GroupBy and Aggregation on Large Datasets with Multiple Custom Functions

I'm working with a large Pandas dataframe (about 30.5 million rows) where I need to group by multiple columns and apply different custom aggregation functions. However, the performance is currently a ...
Paras • 17
1 vote
0 answers
70 views

Working with larger-than-memory data in numpy

I am working on a project that involves larger-than-memory 3-dimensional numpy arrays. The project will be deployed with AWS Lambda. I am faced with two design choices: a) Re-write large parts of the ...
Femi King
-1 votes
1 answer
47 views

Is there an efficient way to update/replace a specific value of a dask array in Python?

So I have a dask array of integers (1 x 8192) and I want to find an efficient way to replace a specific value. This is the code I am currently using, which is very slow, because dask arrays are immutable, so ...
Illuminator
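One lazy, copy-based alternative (a sketch, with made-up values): build the updated array with da.where instead of mutating in place.

    import numpy as np
    import dask.array as da

    arr = da.from_array(np.tile([0, 5, 5, 2], 2048), chunks=1024)  # illustrative 8192-element data
    old_value, new_value = 5, 99
    # da.where builds a new lazy array; nothing is mutated
    arr = da.where(arr == old_value, new_value, arr)
    print(arr.compute()[:8])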
2 votes
2 answers
240 views

How to handle large xarray/dask datasets to minimize computation time and avoid running out of memory when storing as yearly files (ERA5 dataset)

Currently, I am using ERA5-land data to calculate wind related variables. While I am able to calculate what I want, I do struggle with an efficient implementation to lift this heavy data in a feasible ...
Dominik N.
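A common pattern for yearly outputs with xarray (a sketch; the file and variable names are hypothetical): group the lazy dataset by year and write each group separately, so only one year is computed at a time.

    import xarray as xr

    ds = xr.open_dataset("era5_land.nc", chunks={"time": 24})  # hypothetical lazily-loaded input
    for year, ds_year in ds.groupby("time.year"):
        # each to_netcdf call computes and writes just one year of data
        ds_year.to_netcdf(f"wind_vars_{year}.nc")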
0 votes
0 answers
28 views

Read a parquet file in dask and convert it to the correct numpy shape

I am reading a parquet file in dask and trying to reshape it to how I want it, but it seems rather impossible (I am quite new to dask too). So, I have a parquet file which has some 8M x 384d numpy ...
JohnJ • 7,056
0 votes
0 answers
83 views

How to speed up interpolation in dask

I have a piece of code that performs interpolation on a large number of arrays. This is extremely quick with numpy, but: The data the code will work with in reality will often not fit in memory ...
abinitio • 805
0 votes
0 answers
54 views

How do I use numpy's interpolation with xarray?

I have some code that does a linear interpolation on data. In the case where the data is numpy arrays it works as I would expect. amplitude = np.interp(reflected_times, times, trace) gives (zoomed in ...
abinitio • 805
0 votes
0 answers
28 views

Dask query on date columns

I am trying to filter a huge dask.DataFrame (~800k rows and 30 cols). I want to use dask's query function. start_date = np.datetime64(start_date, 'ns') end_date = np.datetime64(end_date, '...
jotierm • 16
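A sketch of two equivalent filters, assuming a datetime64[ns] column named date on a dask DataFrame ddf:

    import numpy as np

    start_date = np.datetime64("2020-01-01", "ns")
    end_date = np.datetime64("2020-12-31", "ns")

    # plain boolean mask
    out = ddf[(ddf["date"] >= start_date) & (ddf["date"] <= end_date)]

    # query version; local_dict makes @start_date / @end_date visible in the expression
    out = ddf.query("date >= @start_date and date <= @end_date",
                    local_dict={"start_date": start_date, "end_date": end_date})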
1 vote
1 answer
52 views

How to sum radially in a dask array?

I am trying to radially sum the values of a dask array where I retain the chunked data and sum them for each radius. It may be useful to also normalize the sum to the total number of "pixels" ...
Max Neveau
1 vote
1 answer
52 views

Why does dask show a smaller size than the actual size of the data (numpy array)?

Dask shows a slightly smaller size than the actual size of a numpy array. Here is an example of a numpy array that is exactly 32 MB: import dask as da import dask.array import numpy as np shape = (1000,...
Ress • 780
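The gap is usually just decimal vs. binary units: numpy's nbytes counts plain bytes (32 MB = 32 x 10^6), while dask's repr typically displays mebibytes (2^20 bytes). A quick check:

    nbytes = 32_000_000      # an array of exactly 32 MB
    print(nbytes / 1e6)      # 32.0   -> MB  (decimal megabytes)
    print(nbytes / 2**20)    # ~30.52 -> MiB (binary mebibytes), the slightly smaller number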
1 vote
0 answers
111 views

Load, process and save larger-than-memory array using dask

I have a very large covariance matrix (480,000 x 480,000) stored on disk in a binary format. I want to compute a corresponding whitening matrix, for which I need to compute the SVD of the covariance ...
smashwhat
2 votes
0 answers
53 views

Maximum value composite in xarray with dask

I'm trying to perform temporal compositing on multivariate data cubes. The idea, illustrated by the reprex below, is, for each temporal aggregation (input data have a frequency of 5 days, to be ...
Loïc Dutrieux
0 votes
0 answers
337 views

Typing when passing xarray DataArray objects to numpy ufuncs

I have a function with type annotations that looks like the following: import xarray as xr import numpy as np def compute_relative_azimuth(sat_azi: xr.DataArray, sun_azi: xr.DataArray) -> xr....
djhoese • 3,647
0 votes
1 answer
108 views

Appending and inserting with Dask arrays gives a mismatch between chunks and shape

I have a Dask array with shape (1001,256,1,256) (data over 1001 timesteps with len(x)=len(z)=256 and len(y)=1). I need to pad the x-dimension with arrays of shape (1001,2,1,256), which I'm attempting ...
Dave • 420
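A concatenation sketch (chunk sizes are illustrative): give the pad blocks explicit chunks that agree with the main array on every other axis, so chunks and shape stay consistent.

    import dask.array as da

    x = da.zeros((1001, 256, 1, 256), chunks=(1001, 64, 1, 256))
    pad = da.zeros((1001, 2, 1, 256), chunks=(1001, 2, 1, 256))

    # pad the x-dimension (axis=1); all other axes match exactly
    padded = da.concatenate([pad, x, pad], axis=1)
    print(padded.shape)  # (1001, 260, 1, 256)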
0 votes
1 answer
40 views

How to use Dask to parallelize iterating and updating numpy array

I have an extremely large distance matrix that I need to iterate through each value and update the distance if a condition is true. Here is my Pandas/Numpy code chunk: dist_mat = pd.read_csv() ...
Matthew • 23
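Before parallelizing, the loop can often be removed entirely; a vectorized sketch with np.where (the condition and update rule are placeholders, not from the question):

    import numpy as np

    dist_mat = np.random.random((1000, 1000))  # stand-in for the real distance matrix
    threshold = 0.5                            # hypothetical condition
    # update every entry where the condition holds in one vectorized pass
    dist_mat = np.where(dist_mat > threshold, dist_mat * 2, dist_mat)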
3 votes
1 answer
70 views

TypeError when running compute that includes map_blocks and reduce

I am having difficulty diagnosing the cause of the error. My code involves running a convolution (with map_blocks) over some arrays if they belong to the same group of variables, otherwise just record ...
matsuo_basho • 3,010
0 votes
1 answer
36 views

How to label in a dask dataframe with multiple conditions?

In pandas, I can tag a new column with multiple conditions on different columns using np.where(), for example: import pandas as pd df = pd.DataFrame({'Name':['A','B','C'], 'Sex':...
EZGAME_Herry
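A sketch using map_partitions, so the familiar pandas/np.where logic runs per partition (the column names and conditions here are illustrative):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Sex': ['M', 'F', 'M']})
    ddf = dd.from_pandas(df, npartitions=2)

    def add_label(part):
        # ordinary pandas code, applied to one partition at a time
        part = part.copy()
        part['Label'] = np.where(part['Sex'] == 'M', 'boy', 'girl')
        return part

    ddf = ddf.map_partitions(add_label)
    print(ddf.compute())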
0 votes
1 answer
84 views

Subset/Slice dask array using boolean array of lower dimension

Using numpy I can subset a 3D array using a 2D 'mask'. The same returns an IndexError with dask arrays. Is there any way to reproduce that numpy behaviour below using dask? import numpy as np import ...
Loïc Dutrieux
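One workaround (a sketch with illustrative shapes): collapse the masked dimensions with reshape, then use 1D boolean indexing, which dask supports along a single axis.

    import numpy as np
    import dask.array as da

    arr = da.random.random((4, 5, 3), chunks=(2, 5, 3))
    mask = np.random.random((4, 5)) > 0.5      # 2D mask over the first two dims

    flat = arr.reshape((-1, arr.shape[2]))     # merge the masked dims: (20, 3)
    selected = flat[mask.ravel()]              # rows where the mask is True
    print(selected.compute().shape)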
1 vote
0 answers
62 views

Recombine arrays obtained from subsetting on some of the dimensions of the original array

I have a 3-dim array, which I subset based on 2 of the 3 dimensions import dask.array as da import numpy as np np.random.seed(40) test_arr = np.random.normal(size=(2,3,4)) array([[[-0.6075477 , -0....
matsuo_basho • 3,010
1 vote
0 answers
50 views

Write huge xarray dataset physically reorganized by `MultiIndex` to disk

When collapsing xarray dimensions into a MultiIndex, only the index is changed, leaving the underlying data as is. This new data organisation can then be reflected in the underlying memory the data ...
Post Self • 1,558
1 vote
1 answer
112 views

Populating large matrix with values

I have a 100K by 12 by 100K matrix that I need to populate with computation results. I tried creating it using numpy.empty but got a memory error. So I turned to dask instead. I'm able to create the ...
matsuo_basho • 3,010
0 votes
0 answers
156 views

Error when adding a new column to a dask-cudf dataframe from a 2-D numpy.ndarray

I would like to assign a new column to a dask-cudf dataframe from a Jupyter notebook. The new column is a 2-dimensional numpy.ndarray. My code: import cudf import dask_cudf import numpy as np from random ...
mtnt • 31
1 vote
1 answer
56 views

numpy array.all() solution for multidimensional array where array.all(axis=1).all(axis=1) gives desired result

I have a multidimensional NumPy-like array (I'm using Dask, but this applies to NumPy too, as Dask mimics that API) that derives from an array of 1592 images: a: array([[[ True, True, True, ..., True, ...
Dave • 420
0 votes
1 answer
610 views

When is xarray's `xr.apply_ufunc(...dask='parallelized')` fast?

I open data from the ERA5 Google Cloud Zarr archive. I do some refactoring (change time resolution, select Northern Hemisphere only, etc.), where the operations are applied on dask data. This is how ...
jspaeth • 335
1 vote
1 answer
276 views

What's the best approach to extend memmap'ed Numpy or Dask arrays (bigger than available ram)?

I have a Numpy array on disk, bigger than my available ram. I can load it as a memory-map and use it without problem: a = np.memmap(filename, mode='r', shape=shape, dtype=dtype) Further on, I can ...
Pawel • 1,366
6 votes
1 answer
178 views

Computing a norm in a loop slows down the computation with Dask

I was trying to implement a conjugate gradient algorithm using Dask (for didactic purposes) when I realized that the performance was far worse than a simple numpy implementation. After a few ...
SteP • 164
1 vote
0 answers
35 views

How do I use Numpy views when doing scalable statistics bootstrapping

I have a large dataset that I process using xarray+dask for scalability. These libraries work great for all of my calculations, except for one. The final step is to perform some statistics ...
krokosik • 147
0 votes
0 answers
88 views

Runtime warning when plotting

When I tried to plot some data using the following line of code: ndvi.mean(['x', 'y']).plot.line('b-^', figsize=(11,4)) I got a lot of warnings like RuntimeWarning. Here is my config: Windows, Python ...
maria • 1
1 vote
2 answers
913 views

Dask/pandas apply function and return multiple rows

I'm trying to return a dataframe from the dask map_partitions function. The example code I provided returns a 2-row dataframe in the function. However, only 1 row is shown in the end result, which is ...
Sam • 358
1 vote
0 answers
262 views

Dask explode function similar to pandas

I have a dataframe that looks like this import numpy as np import pandas as pd df = pd.DataFrame({"ph_number" : ['1234','2345','1234','1234','2345','1234','2345'], "...
Aayush Gupta
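Where the installed dask version lacks DataFrame.explode, a map_partitions sketch can delegate to pandas' explode (newer dask releases also expose it directly); the column names here are illustrative:

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'ph_number': ['1234', '2345'],
                       'tags': [['a', 'b'], ['c']]})
    ddf = dd.from_pandas(df, npartitions=2)

    # run pandas' explode on each partition; partitions may grow in length
    exploded = ddf.map_partitions(lambda part: part.explode('tags'))
    print(exploded.compute())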
0 votes
1 answer
80 views

Fastest way to fit a quadratic polynomial to rolling window values using dask?

I have a large dataset of 36k x 3k (rows, columns). I want to fit a quadratic polynomial to the values of a 1D rolling window (size=n) centered at each value along every column. I know this is a very ...
Marc • 140
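A vectorized NumPy sketch for a single column (the window size is a placeholder): build every window with sliding_window_view and let np.polyfit fit them all at once, since it accepts a 2D y.

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    col = np.random.random(36_000)          # one column of the dataset
    n = 11                                  # hypothetical window size
    windows = sliding_window_view(col, n)   # shape (len(col) - n + 1, n)

    x = np.arange(n)
    # polyfit fits one quadratic per column of y, so pass the windows transposed
    coeffs = np.polyfit(x, windows.T, deg=2)  # shape (3, len(col) - n + 1)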
2 votes
0 answers
923 views

How to generate a correlation matrix of a dataset with a large file size in Python?

I'm trying to generate a correlation matrix based on gene expression levels. I have a dataset that has Gene name on the columns and individual experiments on the rows with expression levels in the ...
Michael • 147
0 votes
0 answers
209 views

Cannot start dask client

When I try to initiate a dask distributed cluster with: from dask.distributed import Client, progress client = Client(threads_per_worker=1, n_workers=2) client I get the following error: ...
dbschwartz
-1 votes
3 answers
38 views

Find out which rows of one 2D numpy array are represented in another 2D numpy array

I have two arrays : a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]] b = [[1, 2, 3], [4, 5, 6], [13, 14, 15]] And I want to find out which rows of ...
Ali Silberman
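A broadcasting sketch that marks which rows of a also appear in b:

    import numpy as np

    a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
    b = np.array([[1, 2, 3], [4, 5, 6], [13, 14, 15]])

    # compare every row of a against every row of b, then reduce
    mask = (a[:, None, :] == b[None, :, :]).all(axis=2).any(axis=1)
    print(mask)  # [ True  True False False  True]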
1 vote
0 answers
291 views

Converting a dask array using np.asarray is very slow

I have a folder of about 150 large images (40000x30000x3), ~400MB each, and I want to validate ROIs from an imaging analysis. I was looking to store the file information in a dask array and then index ...
nal225 • 11
0 votes
2 answers
896 views

Dask slower than numpy with one chunk

I am a new dask user and I'm trying to run the dot function inside my program. I noticed that dask's dot function is slower than its numpy version even when I use only one chunk in the whole ...
Amed • 3
0 votes
1 answer
732 views

How can I handle large data in memory using python?

I have a data set that is larger than my memory. In general, I have to loop through 350 points and each point is a data set of about 80 GB in size. Usually I get around this by just dealing with one file ...
jsp • 173
3 votes
4 answers
686 views

How to efficiently read the array columns in a tsv file into a single npz file for each column?

I've a data file that looks like this: 58f0965a62d62099f5c0771d35dbc218 0.868632614612579 [0.028979932889342308, 0.004080114420503378, 0.03757167607545853] [-0.006008833646774292, -...
alvas • 122k
1 vote
0 answers
339 views

Get size of dask array without eager loading it

With certain processes, dask does not know the output array size/shape a priori. Is there a way to access this value without computing the entire array in memory? See for instance: import dask.array ...
Loïc Dutrieux
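Dask has a method for exactly this case: compute_chunk_sizes() runs the graph just far enough to learn each chunk's length, without materializing the whole array at once. A sketch with boolean masking, which produces unknown chunk sizes:

    import dask.array as da

    x = da.random.random(100, chunks=10)
    y = x[x > 0.5]               # shape is (nan,) until evaluated
    y.compute_chunk_sizes()      # resolves the chunk sizes in place
    print(y.shape, y.chunks)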
2 votes
2 answers
688 views

big data dataframe from an on-disk mem-mapped binary struct format from python, pandas, dask, numpy

I have timeseries data in sequential (packed C-struct) format in very large files. Each structure contains K fields of different types in some order. The file is essentially an array of these ...
Jonathan Shore
2 votes
2 answers
2k views

How to use dask.array.from_zarr to open a zarr file with Dask?

I'm having quite a problem when converting a zarr file to a dask array. This is what I get when I type arr = da.from_zarr('gros.zarr/time'): but when I try it on one coordinate, such as time, it works: ...
Severus • 35
2 votes
1 answer
319 views

Use dask for an out-of-core conversion of itertools.product into a numpy/dask array (create a matrix of every permutation with repetition)

I am looking to create a matrix (numpy array of numpy arrays) of every permutation with repetition (I want to use it for matrix multiplication later on). Currently the way I am doing it, I first ...
Ivan • 73
3 votes
2 answers
1k views

Applying a function to each timestep in an xarray.Dataset, and return lazy Dask array outputs

I have an xarray.Dataset with two 1D variables sun_azimuth and sun_elevation with multiple timesteps along the time dimension: import xarray as xr import numpy as np ds = xr.Dataset( data_vars={ ...
Robbi Bishop-Taylor
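A sketch with xr.apply_ufunc, vectorize=True, and dask='parallelized', which keeps the output lazy (the per-timestep function here is a stand-in for the real computation):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset(
        data_vars={'sun_azimuth': ('time', np.random.random(5)),
                   'sun_elevation': ('time', np.random.random(5))},
    ).chunk({'time': 1})

    def per_timestep(azimuth, elevation):
        # placeholder scalar computation for one timestep
        return azimuth + elevation

    out = xr.apply_ufunc(
        per_timestep,
        ds['sun_azimuth'], ds['sun_elevation'],
        vectorize=True,          # apply the scalar function over the time dimension
        dask='parallelized',     # one task per chunk, nothing computed yet
        output_dtypes=[float],
    )
    print(out)  # still a lazy dask-backed DataArray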
2 votes
1 answer
469 views

Looping through Dask array made of npy memmap files increases RAM without ever freeing it

Context: I am trying to load multiple .npy files containing 2D arrays into one big 2D array to process it by chunk later. All of this data is bigger than my RAM, so I am using the memmap storage/loading ...
Tom Moritz
0 votes
1 answer
200 views

Dask map_blocks runs too early and gives a bad result with overlapped and nested procedures

I'm using Dask to create a simple pipeline of data manipulation. I'm basically using 3 functions. The first two use a simple map_blocks, and the third one also uses map_blocks but for an overlapped ...
jcfaracco • 894
1 vote
0 answers
249 views

OSError while computing large amount of data with dask

I have a large amount of data (*.grib) that I load using xarray and dask. To make it simple, my data is a record of world temperature during the month of January 2022. Data are collected multiple times ...
Romain • 44
1 vote
1 answer
673 views

Creating a new column in dask (arrays, list)

What would be the equivalent of transforming this to a dask format: df['x'] = np.where(df['y'].isin(a_list), 'yes', 'no') The df will be a dask dataframe with n partitions, and a_list is just a list ...
waithira • 340
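A short sketch: isin works directly on a dask Series, and map can turn the booleans into labels (a_list and the column names follow the excerpt; the values are made up):

    import pandas as pd
    import dask.dataframe as dd

    a_list = ['a', 'b']  # illustrative membership list
    ddf = dd.from_pandas(pd.DataFrame({'y': ['a', 'c', 'b', 'd']}), npartitions=2)

    # isin gives a lazy boolean Series; map converts it to 'yes'/'no'
    ddf['x'] = ddf['y'].isin(a_list).map({True: 'yes', False: 'no'},
                                         meta=('x', 'object'))
    print(ddf.compute())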
