Newest 'dask' Questions

0 votes

0 answers

6 views

Initializing a local cluster in Dask takes forever

I'm trying out some things with Dask for the first time, and while I had it running a few weeks ago, I now find that I can't get the LocalCluster initiated. I've cut it off after running 30 minutes at ...

MKJ

308

asked 16 hours ago

0 votes

1 answer

28 views

Pandas read_excel with nrows and skiprows and lazy-loading?

I am searching for ways to read an .xlsx file as chunks of dataframes, instead of loading the whole thing into memory. What exactly happens when I pd.read_excel(nrows, skiprows, usecols) ? Is the ...

zensei

1

asked 23 hours ago

0 votes

0 answers

28 views

dask_cuda problem with Local CUDA Cluster

I am trying to get this code to work and then use it to train various models on two GPUs: from dask_cuda import LocalCUDACluster from dask.distributed import Client if __name__ == "__main__"...

Danilo Caputo

1

asked yesterday

0 votes

0 answers

19 views

Finding a way to iterate using the input of two xarray dataarrays when chunked

I am developing a relatively large model using xarray and therefore want to make use of chunks. Most of my operations run a lot faster when chunked but there is one that keeps running (a lot) slower ...

Rogier Westerhoff

1

asked 2 days ago

1 vote

1 answer

34 views

Sampling a parquet file with only one row_group

I am working on a huge parquet file containing over 30 million rows. I only need a small portion of it and wish to pick a number of randomly selected rows. When I inspect the file's metadata, there's ...

CHADHA ESSID

11

asked Dec 9 at 9:21

0 votes

0 answers

7 views

What is ToParquetBarrier in Dask Dataframe

I'm trying to code some heavy data merging tool and I would like to handle more data than fits into my ram at the same time. It seems like the to_parquet function has a function called "...

cagcoach

675

asked Dec 6 at 10:23

0 votes

0 answers

23 views

Processing the same array, dask.array is too slow compared to numpy.array

import BWStest as bws import numpy as np from skimage.measure import label import dask.array from tqdm import tqdm CalWin = [7,25] stack = [] thershold = 0.05 for i in range(5): image = np.random....

user19298695

1

asked Dec 4 at 11:57

0 votes

1 answer

50 views

Error when using xarray.apply_ufunc on a chunked xarray DataArray

Hello I am trying to sort the data inside some netCDF4 (.nc) files into bins as efficiently as possible. I am currently trying this with xarray and NumPy's digitize function. Since I want to process a ...

Innocuous Rift

65

asked Nov 28 at 16:50

1 vote

1 answer

27 views

How to nest dask.delayed functions within other dask.delayed functions

I am trying to learn dask, and have created the following toy example of a delayed pipeline. +-----+ +-----+ +-----+ | baz +--+ bar +--+ foo | +-----+ +-----+ +-----+ So baz has a dependency on ...

Steve Lorimer

28.5k

asked Nov 27 at 9:54
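
For the delayed-pipeline question above, one possible sketch chains three dask.delayed functions so that baz depends on bar, which in turn depends on foo. The function bodies and values are hypothetical, since the excerpt is truncated and does not show the asker's code. The sketch composes the calls while the graph is being built, rather than calling a delayed function from inside a running task.

```python
from dask import delayed


@delayed
def foo():
    # Hypothetical leaf task with no dependencies.
    return 1


@delayed
def bar(x):
    # Hypothetical middle task; receives foo's computed result.
    return x + 1


@delayed
def baz(y):
    # Hypothetical final task; receives bar's computed result.
    return y * 2


# Nothing runs here: each call only adds a node to the task graph.
pipeline = baz(bar(foo()))

# compute() executes foo, then bar, then baz.
result = pipeline.compute()
```

Passing one delayed object as an argument to another is what records the dependency, so the scheduler sees the whole chain before any work starts.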

0 votes

1 answer

33 views

Dask distributed pause task to wait for subtask - how-to, or bad practice?

I am running tasks using client.submit thus: from dask.distributed import Client, get_client, wait, as_completed # other imports zip_and_upload_futures = [ client.submit(zip_and_upload, id, path, ...

Dave

420

asked Nov 26 at 14:51

0 votes

1 answer

32 views

Storing larger-than-memory xarray dataset to zarr using `dask.delayed` without blowing up memory

I'm trying to use dask to conduct larger-than-memory processing in xarray. Concretely, I'm trying to: Concatenate several NetCDF files (on the same geo grid, same variables) by time Regrid them to a ...

ks905383

180

asked Nov 25 at 22:05

0 votes

0 answers

27 views

Is there a way apart from chunking of Dask to avoid excessive RAM usage due to large dataset?

I am using the following code to calculate some speed-related variables for a dataset that consists of about 200 million rows. In order to avoid Memory issues, I am using chunking. import pandas as pd ...

Themos Papadopoulos

1

asked Nov 23 at 12:51

0 votes

0 answers

40 views

Why is joblib's Parallel delayed faster than dasks map block and compute()

This question is possibly related to this one. I have 4D numpy array and would like to apply a function to each 2D slice across the first two dimensions. I have implemented the analysis for both dask ...

Johannes Wiesner

1,277

asked Nov 22 at 15:39

1 vote

0 answers

20 views

How to make xarray.to_zarr write the same way as dask.array.to_zarr?

I have a 3D xarray DataArray that I've saved to zarr using DataArray.to_zarr(). However, the resulting format is not readable by my tools, in this case napari. The structure of the zarr has one ...

user8188435

361

asked Nov 22 at 8:38

0 votes

0 answers

65 views

How to apply a function to each 2D slice of a 4D Numpy array in parallel with Dask without running out of RAM?

I want to apply a function to each 2D slice of a 4D Numpy array using Dask. The output should be a 2D matrix (the function applied to each 2D slice returns a single value). I would like to do this in ...

Johannes Wiesner

1,277

asked Nov 20 at 11:14

0 votes

0 answers

18 views

Why does xarray.groupby with Dask-backed datasets fail on large data?

I'm working with a large Dask-backed Xarray dataset loaded from a Zarr store. The dataset has dimensions along a dim1 axis and includes a coordinate (group_key) and a data variable (data_var). Both ...

Gary Frewin

555

asked Nov 18 at 17:06

0 votes

0 answers

24 views

How does a parquet read happen in the Modin distributed environment?

I am using Modin to read a parquet file through pandas. I have two clusters with 256 GB of RAM, but whereas duckdb on a single machine takes 50 seconds to do the group by, Modin is going infinite ...

Bhaskar Dabhi

909

asked Nov 18 at 14:21

1 vote

1 answer

252 views

Django + Dask integration: usage and progress?

About performance & best practice Note, the entire code for the question below is public on Github. Feel free to check out the project! https://github.com/b-long/moose-dj-uv/pull/3 I'm trying to ...

blong

2,711

asked Nov 17 at 21:54

0 votes

0 answers

20 views

Dask worker running long calls

The code running on the dask worker calls asyncio.run() and proceeds to execute a series of async calls (on the worker running event_loop) that gather data, and then run a small computation. This ...

Dirich

442

asked Nov 17 at 19:51

0 votes

0 answers

22 views

Dask concat issue with DataFrames of different lengths, pandas works fine

I’m trying to use dask to read at least 12 files of approx. 700 MB each. The files were created in the same way, with only one column and with exactly the same number of lines (about 43 million ...

user28275839

1

asked Nov 13 at 13:13

0 votes

0 answers

22 views

Dask-distributed LocalCluster spinup hangs only in non-interactive execution

I want to start and connect to a dask.distributed LocalCluster from within a python script. However, I get different behavior depending on the way in which I invoke the same lines of code. The ...

Charles Davis

164

asked Nov 12 at 23:12

0 votes

1 answer

33 views

DASK to_csv() problems due to memory

I'm cleaning my text data and afterwards want to save it to csv. The cleaning functions I defined work fine, but the problems start when the to_csv() part comes. Maybe someone has faced similar ...

Rūta Kondrot

1

asked Nov 11 at 10:13

0 votes

1 answer

18 views

Dask, how to drop row with specific value in a variable into lazy computing

I'm trying to find my way around dask for my machine learning project. My data set is too big to handle with Pandas, so I must stay with lazy loading. Here is a small sample to show how it is set up: I ...

Jonathan Roy

431

asked Nov 5 at 21:01

0 votes

0 answers

29 views

Dask running very slow

I am new to using Dask. I have a small piece of code like below. In this code, I am triggering computation via el.compute(). Is this the correct way to use Dask's compute function? My code runs very ...

noobee

57

asked Nov 2 at 5:03

0 votes

0 answers

25 views

Importing SQL Table from Snowflake into Jupyter using Dask

I have an SQL table in Snowflake with 100K rows and 15 columns. I want to import this table into my Jupyter notebook using Dask for further analysis. I am primarily doing this as a form of practice since I am new ...

OscarOz

3

asked Nov 1 at 14:08

0 votes

1 answer

64 views

Understanding if current process is part of a multiprocessing pool with --multiprocessing-fork

I need to find a way for a python process to figure out if it was launched as part of a multiprocessing pool. I am using dask to parallelize calculations, using dask.distributed.LocalCluster. For UX ...

pnjun

151

asked Oct 29 at 16:39
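
For the multiprocessing question above, one possible check using only the standard library. It assumes Python 3.8 or later and that the child processes were started through the multiprocessing module; whether this covers every way a cluster can launch its workers is not confirmed by the excerpt. The helper name is hypothetical.

```python
import multiprocessing


def in_child_process() -> bool:
    """Return True when this process was started by multiprocessing."""
    # parent_process() returns None in the main process and a
    # process handle in children created by multiprocessing.
    return multiprocessing.parent_process() is not None


if __name__ == "__main__":
    # Run directly as a script, this is the main process.
    print(in_child_process())
```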

0 votes

0 answers

21 views

ModuleNotFoundError: no module named ... when submitting job to client with an object from that module

I have a weird issue I am not able to solve. I have created a Local dask client, and I have a preload script that imports local modules. The preload script contains a list of paths, and runs sys.path....

Rezzy

131

asked Oct 29 at 1:41

0 votes

2 answers

79 views

how do I append the output of a dask_cudf apply function to the original dask_cudf?

I am applying a function (e.g. letter frequency) to a dask_cudf dataframe that consists of a single column of words of fixed length. I am trying to merge the output or append the output into the ...

Austin So

1

asked Oct 28 at 16:23

0 votes

1 answer

88 views

Optimizing Pandas GroupBy and Aggregation on Large Datasets with Multiple Custom Functions

I'm working with a large Pandas dataframe (about 30.5 million rows) where I need to group by multiple columns and apply different custom aggregation functions. However, the performance is currently a ...

Paras

17

asked Oct 24 at 6:16

2 votes

0 answers

25 views

How to speed up a dask delayed compute of a large dictionary?

I need to run a random forest classifier that I've put into a function ~ 10,000 times - because I sample randomly each time. I am trying to use dask delayed on a slurm-scheduled HPC cluster. My script ...

Lexical

21

asked Oct 23 at 14:29

1 vote

0 answers

70 views

Working with larger than memory data in numpy

I am working on a project that involves larger than memory numpy 3 dimensional arrays. The project will be deployed with AWS lambda. I am faced with two design choices a) Re-write large parts of the ...

Femi King

21

asked Oct 19 at 18:39

0 votes

1 answer

29 views

Common workflow for using Dask on HPC systems

I’m new to Dask. I’m currently working in an HPC managed by SLURM with some compute nodes (those that execute the jobs) and the login node (which I access through SSH to send the SLURM jobs). I’m ...

Joseph Pena

360

asked Oct 17 at 14:34

0 votes

0 answers

26 views

Run only a single chunk's worth of data without creating a Dask graph of all my chunks?

# Template xarray based on Earth Engine import numpy as np import pandas as pd import dask import xarray as xr # Define the dimensions time = pd.date_range("2020-12-29T18:57:32.281000", ...

Adriano Matos

387

asked Oct 17 at 0:31

1 vote

1 answer

74 views

In python xarray, how do I create and subset lazy variables without loading the whole dataarray?

I am trying to create a python function that opens a remote dataset (in a opendap server) using xarray and automatically creates new variables lazily. A use case would be to calculate magnitude and ...

Marcelo Andrioni

436

asked Oct 11 at 18:56

-1 votes

1 answer

47 views

Is there an efficient way to update / replace a specific value of a dask array in python?

So I have a dask array of integers (1 x 8192) and I want to find an efficient way to replace a specific value. This is the code I am currently using, which is very slow, because dask arrays are immutable, so ...

Illuminator

1

asked Oct 11 at 8:46
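
For the value-replacement question above, a minimal sketch using dask.array.where, which builds a new lazy array instead of mutating the original. The array contents, chunk size, and the old and new values are hypothetical; only the 1 x 8192 shape comes from the question.

```python
import numpy as np
import dask.array as da

# Hypothetical 1 x 8192 integer array, split into chunks of 1024 columns.
x = da.from_array(np.arange(8192).reshape(1, 8192), chunks=(1, 1024))

old_value, new_value = 42, -1  # hypothetical values

# where() is lazy and elementwise: take new_value where the mask is
# True, and keep the original element everywhere else.
replaced = da.where(x == old_value, new_value, x)

result = replaced.compute()
```

This replaces every occurrence of the value in one vectorized pass, so no per-element rebuilding of the array is needed.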

0 votes

0 answers

43 views

How to fix memory errors merging large dask dataframes?

I am trying to read 23 CSV files into dask dataframes, merge them together using dask, and output to parquet. However, it's failing due to memory issues. I used to use pandas to join these together ...

ifightfortheuserz

1

asked Oct 11 at 8:41

2 votes

1 answer

71 views

xarray and dask: efficiently processing a large netcdf file

I am trying to do a simple calculation on a very large netcdf file, and am struggling to speed it up -- probably because I program primarily in julia and R. I think xarray/dask are the best approach ...

j lev

23

asked Oct 10 at 3:29

2 votes

1 answer

124 views

How to accelerate getting points within distance using two DataFrames?

I have two DataFrames (df and locations_df), and both have longitude and latitude values. I'm trying to find the df's points within 2 km of each row of locations_df. I tried to vectorize the function, ...

zxdawn

981

asked Oct 7 at 20:02

0 votes

1 answer

36 views

dask map_partitions strange behaviour

When I create a dask dataframe from pandas with 1 partition, then call map_partitions() on it, it seems to be called twice. If I have 5 partitions, it is called 6 times. In general, the function is ...

Gururaj Deshpande

3

asked Oct 3 at 16:59

0 votes

1 answer

57 views

How do I set the timeout parameter in dask scheduler

I was trying to run dask-distributed to distribute some big computation in a slurm cluster. I was always getting a "TimeoutError: No valid workers found" message (this came from line 6130 in ...

Fidel

37

asked Oct 2 at 12:17

0 votes

0 answers

27 views

How to Serialize and read metadata in Dask Dataframe

I'm trying to pre-compute dask divisions and categories for certain columns in dask dataframe, then save data as new dataframe and reuse it down the road. Issue I encountered is that when I read ...

imachabeli

66

asked Sep 28 at 0:27

0 votes

0 answers

23 views

Does dask worker heartbeat when loaded with tasks?

When dask workers have been assigned some tasks by the scheduler, do the workers send heartbeats to the scheduler, or do they not heartbeat while performing tasks? I am thinking that while working dask-...

Chaitanya Kshirsagar

1

asked Sep 27 at 18:24

0 votes

0 answers

39 views

Null values in log file from Python logging module when used with Dask

I have been trying to set up logging using the logging module in a Python script, and I have got it working properly. It can now log to both the console and a log file. But it fails when I set up a Dask ...

RogUE

363

asked Sep 22 at 10:28

0 votes

0 answers

19 views

Does the OS version need to match for Dask client and scheduler/workers?

I have a Dask cluster running on Fargate/ECS. On an EC2 instance, I am running the client that submits a job (subset and average along particular dimensions on a Zarr dataset) to the cluster. I am ...

dmn

23

asked Sep 21 at 20:57

0 votes

0 answers

74 views

Progress bar on xarray.open_mfdataset?

I'm using xarray.open_mfdataset() to open tens of thousands of (fairly small) netCDF files, and it's taking quite some time. Being of an impatient nature, I would like to at least be assured that ...

Tor

757

asked Sep 19 at 11:03

0 votes

0 answers

18 views

Use case for dask to spread model execution across machines

I'd like to know if this is a good use case for dask and how to implement it. I have a rest API endpoint that will be receiving a POST request with some parameters in a JSON object. We will be ...

Rezzy

131

asked Sep 18 at 12:49

0 votes

0 answers

16 views

Detecting schema inconsistencies in json data with Dask bag

I have spent a few days trying to battle an issue with json data and dask, so I finally resort to asking here. I have a collection of json files that should all have the same data structure containing ...

Whynote

992

asked Sep 18 at 11:56

0 votes

1 answer

50 views

How can I keep Dask workers busy when processing large datasets to prevent them from running out of tasks?

I'm trying to process a large dataset (around 1 million tasks) using Dask distributed computing in Python. (I am getting data from a database to process it, and I am retrieving around 1M rows). Here I ...

Polymood

461

asked Sep 16 at 13:33

0 votes

0 answers

33 views

Memory issues when using Dask

I have just started using 'dask', and I am running into some memory issues. I would like to share my problem along with a dummy code snippet that illustrates what I’ve done. My goal is to read a large ...

jei L

33

asked Sep 14 at 9:54

0 votes

1 answer

56 views

Dask dataframe running parallel and partitioned by columns

I have a dataframe with multiple companies and countries' data that I am trying to transform in parallel using a function. The data takes a format like this but is much larger and with many more ...

AndyMcCammont

45

asked Sep 10 at 16:04
