0 votes · 0 answers · 6 views

Initializing a local cluster in Dask takes forever

I'm trying out some things with Dask for the first time, and while I had it running a few weeks ago, I now find that I can't get the LocalCluster initiated. I've cut it off after running 30 minutes at ...
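
A minimal sketch of the usual fix, assuming the cluster is created at module level: on platforms that use the "spawn" start method the script is re-imported by every worker process, so an unguarded LocalCluster can hang indefinitely.

    from dask.distributed import Client, LocalCluster

    if __name__ == "__main__":  # required with the "spawn" start method
        cluster = LocalCluster(n_workers=2, threads_per_worker=1)
        client = Client(cluster)
        print(client.dashboard_link)
        client.close()
        cluster.close()
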
0 votes · 1 answer · 28 views

Pandas read_excel with nrows and skiprows and lazy-loading?

I am searching for ways to read an .xlsx file as chunks of dataframes, instead of loading the whole thing into memory. What exactly happens when I call pd.read_excel(nrows, skiprows, usecols)? Is the ...
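
For context, a hedged sketch of the skiprows/nrows loop this question is about. Note that pd.read_excel has no chunksize, and with the default openpyxl engine each call still parses the whole sheet, so these arguments limit what is returned, not what is read ("data.xlsx" is a placeholder):

    import pandas as pd

    chunk_size = 10_000
    start = 1  # row 0 is the header
    chunks = []
    while True:
        chunk = pd.read_excel("data.xlsx", skiprows=range(1, start), nrows=chunk_size)
        if chunk.empty:
            break
        chunks.append(chunk)  # or process each chunk here
        start += chunk_size
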
0 votes · 0 answers · 28 views

dask_cuda problem with Local CUDA Cluster

I am trying to get this code to work and then use it to train various models on two GPUs: from dask_cuda import LocalCUDACluster from dask.distributed import Client if __name__ == "__main__"...
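
A sketch of the usual LocalCUDACluster setup (one worker per GPU); the CUDA_VISIBLE_DEVICES argument is shown only as an assumption about the asker's two-GPU machine:

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client

    if __name__ == "__main__":
        cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")  # two GPUs
        client = Client(cluster)
        print(client)
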
0 votes · 0 answers · 19 views

Finding a way to iterate using the input of two xarray dataarrays when chunked

I am developing a relatively large model using xarray and therefore want to make use of chunks. Most of my operations run a lot faster when chunked but there is one that keeps running (a lot) slower ...
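
A minimal sketch of the chunk-friendly alternative to a Python-level loop, assuming an elementwise operation on two aligned DataArrays (all names illustrative): xr.apply_ufunc with dask="parallelized" maps a NumPy function over matching chunks.

    import numpy as np
    import xarray as xr

    a = xr.DataArray(np.random.rand(100, 100), dims=("x", "y")).chunk({"x": 25})
    b = xr.DataArray(np.random.rand(100, 100), dims=("x", "y")).chunk({"x": 25})

    result = xr.apply_ufunc(
        np.hypot,            # any elementwise NumPy function of two arrays
        a, b,
        dask="parallelized",
        output_dtypes=[float],
    )
    result.compute()
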
1 vote · 1 answer · 34 views

Sampling a parquet file with only one row_group

I am working on a huge parquet file containing over 30 million rows. I only need a small portion of it and wish to pick a number of randomly selected rows. When I inspect the file's metadata, there's ...
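
A hedged sketch with pyarrow: a single row group cannot be partially skipped, but reading just that row group and sampling on the pandas side is often cheap enough ("big.parquet" is a placeholder):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("big.parquet")
    print(pf.metadata.num_row_groups)          # inspect the layout
    table = pf.read_row_group(0)               # the file's only row group
    sample = table.to_pandas().sample(n=1000)  # random rows in memory
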
0 votes · 0 answers · 7 views

What is ToParquetBarrier in Dask Dataframe

I'm trying to code some heavy data merging tool and I would like to handle more data than fits into my ram at the same time. It seems like the to_parquet function has a function called "...
0 votes · 0 answers · 23 views

Processing the same array, dask.array is too slow compared to numpy.array

import BWStest as bws
import numpy as np
from skimage.measure import label
import dask.array
from tqdm import tqdm

CalWin = [7, 25]
stack = []
thershold = 0.05
for i in range(5):
    image = np.random....
0 votes · 1 answer · 50 views

Error when using xarray.apply_ufunc on a chunked xarray DataArray

Hello, I am trying to sort the data inside some netCDF4 (.nc) files into bins as efficiently as possible. I am currently trying this with xarray and NumPy's digitize function. Since I want to process a ...
1 vote · 1 answer · 27 views

How to nest dask.delayed functions within other dask.delayed functions

I am trying to learn dask, and have created the following toy example of a delayed pipeline.

    +-----+  +-----+  +-----+
    | baz +--+ bar +--+ foo |
    +-----+  +-----+  +-----+

So baz has a dependency on ...
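
A minimal sketch of that baz/bar/foo pipeline: dependencies are expressed by passing one delayed result into another, not by calling a delayed function inside another delayed function's body.

    import dask

    @dask.delayed
    def foo():
        return 1

    @dask.delayed
    def bar(x):
        return x + 1

    @dask.delayed
    def baz(x):
        return x * 2

    result = baz(bar(foo()))  # builds the graph; nothing runs yet
    print(result.compute())   # 4
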
0 votes · 1 answer · 33 views

Dask distributed pause task to wait for subtask - how-to, or bad practice?

I am running tasks using client.submit thus:

    from dask.distributed import Client, get_client, wait, as_completed
    # other imports
    zip_and_upload_futures = [
        client.submit(zip_and_upload, id, path, ...
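
For reference, a sketch of the secede()/rejoin() pattern dask.distributed provides for exactly this: a task can wait on subtasks without deadlocking the worker's thread pool (the task body here is a stand-in):

    from dask.distributed import get_client, rejoin, secede

    def parent_task(items):
        client = get_client()              # client from inside the worker
        futures = client.map(lambda x: x + 1, items)
        secede()                           # free this worker thread while waiting
        results = client.gather(futures)
        rejoin()                           # reclaim a thread before continuing
        return sum(results)
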
0 votes · 1 answer · 32 views

Storing larger-than-memory xarray dataset to zarr using `dask.delayed` without blowing up memory

I'm trying to use dask to conduct larger-than-memory processing in xarray. Concretely, I'm trying to: concatenate several NetCDF files (on the same geo grid, same variables) by time; regrid them to a ...
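
A hedged sketch of the usual streaming pattern, with placeholder paths: to_zarr(..., compute=False) returns a single delayed object, so the whole store is written in one chunked compute instead of being materialized first.

    import xarray as xr

    ds = xr.open_mfdataset("files/*.nc", combine="by_coords", chunks={"time": 100})
    delayed_write = ds.to_zarr("out.zarr", mode="w", compute=False)
    delayed_write.compute()  # streams chunk by chunk
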
0 votes · 0 answers · 27 views

Is there a way, apart from Dask chunking, to avoid excessive RAM usage with a large dataset?

I am using the following code to calculate some speed-related variables for a dataset that consists of about 200 million rows. In order to avoid memory issues, I am using chunking. import pandas as pd ...
0 votes · 0 answers · 40 views

Why is joblib's Parallel delayed faster than Dask's map_blocks and compute()?

This question is possibly related to this one. I have a 4D numpy array and would like to apply a function to each 2D slice across the first two dimensions. I have implemented the analysis for both dask ...
1 vote · 0 answers · 20 views

How to make xarray.to_zarr write the same way as dask.array.to_zarr?

I have a 3D xarray DataArray that I've saved to zarr using DataArray.to_zarr(). However, the resulting format is not readable by my tools, in this case napari. The structure of the zarr has one ...
0 votes · 0 answers · 65 views

How to apply a function to each 2D slice of a 4D Numpy array in parallel with Dask without running out of RAM?

I want to apply a function to each 2D slice of a 4D Numpy array using Dask. The output should be a 2D matrix (the function applied to each 2D slice returns a single value). I would like to do this in ...
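
A sketch under stated assumptions (a random array standing in for the real data): rechunk so every block is one full 2D slice, then map_blocks reduces each slice to a single value without materializing the 4D array.

    import dask.array as da

    arr = da.random.random((8, 6, 128, 128), chunks=(1, 1, 128, 128))

    def reduce_slice(block):
        # block has shape (1, 1, 128, 128); return a (1, 1) result
        return block.mean(axis=(2, 3))

    out = arr.map_blocks(reduce_slice, drop_axis=[2, 3], dtype=float)
    print(out.compute().shape)  # (8, 6)
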
0 votes · 0 answers · 18 views

Why does xarray.groupby with Dask-backed datasets fail on large data?

I'm working with a large Dask-backed Xarray dataset loaded from a Zarr store. The dataset has dimensions along a dim1 axis and includes a coordinate (group_key) and a data variable (data_var). Both ...
0 votes · 0 answers · 24 views

How does a parquet read happen in the Modin distributed environment?

I am using Modin to read a parquet file through the pandas API. I have two clusters of 256 GB of RAM each, but while DuckDB on a single machine takes 50 seconds to do the group by, Modin runs indefinitely ...
1 vote · 1 answer · 252 views

Django + Dask integration: usage and progress?

About performance & best practice. Note: the entire code for the question below is public on GitHub. Feel free to check out the project! https://github.com/b-long/moose-dj-uv/pull/3 I'm trying to ...
0 votes · 0 answers · 20 views

Dask worker running long calls

The code running on the dask worker calls asyncio.run() and proceeds to execute a series of async calls (on the worker's running event loop) that gather data, and then runs a small computation. This ...
0 votes · 0 answers · 22 views

Dask concat issue with DataFrames of different lengths, pandas works fine

I'm trying to use dask to read at least 12 files of approx. 700 MB each. The files were created in the same way, with only one column and with exactly the same number of lines (about 43 million ...
0 votes · 0 answers · 22 views

Dask-distributed LocalCluster spinup hangs only in non-interactive execution

I want to start and connect to a dask.distributed LocalCluster from within a python script. However, I get different behavior depending on the way in which I invoke the same lines of code. The ...
0 votes · 1 answer · 33 views

DASK to_csv() problems due to memory

I'm cleaning my text data and afterwards want to save it to csv. The cleaning functions I defined work fine, but when the to_csv() part comes, the problems start. Maybe someone has faced similar ...
0 votes · 1 answer · 18 views

Dask: how to drop rows with a specific value in a variable during lazy computation

I'm trying to learn my way around dask for my machine learning project. My data set is too big to work with in Pandas, so I must stay with lazy loading. Here is a small sample to show how it is set up: I ...
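
A minimal sketch: row filtering in Dask looks just like pandas and stays lazy until .compute() (the column name and value below are placeholders):

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({"label": ["a", "b", "drop_me", "a"], "x": range(4)})
    ddf = dd.from_pandas(pdf, npartitions=2)

    filtered = ddf[ddf["label"] != "drop_me"]  # lazy boolean-mask filter
    print(filtered.compute())
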
0 votes · 0 answers · 29 views

Dask running very slow

I am new to using Dask. I have a small piece of code like below. In this code, I am triggering computation with el.compute(). Is this the correct way to use Dask's compute function? My code runs very ...
0 votes · 0 answers · 25 views

Importing SQL Table from Snowflake into Jupyter using Dask

I have a SQL table in Snowflake with 100K rows and 15 columns. I want to import this table into my Jupyter notebook using Dask for further analysis. I'm primarily doing this as a form of practice since I am new ...
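
A hedged sketch using dask.dataframe.read_sql_table, which partitions on an indexed numeric column; the Snowflake SQLAlchemy URI, table name, and column names are all placeholders:

    import dask.dataframe as dd

    uri = "snowflake://user:password@account/database/schema"  # placeholder URI
    ddf = dd.read_sql_table("my_table", uri, index_col="id", npartitions=8)
    print(ddf.head())
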
0 votes · 1 answer · 64 views

Understanding if current process is part of a multiprocessing pool with --multiprocessing-fork

I need to find a way for a python process to figure out if it was launched as part of a multiprocessing pool. I am using dask to parallelize calculations, using dask.distributed.LocalCluster. For UX ...
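
One possible sketch (an assumption, not the canonical answer): processes started by multiprocessing have a non-None parent_process(), while the original process does not.

    import multiprocessing as mp

    def launched_by_multiprocessing() -> bool:
        return mp.parent_process() is not None  # None only in the main process

    if __name__ == "__main__":
        print(launched_by_multiprocessing())  # False here
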
0 votes · 0 answers · 21 views

ModuleNotFoundError: no module named ... when submitting job to client with an object from that module

I have a weird issue I am not able to solve. I have created a local Dask client, and I have a preload script that imports local modules. The preload script contains a list of paths, and runs sys.path....
0 votes · 2 answers · 79 views

how do I append the output of a dask_cudf apply function to the original dask_cudf?

I am applying a function (e.g. letter frequency) to a dask_cudf dataframe that consists of a single column of words of fixed length. I am trying to merge the output or append the output into the ...
0 votes · 1 answer · 88 views

Optimizing Pandas GroupBy and Aggregation on Large Datasets with Multiple Custom Functions

I'm working with a large Pandas dataframe (about 30.5 million rows) where I need to group by multiple columns and apply different custom aggregation functions. However, the performance is currently a ...
2 votes · 0 answers · 25 views

How to speed up a dask delayed compute of a large dictionary?

I need to run a random forest classifier that I've put into a function ~ 10,000 times - because I sample randomly each time. I am trying to use dask delayed on a slurm-scheduled HPC cluster. My script ...
1 vote · 0 answers · 70 views

Working with larger than memory data in numpy

I am working on a project that involves larger-than-memory 3-dimensional numpy arrays. The project will be deployed with AWS Lambda. I am faced with two design choices: a) Re-write large parts of the ...
0 votes · 1 answer · 29 views

Common workflow for using Dask on HPC systems

I'm new to Dask. I'm currently working on an HPC system managed by SLURM, with some compute nodes (those that execute the jobs) and a login node (which I access through SSH to submit the SLURM jobs). I'm ...
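
For reference, a sketch of the common dask-jobqueue pattern run from the login node; the resource numbers are placeholders for whatever the site allows:

    from dask_jobqueue import SLURMCluster
    from dask.distributed import Client

    cluster = SLURMCluster(cores=8, memory="32GB", walltime="01:00:00")
    cluster.scale(jobs=4)      # submit 4 SLURM jobs, each becoming a worker
    client = Client(cluster)   # driver stays on the login node
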
0 votes · 0 answers · 26 views

Run only a single chunk's worth of data without creating a Dask graph of all my chunks?

# Template xarray based on Earth Engine
import numpy as np
import pandas as pd
import dask
import xarray as xr

# Define the dimensions
time = pd.date_range("2020-12-29T18:57:32.281000", ...
1 vote · 1 answer · 74 views

In python xarray, how do I create and subset lazy variables without loading the whole dataarray?

I am trying to create a python function that opens a remote dataset (on an OPeNDAP server) using xarray and automatically creates new variables lazily. A use case would be to calculate magnitude and ...
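
A minimal sketch under assumptions (placeholder URL and variable names): arithmetic on dask-backed variables stays lazy, so derived variables can be attached and subset before any data is transferred.

    import numpy as np
    import xarray as xr

    ds = xr.open_dataset("https://example.com/opendap/dataset", chunks={})
    ds["speed"] = np.hypot(ds["u"], ds["v"])     # lazy; nothing fetched yet
    subset = ds["speed"].isel(time=slice(0, 10))
    subset.load()                                # only this slice is downloaded
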
-1 votes · 1 answer · 47 views

Is there an efficient way to update / replace a specific value of a dask array in python?

So I have a dask array of integers (1 x 8192) and I want to find an efficient way to replace a specific value. This is the code I am currently using, which is very slow, because dask arrays are immutable, so ...
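
One lazy alternative, sketched: da.where builds a new array with the replacement instead of mutating in place.

    import dask.array as da
    import numpy as np

    arr = da.from_array(np.array([1, 2, 3, 2, 5]), chunks=2)
    replaced = da.where(arr == 2, 99, arr)  # swap every 2 for 99, lazily
    print(replaced.compute())               # [ 1 99  3 99  5]
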
0 votes · 0 answers · 43 views

How to fix memory errors merging large dask dataframes?

I am trying to read 23 CSV files into dask dataframes, merge them together using dask, and output to parquet. However, it's failing due to memory issues. I used to use pandas to join these together ...
2 votes · 1 answer · 71 views

xarray and dask: efficiently processing a large netcdf file

I am trying to do a simple calculation on a very large netcdf file, and am struggling to speed it up -- probably because I program primarily in julia and R. I think xarray/dask are the best approach ...
2 votes · 1 answer · 124 views

How to accelerate getting points within distance using two DataFrames?

I have two DataFrames (df and locations_df), and both have longitude and latitude values. I'm trying to find the df's points within 2 km of each row of locations_df. I tried to vectorize the function, ...
0 votes · 1 answer · 36 views

dask map_partitions strange behaviour

When I create a dask dataframe from pandas with 1 partition, then call map_partitions() on it, it seems to be called twice. If I have 5 partitions, it is called 6 times. In general, the function is ...
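
For context, a sketch of the usual explanation: the extra call is dask probing the function on an empty dummy frame to infer output metadata, and supplying meta= skips that probe.

    import dask.dataframe as dd
    import pandas as pd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=5)

    def f(part):
        print("called")  # with meta given, fires once per partition only
        return part.assign(y=part.x * 2)

    out = ddf.map_partitions(f, meta={"x": "int64", "y": "int64"})
    out.compute()
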
0 votes · 1 answer · 57 views

How do I set the timeout parameter in dask scheduler

I was trying to run dask-distributed to distribute some big computation in a slurm cluster. I was always getting a "TimeoutError: No valid workers found" message (this came from line 6130 in ...
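
A hedged sketch of where such timeouts live; these keys exist in distributed's configuration tree, though whether raising them cures this particular error depends on the cluster:

    import dask

    dask.config.set({
        "distributed.comm.timeouts.connect": "60s",
        "distributed.comm.timeouts.tcp": "60s",
    })
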
0 votes · 0 answers · 27 views

How to Serialize and read metadata in Dask Dataframe

I'm trying to pre-compute dask divisions and categories for certain columns in a dask dataframe, then save the data as a new dataframe and reuse it down the road. The issue I encountered is that when I read ...
0 votes · 0 answers · 23 views

Do dask workers heartbeat when loaded with tasks?

When dask workers have been assigned some tasks by the scheduler, do the workers still send heartbeats to the scheduler, or do they stop heartbeating while performing tasks? I am thinking that while working, dask-...
0 votes · 0 answers · 39 views

Null values in log file from Python logging module when used with Dask

I have been trying to set up logging using the logging module in a Python script, and I have got it working properly. It can now log to both the console and a log file. But it fails when I set up a Dask ...
0 votes · 0 answers · 19 views

Does the OS version need to match for Dask client and scheduler/workers?

I have a Dask cluster running on Fargate/ECS. On an EC2 instance, I am running the client that submits a job (subset and average along particular dimensions on a Zarr dataset) to the cluster. I am ...
0 votes · 0 answers · 74 views

Progress bar on xarray.open_mfdataset?

I'm using xarray.open_mfdataset() to open tens of thousands of (fairly small) netCDF files, and it's taking quite some time. Being of an impatient nature, I would like to at least be assured that ...
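
A sketch assuming the local (non-distributed) scheduler: with parallel=True the per-file opens become dask tasks, so dask's ProgressBar can at least show that phase advancing.

    import xarray as xr
    from dask.diagnostics import ProgressBar

    with ProgressBar():
        ds = xr.open_mfdataset("data/*.nc", parallel=True, combine="by_coords")
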
0 votes · 0 answers · 18 views

Use case for dask to spread model execution across machines

I'd like to know if this is a good use case for dask and how to implement it. I have a REST API endpoint that will be receiving a POST request with some parameters in a JSON object. We will be ...
0 votes · 0 answers · 16 views

Detecting schema inconsistencies in json data with Dask bag

I have spent a few days trying to battle an issue with json data and dask, so I finally resort to asking here. I have a collection of json files that should all have the same data structure, containing ...
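
One way to surface the drift, sketched with placeholder paths: reduce each record to its key set and count the distinct shapes.

    import json
    import dask.bag as db

    records = db.read_text("data/*.json").map(json.loads)
    shapes = records.map(lambda r: tuple(sorted(r))).frequencies().compute()
    print(shapes)  # one (key-set, count) pair per distinct record shape
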
0 votes · 1 answer · 50 views

How can I keep Dask workers busy when processing large datasets to prevent them from running out of tasks?

I'm trying to process a large dataset (around 1 million tasks) using Dask distributed computing in Python. (I am getting data from a database to process it, and I am retrieving around 1M rows.) Here I ...
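
A hedged sketch of the standard throttled-submission pattern: keep a bounded window of futures in flight and top it up from as_completed, so workers never starve and the scheduler never holds a million tasks at once (work() is a stand-in):

    from dask.distributed import Client, as_completed

    def work(x):
        return x * x  # stand-in for the real task

    if __name__ == "__main__":
        client = Client()
        tasks = iter(range(1_000_000))
        window = [client.submit(work, next(tasks)) for _ in range(500)]
        ac = as_completed(window)
        for fut in ac:
            fut.result()  # consume a finished task
            try:
                ac.add(client.submit(work, next(tasks)))  # refill the window
            except StopIteration:
                pass
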
0 votes · 0 answers · 33 views

Memory issues when using Dask

I have just started using 'dask', and I am running into some memory issues. I would like to share my problem along with a dummy code snippet that illustrates what I’ve done. My goal is to read a large ...
0 votes · 1 answer · 56 views

Dask dataframe running parallel and partitioned by columns

I have a dataframe with multiple companies and countries' data that I am trying to transform in parallel using a function. The data takes a format like this but is much larger and with many more ...