4,647 questions
0
votes
0
answers
6
views
Initializing a local cluster in Dask takes forever
I'm trying out some things with Dask for the first time, and while I had it running a few weeks ago, I now find that I can't get the LocalCluster initiated. I've cut if off after running 30 minutes at ...
0
votes
1
answer
28
views
Pandas read_excel with nrows and skiprows and lazy-loading?
I am searching for ways to read an .xlsx file as chunks of dataframes, instead of loading the whole thing into memory. What exactly happens when I pd.read_excel(nrows, skiprows, usecols) ? Is the ...
0
votes
0
answers
28
views
dask_cuda problem with Local CUDA Cluster
I am trying to get this code to work and then use it to train various models on two gpu's:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
if __name__ == "__main__&...
0
votes
0
answers
19
views
Finding a way to iterate using the input of two xarray dataarrays when chunked
I am developing a relatively large model using xarray and therefore want to make use of chunks. Most of my operations run a lot faster when chunked but there is one that keeps running (a lot) slower ...
1
vote
1
answer
34
views
Sampling a parquet file with only one row_group
I am working on a huge parquet file containing over 30 million rows. I only need a small portion of it and wish pick a number of randomly selected rows. When i inspect the file's metadata, there's ...
0
votes
0
answers
7
views
What is ToParquetBarrier in Dask Dataframe
I'm trying to code some heavy data merging tool and I would like to handle more data than fits into my ram at the same time.
It seems like the to_parquet function has a function called "...
0
votes
0
answers
23
views
Processing the same array, dask.array is too slow compared to numpy.array
import BWStest as bws
import numpy as np
from skimage.measure import label
import dask.array
from tqdm import tqdm
CalWin = [7,25]
stack = []
thershold = 0.05
for i in range(5):
image = np.random....
0
votes
1
answer
50
views
Error when using xarray.apply_ufunc on a chunked xarray DataArray
Hello I am trying to sort the data inside some netCDF4 (.nc) files into bins as efficiently as possible. I am currently trying this with xarray and NumPy's digitize function. Since I want to process a ...
1
vote
1
answer
27
views
How to nest dask.delayed functions within other dask.delayed functions
I am trying to learn dask, and have created the following toy example of a delayed pipeline.
+-----+ +-----+ +-----+
| baz +--+ bar +--+ foo |
+-----+ +-----+ +-----+
So baz has a dependency on ...
0
votes
1
answer
33
views
Dask distributed pause task to wait for subtask - how-to, or bad practice?
I am running tasks using client.submit thus:
from dask.distributed import Client, get_client, wait, as_completed
# other imports
zip_and_upload_futures = [ client.submit(zip_and_upload, id, path, ...
0
votes
1
answer
32
views
Storing larger-than-memory xarray dataset to zarr using `dask.delayed` without blowing up memory
I'm trying to use dask to conduct larger-than-memory processing in xarray. Concretely, I'm trying to:
Concatenate several NetCDF files (on the same geo grid, same variables) by time
Regrid them to a ...
0
votes
0
answers
27
views
Is there a way apart from chunking of Dask to avoid excessive RAM usage due to large dataset?
I am using the following code to calculate some speed-related variables for a dataset that consists of about 200 million rows. In order to avoid Memory issues, I am using chunking.
import pandas as pd
...
0
votes
0
answers
40
views
Why is joblib's Parallel delayed faster than dasks map block and compute()
This question is possibly related to this one. I have 4D numpy array and would like to apply a function to each 2D slice across the first two dimensions. I have implemented the analysis for both dask ...
1
vote
0
answers
20
views
How to make xarray.to_zarr write the same way as dask.array.to_zarr?
I have a 3D xarray DataArray that I've saved to to zarr using DataArray.to_zarr(). However, the resulting format is not readable by my tools, in this case napari. The structure of the zarr has one ...
0
votes
0
answers
65
views
How to apply a function to each 2D slice of a 4D Numpy array in parallel with Dask without running out of RAM?
I want to apply a function to each 2D slice of a 4D Numpy array using Dask. The output should be a 2D matrix (the function applied to each 2D slice returns a single value). I would like to do this in ...
0
votes
0
answers
18
views
Why does xarray.groupby with Dask-backed datasets fail on large data?
I'm working with a large Dask-backed Xarray dataset loaded from a Zarr store. The dataset has dimensions along a dim1 axis and includes a coordinate (group_key) and a data variable (data_var). Both ...
0
votes
0
answers
24
views
How does parquet read happens in the Modin distributed environment?
I am using Modin for reading the parquet file through panda. I have two clusters of 256 GB of RAM but even though duckdb on the single machine takes 50 seconds to go group by. Modin is going infinite ...
1
vote
1
answer
252
views
Django + Dask integration: usage and progress?
About performance & best practice
Note, the entire code for the question below is public on Github.
Feel free to check out the project! https://github.com/b-long/moose-dj-uv/pull/3
I'm trying to ...
0
votes
0
answers
20
views
Dask worker running long calls
The code running on the dask worker calls asyncio.run() and proceeds to exectue a series of async calls (on the worker running event_loop) that gather data, and then run a small computation.
This ...
0
votes
0
answers
22
views
Dask concat issue with DataFrames of different lengths, pandas works fine
I’m trying to use dask to read at least 12 files with approx. 700 MB, each. The files were created on the same way, with only one column and with exactly the same number of lines (about 43 million ...
0
votes
0
answers
22
views
Dask-distributed LocalCluster spinup hangs only in non-interactive execution
I want to start and connect to a dask.distributed LocalCluster from within a python script. However, I get different behavior depending on the way in which I invoke the same lines of code. The ...
0
votes
1
answer
33
views
DASK to_csv() problems due to memory
I'm cleaning my text data and afterwards want to save it to csv. Defined cleaning functions work fine, but when to_csv() part comes, here come the problems as well.
Maybe someone have faced similar ...
0
votes
1
answer
18
views
Dask, how to drop row with specific value in a variable into lazy computing
I'm trying to learn to walkaround with dask for my machine learning project.
My data set is too big to play with Pandas, so I must stay in lazy loading.
here a smal sample to show how it is set up:
I ...
0
votes
0
answers
29
views
Dask running very slow
I am new to using Dask. I have a small piece of code like below.
In this code, I am triggering computation on.
el.compute()
Is this the correct way to use Dask's compute function?
My code runs very ...
0
votes
0
answers
25
views
Importing SQL Table from Snowflake into Jupyter using Dask
I have an SQL Table in Snowflake,100K rows and 15 Columns. I want to import this table into my Jupyter notebook using Dask for further analysis. Primarily doing this a form of practice since I am new ...
0
votes
1
answer
64
views
Understanding if current process is part of a multiprocessing pool with --multiprocessing-fork
I need to find a way for a python process to figure out if it was launched as part of a multiprocessing pool.
I am using dask to parallelize calculations, using dask.distributed.LocalCluster. For UX ...
0
votes
0
answers
21
views
ModuleNotFoundError: no module named ... when submitting job to client with an object from that module
I have a weird issue I am not able to solve.
I have created a Local dask client, and I have a preload script that imports local modules. The preload script contains a list of paths, and runs sys.path....
0
votes
2
answers
79
views
how do I append the output of a dask_cudf apply function to the original dask_cudf?
I am applying a function (e.g. letter frequency) to a dask_cudf dataframe that consists of a single column of words of fixed length.
I am trying to merge the output or append the output into the ...
0
votes
1
answer
88
views
Optimizing Pandas GroupBy and Aggregation on Large Datasets with Multiple Custom Functions
I'm working with a large Pandas dataframe (about 30.5 million rows) where I need to group by multiple columns and apply different custom aggregation functions. However, the performance is currently a ...
2
votes
0
answers
25
views
How to speed up a dask delayed compute of a large dictionary?
I need to run a random forest classifier that I've put into a function ~ 10,000 times - because I sample randomly each time. I am trying to use dask delayed on a slurm-scheduled HPC cluster. My script ...
1
vote
0
answers
70
views
Working with larger than memory data in numpy
I am working on a project that involves larger than memory numpy 3 dimensional arrays. The project will be deployed with AWS lambda. I am faced with two design choices
a) Re-write large parts of the ...
0
votes
1
answer
29
views
Common workflow for using Dask on HPC systems
I’m new to Dask. I’m currently working in an HPC managed by SLURM with some compute nodes (those that execute the jobs) and the login node (which I access through SSH to send the SLURM jobs). I’m ...
0
votes
0
answers
26
views
Run only a single chunk's worth of data without creating a Dask graph of all my chunks?
# Template xarray based on Earth Engine
import numpy as np
import pandas as pd
import dask
import xarray as xr
# Define the dimensions
time = pd.date_range("2020-12-29T18:57:32.281000", ...
1
vote
1
answer
74
views
In python xarray, how do I create and subset lazy variables without loading the whole dataarray?
I am trying to create a python function that opens a remote dataset (in a opendap server) using xarray and automatically creates new variables lazily. A use case would be to calculate magnitude and ...
-1
votes
1
answer
47
views
Is there an efficient way to update / replace a specific value of a dask array in python?
So I have a dask array of integers (1 x 8192) and I want to find an efficient way to replace a specific value.
This is the code I am currently using, which is very slow, because dask is immutable, so ...
0
votes
0
answers
43
views
How to fix memory errors merging large dask dataframes?
I am trying to read 23 CSV files into dask dataframes, merge them together using dask, and ouptut to parquet. However, it's failing due to memory issues.
I used to use pandas to join these together ...
2
votes
1
answer
71
views
xarray and dask: efficiently processing a large netcdf file
I am trying to do a simple calculation on a very large netcdf file, and am struggling to speed it up -- probably because I program primarily in julia and R. I think xarray/dask are the best approach ...
2
votes
1
answer
124
views
How to accelerate getting points within distance using two DataFrames?
I have two DataFrames (df and locations_df), and both have longitude and latitude values. I'm trying to find the df's points within 2 km of each row of locations_df.
I tried to vectorize the function, ...
0
votes
1
answer
36
views
dask map_partitions strange behaviour
When I create a dask dataframe from pandas with 1 partition, then call map_partitions() on it, it seems to be called twice. If I have 5 partitions, it is called 6 times. In general, the function is ...
0
votes
1
answer
57
views
How do I set the timeout parameter in dask scheduler
I was trying to run dask-distributed to distribute some big computation in a slurm cluster.
I was always getting a "TimeoutError: No valid workers found" message (this came from line 6130 in ...
0
votes
0
answers
27
views
How to Serialize and read metadata in Dask Dataframe
I'm trying to pre-compute dask divisions and categories for certain columns in dask dataframe, then save data as new dataframe and reuse it down the road.
Issue I encountered is that when I read ...
0
votes
0
answers
23
views
Does dask worker heartbeat when loaded with tasks?
When dask workers have been assigned some tasks by schedulers, do the workers send heartbeats to scheduler or while performing tasks then don't heart beat. I am thinking that while working dask-...
0
votes
0
answers
39
views
Null values in log file from Python logging module when used with Dask
I have been trying to setup logging using the logging module in a Python script, and I have got it working properly. It can now log to both the console and a log file. But if fails when I setup a Dask ...
0
votes
0
answers
19
views
Does the OS version need to match for Dask client and scheduler/workers?
I have a Dask cluster running on Fargate/ECS. On an EC2 instance, I am running the client that submits a job (subset and average along particular dimensions on a Zarr dataset) to the cluster. I am ...
0
votes
0
answers
74
views
Progress bar on xarray.open_mfdataset?
I'm using xarray.open_mfdataset() to open tens of thousands of (fairly small) netCDF files, and it's taking quite some time. Being of an impatient nature, I would like to at least be assured that ...
0
votes
0
answers
18
views
Use case for dask to spread model execution across machines
I'd like to know if this is a good use case for dask and how to implement it.
I have a rest API endpoint that will be receiving a POST request with some parameters in a JSON object. We will be ...
0
votes
0
answers
16
views
Detecting schema inconsistencies in json data with Dask bag
I have spent a few days trying to battle an issue with json data and dask, so I finally resort to asking here.
I have a collection of json files who should all have the same data structure containing ...
0
votes
1
answer
50
views
How can I keep Dask workers busy when processing large datasets to prevent them from running out of tasks?
I'm trying to process a large dataset (around 1 million tasks) using Dask distributed computing in Python. (I am getting data from a database to process it, and I am retriving around 1M rows). Here I ...
0
votes
0
answers
33
views
Memory issues when using Dask
I have just started using 'dask', and I am running into some memory issues. I would like to share my problem along with a dummy code snippet that illustrates what I’ve done.
My goal is to read a large ...
0
votes
1
answer
56
views
Dask dataframe running parallel and partitioned by columns
I have a dataframe with multiple companies and countries' data that I am trying to transform in parallel using a function. The data takes a format like this but is much larger and with many more ...