
I have a data set that is larger than my memory. In total I have to loop through 350 points, and each point is a data set of about 80 GB. Usually I get around this by dealing with one file at a time, but now I'm performing a computation that requires all of the data at once, so I'm looking for suggestions on how to tackle this. I've been reading a bit about Dask and PySpark, but I'm not sure they are what I need. I can't simply divide the data into chunks, because I'm performing a PCA (principal component analysis) that needs to be computed over the whole of it; the data are velocity fields, not tables. Perhaps changing the float format of the arrays in memory could work, or some other trick to compress them in memory. The files at each point are in pickle format; there are 3200 files in total, giving about 32 TB of data.
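
For example, a single-precision copy already halves the footprint of each array; a rough sketch of what I mean, with a made-up field shape:

    import numpy as np

    # Made-up velocity field; pickled NumPy arrays are float64 by default
    field = np.random.rand(1000, 1000, 3)
    print(field.nbytes / 1e6)    # ~24 MB in float64

    field32 = field.astype(np.float32)
    print(field32.nbytes / 1e6)  # ~12 MB, usually still enough precision for a PCA

    field16 = field.astype(np.float16)
    print(field16.nbytes / 1e6)  # ~6 MB, but only ~3 significant decimal digits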

I have 64 GB of RAM and a CPU with 32 cores.

Any guidance over this issue is very much appreciated.

  • what compromises are you willing to make?
    – Marat
    Commented Jul 28, 2022 at 15:02
  • I can either skip some files (2500 instead of 3200, perhaps) or reduce my ROI in the data a bit; that way I might manage to fit it. What do you propose?
    – jsp
    Commented Jul 28, 2022 at 15:06
  • I don't know much about your problem, so some naive questions: is it really necessary to use the entirety of the 80 GB from each dataset, or is it sufficient to sample? Are distributions across datasets different enough to require PCA across all of them? Can you run a bunch of PCAs on smaller samples to see how stable the result is?
    – Marat
    Commented Jul 28, 2022 at 15:12
  • Are you using a custom-written PCA or do you use a library to do it for you? If you can use libraries you could try IncrementalPCA from sklearn (see the sketch after these comments).
    – Nilau
    Commented Jul 28, 2022 at 15:26
  • Have you looked at dask-ml? It can do PCA with chunked data. TensorFlow is also a good choice.
    Commented Jul 28, 2022 at 17:31
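
To make the IncrementalPCA suggestion concrete, here is a minimal sketch, assuming each pickle file holds a 2-D array whose rows are flattened velocity-field snapshots (the file pattern, array shape and n_components are all made up):

    import glob
    import pickle

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    # Hypothetical layout: one pickled 2-D array (snapshots x flattened field) per file
    files = sorted(glob.glob("data/point_*.pkl"))

    ipca = IncrementalPCA(n_components=50)  # each chunk must contain at least 50 samples

    for path in files:
        with open(path, "rb") as f:
            fields = pickle.load(f)
        # float32 halves the footprint versus float64 without changing the API
        chunk = np.asarray(fields, dtype=np.float32).reshape(len(fields), -1)
        ipca.partial_fit(chunk)  # updates the principal components with this chunk only

    print(ipca.explained_variance_ratio_[:10])
    # individual files can then be projected one at a time with ipca.transform(chunk)

dask-ml's PCA, mentioned in the last comment, takes a similar chunked approach but runs it over a dask array, which could make better use of the 32 cores.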

1 Answer


In general you can use data generators for this. They allow you to consume a dataset without loading it into memory all at once.

In practice you can use TensorFlow. For the data generator use:

tf.data.Dataset.from_generator

(https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator)

And to apply PCA: tft.pca (https://www.tensorflow.org/tfx/transform/api_docs/python/tft/pca)
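
A minimal sketch of the generator side, assuming each pickle holds a 2-D array of flattened velocity-field snapshots (the file pattern and feature size are made up):

    import glob
    import pickle

    import numpy as np
    import tensorflow as tf

    # Assumed layout: one pickled 2-D array (snapshots x flattened field) per file
    FILES = sorted(glob.glob("data/point_*.pkl"))
    N_FEATURES = 4096  # hypothetical flattened length of one velocity field

    def gen():
        # Yield one file at a time, so only a single file is ever resident in memory
        for path in FILES:
            with open(path, "rb") as f:
                arr = pickle.load(f)
            yield np.asarray(arr, dtype=np.float32).reshape(-1, N_FEATURES)

    dataset = tf.data.Dataset.from_generator(
        gen,
        output_signature=tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
    )

    # tft.pca is then applied inside a TensorFlow Transform preprocessing_fn,
    # which streams over the dataset to build the projection matrix.
    for batch in dataset.take(1):
        print(batch.shape)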
