
I have a data set that is larger than my memory. In total I have to loop through 350 points, and each point is a data set of about 80 GB. Usually I get around this by dealing with one file at a time, but now I'm performing a computation that requires all of the data at once, so I'm looking for suggestions on how to tackle this. I've been reading a bit about Dask and PySpark, but I'm not sure they are what I need. I can't simply divide the data into chunks, because I'm performing a PCA (principal component analysis) that needs to be computed over the whole of it; the data are velocity fields, not tables. Perhaps changing the float format of the arrays in memory could work, or some other trick to compress them in memory. The files at each point are in pickle format; there are 3200 files in total, giving about 32 TB of data.
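
For example, a single-precision copy already halves the footprint of each array; a rough sketch of what I mean, with a made-up field shape:

    import numpy as np

    # Made-up velocity field; pickled NumPy arrays are float64 by default
    field = np.random.rand(1000, 1000, 3)
    print(field.nbytes / 1e6)    # ~24 MB in float64

    field32 = field.astype(np.float32)
    print(field32.nbytes / 1e6)  # ~12 MB, usually still enough precision for a PCA

    field16 = field.astype(np.float16)
    print(field16.nbytes / 1e6)  # ~6 MB, but only ~3 significant decimal digits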

I have 64 GB of RAM and a CPU with 32 cores.

Any guidance over this issue is very much appreciated.

  • what compromises are you willing to make?
    – Marat
    Commented Jul 28, 2022 at 15:02
  • I can either skip some files (2500 instead of 3200, perhaps) or reduce my ROI in the data a bit; that way I might manage to fit it. What do you propose?
    – jsp
    Commented Jul 28, 2022 at 15:06
  • I don't know much about your problem, so some naive questions: is it really necessary to use the entirety of the 80 GB from each dataset, or is it sufficient to sample? Are distributions across datasets different enough to require PCA across all of them? Can you run a bunch of PCAs on smaller samples to see how stable the result is?
    – Marat
    Commented Jul 28, 2022 at 15:12
  • Are you using a custom-written PCA or do you use a library to do it for you? If you can use libraries you could try IncrementalPCA from sklearn (see the sketch after these comments).
    – Nilau
    Commented Jul 28, 2022 at 15:26
  • Have you looked at dask-ml? It can do PCA with chunked data. TensorFlow is also a good choice.
    Commented Jul 28, 2022 at 17:31
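
To make the IncrementalPCA suggestion concrete, here is a minimal sketch, assuming each pickle file holds a 2-D array whose rows are flattened velocity-field snapshots (the file pattern, array shape and n_components are all made up):

    import glob
    import pickle

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    # Hypothetical layout: one pickled 2-D array (snapshots x flattened field) per file
    files = sorted(glob.glob("data/point_*.pkl"))

    ipca = IncrementalPCA(n_components=50)  # each chunk must contain at least 50 samples

    for path in files:
        with open(path, "rb") as f:
            fields = pickle.load(f)
        # float32 halves the footprint versus float64 without changing the API
        chunk = np.asarray(fields, dtype=np.float32).reshape(len(fields), -1)
        ipca.partial_fit(chunk)  # updates the principal components with this chunk only

    print(ipca.explained_variance_ratio_[:10])
    # individual files can then be projected one at a time with ipca.transform(chunk)

dask-ml's PCA, mentioned in the last comment, takes a similar chunked approach but runs it over a dask array, which could make better use of the 32 cores.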

1 Answer


In general you can use data generators for this. They allow you to consume a dataset without loading it into memory all at once.

In practice you can use TensorFlow. For the data generator use:

tf.data.Dataset.from_generator

(https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator)

And to apply PCA: tft.pca (https://www.tensorflow.org/tfx/transform/api_docs/python/tft/pca)
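
A minimal sketch of the generator side, assuming each pickle holds a 2-D array of flattened velocity-field snapshots (the file pattern and feature size are made up):

    import glob
    import pickle

    import numpy as np
    import tensorflow as tf

    # Assumed layout: one pickled 2-D array (snapshots x flattened field) per file
    FILES = sorted(glob.glob("data/point_*.pkl"))
    N_FEATURES = 4096  # hypothetical flattened length of one velocity field

    def gen():
        # Yield one file at a time, so only a single file is ever resident in memory
        for path in FILES:
            with open(path, "rb") as f:
                arr = pickle.load(f)
            yield np.asarray(arr, dtype=np.float32).reshape(-1, N_FEATURES)

    dataset = tf.data.Dataset.from_generator(
        gen,
        output_signature=tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
    )

    # tft.pca is then applied inside a TensorFlow Transform preprocessing_fn,
    # which streams over the dataset to build the projection matrix.
    for batch in dataset.take(1):
        print(batch.shape)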
