I have a data set that is larger than my memory. In general, I have to loop through 350 points, and each point is a data set of about 80 GB. Usually I get around this by dealing with one file at a time, but now I'm performing a computation that requires me to load all the data at once, and I'm looking for suggestions on how to tackle this.

I've already been reading a bit about Dask and PySpark, but I'm not sure they're what I need. I can't divide my data into chunks, because I'm performing a PCA (principal component analysis) and need to run the calculation over the whole data set; the data are velocity fields, not tables. Perhaps changing the float format of the array in memory could work, or some other trick to compress the array in memory. All the files at each point are in pickle format; there are 3200 files, giving a total of about 32 TB of data.
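To make the float-format idea concrete, here is a minimal sketch of what I have in mind (file names, shapes, and the number of files per point are just placeholders): load each pickle and downcast the array from float64 to float32, which would halve the memory footprint, though it still wouldn't get 32 TB anywhere near 64 GB of RAM.

```python
import pickle
import numpy as np

# Hypothetical list of pickle files for one point; real paths differ.
files = ["point_000_part_{:02d}.pkl".format(i) for i in range(9)]

arrays = []
for path in files:
    with open(path, "rb") as f:
        field = pickle.load(f)  # velocity field stored as a NumPy array
    # Downcast from float64 to float32 to halve the in-memory size.
    arrays.append(np.asarray(field, dtype=np.float32))

# Stack the per-file fields into one array for the PCA step.
data = np.concatenate(arrays, axis=0)
```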
I have 64 GB of RAM and a CPU with 32 cores.
Any guidance on this issue is very much appreciated.