
I have a zip directory similar to this one:

folder_to_zip
    - file_1.csv
    - folder_1.zip
        - file_2.csv
        - file_3.csv
        - folder_2.zip
            - file_4.csv
            - file_5.csv
            - file_6.csv
    - file_7.csv

and I would like to load each CSV file into a separate pandas DataFrame, without extracting the archive first.

The reason I want to do that is to keep this project light: the zipped folder is only 639 MB instead of 7.66 GB uncompressed.

Based on these questions (Python: Open file in zip without temporarily extracting it; Python py7zr can't list files in archive - how to read 7z archive without extracting it), I tried something like this:

from py7zr import SevenZipFile as szf
import os
import pandas as pd


def unzip_(folder_to_zip):
    dfs = []
    if folder_to_zip.endswith('.csv'):
        # plain CSV: load it directly
        dfs.append(pd.read_csv(folder_to_zip))
    else:
        # archive: recurse into each member
        with szf(folder_to_zip, 'r') as z:
            for f in z.getnames():
                dfs += unzip_(f)
    return dfs

  • What is not working with the current approach? It is not clear from the question
    – Droid
    Commented Mar 2, 2023 at 10:07
  • You're probably going to need a lot more than 8 GB of RAM to do this; are you expecting that?
    – Sam Mason
    Commented Mar 2, 2023 at 10:07
  • @SamMason Two things: first of all, thank you for the formatting; it is really what I wanted to do, but I do not know why it did not work (feel free to give me any suggestions). Secondly, why? I am not going to save anything
    – user19480211
    Commented Mar 2, 2023 at 10:15
  • pandas only knows about uncompressed CSV data, so you'd need to decompress the files somewhere (presumably RAM) before it can do anything with them. Even if you work one file at a time, pandas needs somewhere to store the loaded data in RAM, and this is often similar in size to the raw CSV data (numbers will likely take less RAM, text will likely take more)
    – Sam Mason
    Commented Mar 2, 2023 at 10:20

1 Answer


If you really want to do this, it would be something like:

import py7zr
import pandas as pd

dfs = {}
with py7zr.SevenZipFile("archive.7z") as ar:
    # restrict read() to the CSV members; it returns a {name: BytesIO} mapping
    targets = [name for name in ar.getnames() if name.endswith(".csv")]
    for name, fd in ar.read(targets).items():
        dfs[name] = pd.read_csv(fd)

Note this loads into a dictionary rather than a list (as I'm not sure how well defined the ordering coming out of read() is).

But given the RAM requirements, this seems less useful for your use case.
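One wrinkle the snippet above doesn't cover: the layout in the question nests .zip archives inside the outer archive, and those members won't show up as CSVs in the top-level getnames(). A sketch of handling that recursively, using the standard-library zipfile module (this assumes the nested archives are plain zips; the file and column names below are made up for the demo):

```python
import io
import zipfile
import pandas as pd

def read_csvs_from_zip(data: bytes, prefix: str = "") -> dict:
    """Recursively collect every CSV in a (possibly nested) zip into DataFrames."""
    dfs = {}
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        for name in z.namelist():
            raw = z.read(name)  # member bytes, decompressed in memory
            if name.endswith(".csv"):
                dfs[prefix + name] = pd.read_csv(io.BytesIO(raw))
            elif name.endswith(".zip"):
                # recurse into the nested archive without touching the disk
                dfs.update(read_csvs_from_zip(raw, prefix + name + "/"))
    return dfs

# Build a tiny nested archive in memory to demonstrate
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as z:
    z.writestr("file_2.csv", "a,b\n1,2\n")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as z:
    z.writestr("file_1.csv", "x,y\n3,4\n")
    z.writestr("folder_1.zip", inner.getvalue())

dfs = read_csvs_from_zip(outer.getvalue())
print(sorted(dfs))  # ['file_1.csv', 'folder_1.zip/file_2.csv']
```

The keys carry the path into the nested archive, so each DataFrame stays traceable to its source file.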

  • Thank you; actually I have to store all the CSV files in a MongoDB collection. Is your advice to extract the content of each file, read it with pandas, and then store it in Mongo? My point was: this Python script will be executed only once, so what is the reason to save those files on the drive? There is no problem with RAM; it just seemed smarter to me
    – user19480211
    Commented Mar 2, 2023 at 10:38
  • If it's only going to be run once and it only takes a few GB, then I don't think it matters what you do, does it? If you really cared, then I wouldn't store all the loaded dataframes in a list; I'd just insert them straight away so they're not all in memory at one time
    – Sam Mason
    Commented Mar 2, 2023 at 11:14
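The one-file-at-a-time insert suggested in the last comment could look something like the sketch below. It assumes the outer archive is a plain zip (swap in py7zr for a 7z file); the MongoDB host, database, and collection names in the commented-out lines are placeholders, not anything from the question:

```python
import zipfile
import pandas as pd

def stream_csvs_to_collection(archive_path, collection):
    """Insert each CSV's rows into `collection` one file at a time,
    so only a single DataFrame is ever held in memory."""
    with zipfile.ZipFile(archive_path) as z:
        for name in z.namelist():
            if not name.endswith(".csv"):
                continue
            with z.open(name) as fd:
                df = pd.read_csv(fd)
            records = df.to_dict("records")  # list of {column: value} dicts
            for r in records:
                r["_source_file"] = name     # tag each row with its origin
            collection.insert_many(records)
            # df goes out of scope here, freeing it before the next file

# With pymongo (assumed connection details -- adjust to your setup):
# from pymongo import MongoClient
# coll = MongoClient("mongodb://localhost:27017")["mydb"]["csv_rows"]
# stream_csvs_to_collection("folder_to_zip.zip", coll)
```

Because only one DataFrame exists at a time, peak memory stays close to the largest single CSV rather than the sum of all of them.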
