
I have several CSV files (about 50 GB in total) in an S3 bucket on AWS. I am trying to read these files in a Jupyter Notebook (Python 3 kernel) using the following code:

import boto3
import pandas as pd

session = boto3.session.Session(region_name='XXXX')
s3client = session.client('s3', config=boto3.session.Config(signature_version='XXXX'))
response = s3client.get_object(Bucket='myBucket', Key='myKey')
response = s3client.get_object(Bucket='myBucket', Key='myKey')

names = ['id','origin','name']
dataset = pd.read_csv(response['Body'], names=names)
dataset.head() 

But I face the following error when I run the code:

ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>

I came across this bug report about pandas and the boto3 streaming object not being compatible yet.

My question is: how else can I import these CSV files from my S3 bucket into my Jupyter Notebook, which also runs in the cloud?

2 Answers


You can also use s3fs, which allows pandas to read directly from S3:

import pandas as pd
import s3fs  # must be installed; pandas uses it under the hood for s3:// paths

# csv file
df = pd.read_csv('s3://{bucket_name}/{path_to_file}')

# parquet file
df = pd.read_parquet('s3://{bucket_name}/{path_to_file}')
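If the notebook does not pick up your AWS credentials automatically (for example from an attached instance role), recent pandas versions (1.2+) also accept a storage_options argument that is forwarded to s3fs. A minimal sketch, with placeholder bucket, path, and credential values:

import pandas as pd

# '{bucket_name}', '{path_to_file}' and the credential strings are placeholders
df = pd.read_csv(
    's3://{bucket_name}/{path_to_file}',
    storage_options={'key': 'YOUR_ACCESS_KEY_ID', 'secret': 'YOUR_SECRET_ACCESS_KEY'},
)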

And then if you have multiple files in a bucket, you can iterate through them like so:

import boto3
import pandas as pd

bucket_name = '{bucket_name}'  # replace with your bucket name
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(name=bucket_name)
for file in bucket.objects.all():
    # do what you want with the files
    # for example:
    if 'filter' in file.key:
        print(file.key)
        new_df = pd.read_csv('s3://{}/{}'.format(bucket_name, file.key))
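Since the question is about loading several CSV files, a natural follow-up is to stack the matching objects into a single DataFrame. A minimal sketch, assuming a hypothetical bucket name and that every matching key is a CSV small enough to concatenate in memory:

import boto3
import pandas as pd

bucket_name = 'myBucket'  # hypothetical bucket name
bucket = boto3.resource('s3').Bucket(name=bucket_name)

# read every CSV object in the bucket and concatenate the results
frames = [
    pd.read_csv('s3://{}/{}'.format(bucket_name, obj.key))
    for obj in bucket.objects.all()
    if obj.key.endswith('.csv')
]
dataset = pd.concat(frames, ignore_index=True)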

I am posting the fix to my problem, in case somebody needs it. I replaced the read_csv line with the following and the problem was solved:

import io  # needed for BytesIO
dataset = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='utf8')
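For completeness, here is how that fix can look in the context of the original snippet. Because the files are large, reading in chunks keeps the DataFrame construction incremental, although the object body itself is still read fully into memory; the column names, placeholder region, and chunk size below are just illustrative:

import io
import boto3
import pandas as pd

# s3client set up as in the question; 'XXXX' is a placeholder region
session = boto3.session.Session(region_name='XXXX')
s3client = session.client('s3')

response = s3client.get_object(Bucket='myBucket', Key='myKey')
buffer = io.BytesIO(response['Body'].read())  # whole object body ends up in memory

names = ['id', 'origin', 'name']
# chunked read builds the DataFrame piece by piece; chunk size is illustrative
chunks = pd.read_csv(buffer, names=names, encoding='utf8', chunksize=1_000_000)
dataset = pd.concat(chunks, ignore_index=True)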
