Reading the contents of a gzip file from AWS S3 in Python

Kshitij Marwah · Dec 15, 2016 · Viewed 22.5k times

I am trying to read some logs from a Hadoop process that I run in AWS. The logs are stored in an S3 folder and have the following path.

bucketname = name
key = y/z/stderr.gz

Here y is the cluster id and z is a folder name; both of these act as folders (objects) in AWS. So the full path is like x/y/z/stderr.gz.

Now I want to unzip this .gz file and read its contents. I don't want to download the file to my system; I want to keep the contents in a Python variable.

This is what I have tried so far:

bucket_name = "name"
key = "y/z/stderr.gz"
obj = s3.Object(bucket_name,key)
n = obj.get()['Body'].read()

This gives me raw bytes that are not readable. I also tried

n = obj.get()['Body'].read().decode('utf-8')

which gives an error: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.
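The decode fails because the body is still gzip-compressed: a gzip stream starts with the magic bytes 0x1f 0x8b, and 0x8b is not valid UTF-8, hence the error at position 1. A quick check, assuming n is the raw bytes from the read() call above:

print(n[:2])   # b'\x1f\x8b' for gzip data, i.e. the gzip magic number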

I have also tried

stream = StringIO(obj)
gzipfile = gzip.GzipFile(fileobj=stream)
content = gzipfile.read()

This returns the error IOError: Not a gzipped file.

Not sure how to decode this .gz file.

Edit: I found a solution. I needed to pass n into BytesIO instead:

bytestream = BytesIO(n)
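Put together, that approach looks roughly like this (a sketch using the placeholder bucket name and key from above):

import gzip
from io import BytesIO

import boto3

s3 = boto3.resource("s3")
obj = s3.Object("name", "y/z/stderr.gz")
n = obj.get()['Body'].read()                       # compressed bytes

with gzip.GzipFile(fileobj=BytesIO(n)) as gzipfile:
    content = gzipfile.read()                      # decompressed bytes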

Answer

Kirk · Jan 7, 2020

This is old, but you no longer need the BytesIO object in the middle (at least with boto3==1.9.223 and Python 3.7):

import boto3
import gzip

s3 = boto3.resource("s3")
obj = s3.Object("YOUR_BUCKET_NAME", "path/to/your_key.gz")
# obj.get()["Body"] is a botocore StreamingBody, which is file-like, so GzipFile can read from it directly
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    content = gzipfile.read()
print(content)
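Note that gzipfile.read() returns bytes. If you want the log as text, decode it afterwards (a small sketch, assuming the stderr log is UTF-8 encoded):

text = content.decode("utf-8")   # assumes the log is UTF-8 text
for line in text.splitlines():
    print(line)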