I am trying to read some logs from a Hadoop process that I run in AWS. The logs are stored in an S3 folder with the following path:
bucket name = name, key = y/z/stderr.gz. Here y is the cluster id and z is a folder name. Both of these act as folders (objects) in AWS, so the full path looks like x/y/z/stderr.gz.
Now I want to unzip this .gz file and read its contents. I don't want to download the file to my system; instead, I want to keep the contents in a Python variable.
This is what I have tried till now.
import boto3

s3 = boto3.resource("s3")
bucket_name = "name"
key = "y/z/stderr.gz"
obj = s3.Object(bucket_name, key)
n = obj.get()['Body'].read()
This gives me output that is not readable. I also tried
n = obj.get()['Body'].read().decode('utf-8')
which raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.
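That 0x8b byte is itself a clue: every gzip stream starts with the two magic bytes 0x1f 0x8b, so the body you read back is still compressed and cannot be decoded as UTF-8 text directly. A quick local check (no S3 involved) illustrates this:

```python
import gzip

# gzip streams always begin with the magic bytes 0x1f 0x8b;
# seeing 0x8b at position 1 of the S3 body means the payload
# is still compressed, not UTF-8 text.
compressed = gzip.compress(b"hello from stderr.gz")
print(compressed[:2])  # b'\x1f\x8b'
```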
I have also tried
sio = StringIO(obj)
gzipfile = gzip.GzipFile(fileobj=sio)
content = gzipfile.read()
which raises IOError: Not a gzipped file.
Not sure how to decode this .gz file.
Edit - found a solution: pass n (the raw bytes already read from the body) rather than the object itself, and use BytesIO instead of StringIO:
gzipfile = gzip.GzipFile(fileobj=BytesIO(n))
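Putting the pieces together, here is a minimal sketch of the working approach; the sample bytes stand in for what obj.get()['Body'].read() returns from S3:

```python
import gzip
from io import BytesIO

# Simulate the raw bytes returned by obj.get()['Body'].read();
# in the real script, n comes from the S3 object.
n = gzip.compress(b"log line 1\nlog line 2\n")

# Wrap the compressed bytes in BytesIO so GzipFile can treat
# them as a file-like object, then decompress and decode.
with gzip.GzipFile(fileobj=BytesIO(n)) as gzipfile:
    content = gzipfile.read().decode("utf-8")
print(content)
```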
This is old, but you no longer need the BytesIO object in the middle (at least with boto3==1.9.223 and Python 3.7):
import boto3
import gzip
s3 = boto3.resource("s3")
obj = s3.Object("YOUR_BUCKET_NAME", "path/to/your_key.gz")
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    content = gzipfile.read()
print(content)
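If the whole file fits in memory anyway, gzip.decompress() (available since Python 3.2) skips the file-object wrapping entirely; a sketch against local bytes rather than a live S3 object:

```python
import gzip

# gzip.decompress() turns a complete compressed payload into
# bytes in one call; here the sample bytes stand in for the
# S3 body that would come from obj.get()['Body'].read().
raw = gzip.compress(b"stderr output")
text = gzip.decompress(raw).decode("utf-8")
print(text)  # stderr output
```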