I have created a client/server architecture in Python. I take an HTTP request from the client and serve it by requesting another HTTP server from my code.
When I get the response from that third server, I am not able to decode the gzip-compressed data. I first split the response data using \r\n
as the separator, which gives me the data as the last item in the list, and then I tried decompressing it with
zlib.decompress(data[-1])
but it gives me an "incorrect header" error. How should I approach this problem?
Code
client_reply = ''
while 1:
    chunk = server2.recv(512)
    if len(chunk) :
        client.send(chunk)
        client_reply += chunk
    else:
        break
client_split = client_reply.split("\r\n")
print client_split[-1].decode('zlib')
I want to read the data that is being transferred between the client and the second server.
Specify the wbits argument when using zlib.decompress(string, wbits, bufsize); see the end of "troubleshooting" for an example.
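As a minimal sketch of what that means for the question's proxy code (written for Python 3; the function name split_and_gunzip and the sample response are made up for illustration): split the raw response at the first blank line rather than on every \r\n, because compressed bytes can themselves contain \r\n, then pass wbits=16 + zlib.MAX_WBITS so that zlib expects a gzip header. It assumes the body is gzip-compressed and not sent with chunked transfer encoding.

```python
import gzip
import zlib


def split_and_gunzip(raw_response: bytes) -> bytes:
    """Split a raw HTTP response into headers and body, then gunzip the body.

    Assumes a gzip-compressed body that is NOT chunk-encoded.
    """
    # Headers end at the first blank line; everything after it is the body.
    header_block, sep, body = raw_response.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    # 16 + MAX_WBITS (= 31) tells zlib to expect a gzip header and trailer.
    return zlib.decompress(body, 16 + zlib.MAX_WBITS)


# Hypothetical response built locally, so the example needs no network.
payload = b"hello\r\nworld"
fake_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"\r\n"
) + gzip.compress(payload)

print(split_and_gunzip(fake_response))
```

Note that the payload deliberately contains \r\n: splitting the whole response on \r\n and taking the last item would have returned only part of the body.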
Let's start with a curl command that downloads a byte-range response with an unknown "content-encoding" (note: we know beforehand that it is some sort of compressed data, maybe deflate, maybe gzip):
export URL="https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-18/segments/1461860106452.21/warc/CC-MAIN-20160428161506-00007-ip-10-239-7-51.ec2.internal.warc.gz"
curl -r 266472196-266527075 $URL | gzip -dc | tee hello.txt
With the following response headers:
HTTP/1.1 206 Partial Content
x-amz-id-2: IzdPq3DAPfitkgdXhEwzBSwkxwJRx9ICtfxnnruPCLSMvueRA8j7a05hKr++Na6s
x-amz-request-id: 14B89CED698E0954
Date: Sat, 06 Aug 2016 01:26:03 GMT
Last-Modified: Sat, 07 May 2016 08:39:18 GMT
ETag: "144a93586a13abf27cb9b82b10a87787"
Accept-Ranges: bytes
Content-Range: bytes 266472196-266527075/711047506
Content-Type: application/octet-stream
Content-Length: 54880
Server: AmazonS3
So, to the point.
Let's display the hex output of the first few bytes:
curl -r 266472196-266472208 $URL | xxd
hex output:
0000000: 1f8b 0800 0000 0000 0000 ecbd eb
The hex values tell us the basics of what we are working with.
Roughly: it is probably gzip ( 1f8b ) using deflate with no flags set ( 0800 ), without a modification time ( 0000 0000 ), with no extra flags ( 00 ), written on a FAT filesystem ( 00 ).
Please refer to section 2.3 / 2.3.1: https://tools.ietf.org/html/rfc1952#section-2.3.1
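The same reading of the header can be sketched in Python 3 with the struct module. This is only an illustration of the fixed 10-byte layout described in RFC 1952; the function name parse_gzip_header and the abbreviated OS_NAMES table are made up for this example, and the input is the hex dump shown above.

```python
import struct

# OS codes from RFC 1952, section 2.3.1 (only a few are listed here).
OS_NAMES = {0: "FAT filesystem (MS-DOS, OS/2, NT/Win32)", 3: "Unix", 255: "unknown"}


def parse_gzip_header(data: bytes) -> dict:
    """Decode the fixed 10-byte gzip member header."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # "<" = little-endian: 2 magic bytes, CM, FLG, 4-byte MTIME, XFL, OS.
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    return {
        "method": "deflate" if cm == 8 else cm,
        "flags": flg,
        "mtime": mtime,
        "extra_flags": xfl,
        "os": OS_NAMES.get(os_code, os_code),
    }


# The bytes from the xxd output above.
header = bytes.fromhex("1f8b0800000000000000ecbdeb")
print(parse_gzip_header(header))
```

A check like this only identifies the container format; the data still has to be decompressed with a matching wbits value.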
So, on to the Python:
>>> import requests
>>> url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-18/segments/1461860106452.21/warc/CC-MAIN-20160428161506-00006-ip-10-239-7-51.ec2.internal.warc.gz'
>>> response = requests.get(url, headers={"Range": "bytes=257173173-257248267"})
>>> unknown_compressed_data = response.content
Notice anything similar?
>>> unknown_compressed_data[:10]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00'
Now on to the decompression. Let's just try wbits values at random, based on the documentation:
>>> import zlib
"zlib.error: Error -2 while preparing to decompress data: inconsistent stream state":
>>> zlib.decompress(unknown_compressed_data, -31)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
zlib.error: Error -2 while preparing to decompress data: inconsistent stream state
"Error -3 while decompressing data: incorrect header check":
>>> zlib.decompress(unknown_compressed_data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check
"zlib.error: Error -3 while decompressing data: invalid distance too far back":
>>> zlib.decompress(unknown_compressed_data, 30)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: invalid distance too far back
>>> zlib.decompress(unknown_compressed_data, 31)
'WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2016-04-28T20:14:16Z\r\nWARC-Record-ID: <urn:uu