I download a file using the get function of the Python requests library. For storing the file, I'd like to determine the filename the way a web browser would for its 'save' or 'save as ...' dialog.
Easy, right? I can just get it from the Content-Disposition HTTP header, accessible on the response object:
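import re
# r is the response object returned by requests.get(...)
d = r.headers['content-disposition']
# findall returns a list of all matches, not a single string
fname = re.findall("filename=(.+)", d)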
But looking more closely at this topic, it isn't that easy:
According to RFC 6266 section 4.3, and the grammar in section 4.1, the value can be an unquoted token (e.g. the_report.pdf) or a quoted string that can also contain whitespace (e.g. "the report.pdf") and escape sequences. Further,
when both "filename" and "filename*" are present in a single header field value, [we] SHOULD pick "filename*" and ignore "filename".
The value of filename*, though, is somewhat more complicated than that of filename.
Also, the RFC seems to allow for additional whitespace around the =.
Thus, for the examples listed in the RFC, I'd want the following results:
Content-Disposition: Attachment; filename=example.html
filename: example.html
Content-Disposition: INLINE; FILENAME= "an example.html"
filename: an example.html
Content-Disposition: attachment;
filename*= UTF-8''%e2%82%ac%20rates
filename: € rates
Content-Disposition: attachment;
filename="EURO rates";
filename*=utf-8''%e2%82%ac%20rates
filename: € rates here, too (not EURO rates, as filename* takes precedence)
Now, I could easily adapt the regular expression to account for variable whitespace around the =, but having it handle all the other variations, too, would get rather unwieldy. (With the quoting and escaping, I'm not even sure a regex can cover all the cases. Maybe it can, as there is no brace nesting involved.)
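To make the gaps concrete, here is a quick check of the naive pattern from above against header values modelled on the RFC's examples. The names naive and samples are only illustrative.

```python
import re

# The naive pattern from the snippet above: case-sensitive, no whitespace
# handling, and no knowledge of quoting or of filename*.
naive = re.compile("filename=(.+)")

samples = [
    'Attachment; filename=example.html',
    'INLINE; FILENAME= "an example.html"',
    "attachment; filename*= UTF-8''%e2%82%ac%20rates",
    'attachment; filename="EURO rates"; filename*=utf-8\'\'%e2%82%ac%20rates',
]

for value in samples:
    # findall returns a list: empty when nothing matched
    print(repr(value), '->', naive.findall(value))
```

Only the first value comes out as wanted. The second and third do not match at all, and the last one captures everything up to the end of the header, quotes included.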
So do I have to implement a full-blown parser, or can I determine the filename according to RFC 6266 with a few calls to an HTTP library (maybe requests itself)? As RFC 6266 is part of the HTTP standard, I could imagine that some libraries specializing in HTTP already cover this. (So I've also asked on Software Recommendations SE.)
The rfc6266 library appears to do exactly what you need. It can parse raw headers, requests responses, and urllib2 responses. It's on PyPI.
Some examples:
>>> import rfc6266, requests
>>> rfc6266.parse_headers('''Attachment; filename=example.html''').filename_unsafe
'example.html'
>>> rfc6266.parse_headers('''INLINE; FILENAME= "an example.html"''').filename_unsafe
'an example.html'
>>> rfc6266.parse_headers(
...     '''attachment; '''
...     '''filename*= UTF-8''%e2%82%ac%20rates''').filename_unsafe
'€ rates'
>>> rfc6266.parse_headers(
...     '''attachment; '''
...     '''filename="EURO rates"; '''
...     '''filename*=utf-8''%e2%82%ac%20rates''').filename_unsafe
'€ rates'
>>> r = requests.get('http://example.com/€ rates')
>>> rfc6266.parse_requests_response(r).filename_unsafe
'€ rates'
As a note, though: this library does not like nonstandard whitespace in the header.
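If adding a dependency is not an option, the rules quoted in the question can be approximated with the standard library alone. The following is a minimal sketch, not a full RFC 6266 parser. The function name filename_from_content_disposition and the regular expression are my own illustrative choices. It assumes a reasonably well-formed header value, and it does not sanitize the result for use as a path.

```python
import re
from urllib.parse import unquote

# One parameter: ";", a name, optional whitespace around "=", then either
# a quoted-string (with backslash escapes) or an unquoted token.
_PARAM = re.compile(
    r';\s*(?P<name>[^\s=;]+)\s*=\s*'
    r'(?:"(?P<quoted>(?:[^"\\]|\\.)*)"|(?P<token>[^\s;]+))'
)


def filename_from_content_disposition(value):
    """Return the filename from a Content-Disposition value, or None."""
    params = {}
    for match in _PARAM.finditer(value):
        name = match.group('name').lower()  # parameter names are case-insensitive
        if match.group('quoted') is not None:
            # Undo backslash escapes inside the quoted-string.
            text = re.sub(r'\\(.)', r'\1', match.group('quoted'))
        else:
            text = match.group('token')
        params.setdefault(name, text)  # keep the first occurrence only

    ext = params.get('filename*')
    if ext is not None:
        try:
            # ext-value has the shape: charset'language'percent-encoded-bytes
            charset, _language, encoded = ext.split("'", 2)
            return unquote(encoded, encoding=charset, errors='strict')
        except (ValueError, LookupError):
            pass  # malformed ext-value or unknown charset: fall back to filename
    return params.get('filename')


print(filename_from_content_disposition(
    'attachment; filename="EURO rates"; '
    "filename*=utf-8''%e2%82%ac%20rates"))
```

Unlike the naive pattern, this sketch accepts whitespace around the = and prefers filename* over filename when both are present. The returned name is still untrusted input, much as the filename_unsafe attribute name suggests.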