I am working on an app which needs to parse URLs (mostly HTTP URLs) in HTML pages - I have no control over the input and some of it is, as expected, a bit messy.
One problem I'm encountering frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double-slashes in the path part, for example:
import urlparse

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, urlparse.urlparse(testUrl).path)
Instead of the expected result http://www.example.com//path (or even better, with a normalized single slash), I end up with http://path.
BTW the reason I'm running such code is because it's the only way I found so far to strip the query / fragment part off of URLs. Maybe there is a better way to do it, but I couldn't find one.
Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?
If you only want to get the url without the query part, I would skip the urlparse module and just do:
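testUrl.split('?', 1)
The URL will be at index 0 of the returned list and the query (together with any fragment), if there is one, at index 1.
The maxsplit of 1 matters because a literal '?' may legally appear again inside the query or fragment; only the first one starts the query. Note that a URL with a fragment but no query keeps its fragment, so this covers only the query case.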
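If you would rather stay with the standard library, here is a minimal sketch of another option. The result http://path in the question comes from the fact that a reference starting with '//' is read as a network-path reference, so 'path' is taken as the host when it is joined against the base. Splitting the URL and reassembling it without the query and fragment avoids the join entirely. The names strip_query_and_fragment and collapse_slashes are my own, purely for illustration, and the try/except import is only there on the assumption that the code may need to run on both Python 2 and Python 3.

```python
import re

try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2 (the module used in the question)


def strip_query_and_fragment(url, collapse_slashes=False):
    """Return url without its query and fragment (hypothetical helper).

    If collapse_slashes is true, runs of '/' in the path are reduced
    to a single '/'.
    """
    parts = urlsplit(url)
    path = parts.path
    if collapse_slashes:
        path = re.sub(r'/{2,}', '/', path)
    # Rebuild from the components, leaving query and fragment empty.
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))


testUrl = 'http://www.example.com//path?foo=bar'
print(strip_query_and_fragment(testUrl))
# http://www.example.com//path
print(strip_query_and_fragment(testUrl, collapse_slashes=True))
# http://www.example.com/path
```

Collapsing slashes is left optional because some servers treat //path and /path as different resources, so it is safer to opt in only when you know normalizing is acceptable for your input.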