Canonicalize / normalize a URL?

XZS picture XZS · May 14, 2012 · Viewed 10.5k times · Source

I am searching for a library function to normalize a URL in Python, that is to remove "./" or "../" parts in the path, or add a default port or escape special characters and so on. The result should be a string that is unique for two URLs pointing to the same web page. For example http://google.com and http://google.com:80/a/../ shall return the same result.

I would prefer Python 3 and already looked through the urllib module. It offers functions to split URLs but nothing to canonicalize them. Java has the URI.normalize() function that does a similar thing (though it does not consider the default port 80 equal to no given port), but is there something like this is python?

Answer

Thomas K picture Thomas K · May 14, 2012

How about this:

In [1]: from urllib.parse import urljoin

In [2]: urljoin('http://example.com/a/b/c/../', '.')
Out[2]: 'http://example.com/a/b/'

Inspired by answers to this question. It doesn't normalize ports, but it should be simple to whip up a function that does.