Get Root Domain of Link

Gavin Schulz · Oct 5, 2009 · Viewed 19.8k times

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in Python?

Answer

Ben Blank · Oct 5, 2009

Getting the hostname is easy enough using urlparse:

from urllib.parse import urlparse  # Python 3; in Python 2 the module was named urlparse
hostname = urlparse("http://www.techcrunch.com/").hostname
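One caveat worth checking before relying on this: urlparse only fills in the hostname when the link carries a scheme (or starts with "//"). The following is a minimal Python 3 sketch, where the module is urllib.parse; the helper name hostname_of is made up for illustration and is not part of any library.

```python
from urllib.parse import urlparse

def hostname_of(link):
    """Return the lowercased hostname of a link, or None if none is found."""
    parsed = urlparse(link)
    if parsed.hostname is None and "://" not in link:
        # Without a scheme, urlparse treats the whole string as a path,
        # so retry it as a scheme-relative reference.
        parsed = urlparse("//" + link)
    return parsed.hostname

print(hostname_of("http://www.techcrunch.com/"))
print(hostname_of("www.techcrunch.com/some/page"))
print(hostname_of("HTTP://WWW.TechCrunch.com:8080/"))
```

All three calls should print www.techcrunch.com, since the hostname attribute is lowercased and excludes the port.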

Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
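To see why a purely syntactic rule falls short, here is a small sketch of the obvious guess, keeping the last two labels of the hostname. The function name last_two_labels is hypothetical and only serves to illustrate the problem.

```python
def last_two_labels(hostname):
    """Naive guess: keep only the final two dot-separated labels."""
    return ".".join(hostname.split(".")[-2:])

print(last_two_labels("www.techcrunch.com"))     # techcrunch.com
print(last_two_labels("www.theregister.co.uk"))  # co.uk (a suffix, not a site)
print(last_two_labels("devbox12"))               # devbox12
```

The first result is what the question asks for, but the second is wrong, and nothing in the string itself says so.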

One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and longer suffixes which are used like TLDs (e.g. ".co.uk", or even privately operated ones such as ".github.io"). You can access the PSL from Python using the publicsuffix2 library:

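import publicsuffix2
from urllib.parse import urlparse  # Python 3; in Python 2 the module was named urlparse

def get_base_domain(url):
    # fetch() downloads the current list over HTTP; if your script is
    # running more than, say, once a day, you'd want to cache it yourself.
    # Make sure you update frequently, though!
    psl = publicsuffix2.fetch()

    hostname = urlparse(url).hostname

    # Despite its name, get_public_suffix() returns the registrable domain
    # (the public suffix plus one more label), e.g. "techcrunch.com".
    return publicsuffix2.get_public_suffix(hostname, psl)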
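For intuition about what the list lookup does, here is a rough offline sketch of the core idea: find the longest known suffix and keep one more label in front of it. It is a simplification under stated assumptions, not the library's implementation. SAMPLE_SUFFIXES is a hand-picked, hypothetical subset, the name registrable_domain is invented here, and the real list's wildcard, exception, and default rules are ignored.

```python
# Hypothetical, hand-picked subset; the real list has thousands of entries
# plus wildcard and exception rules that this sketch does not handle.
SAMPLE_SUFFIXES = {"com", "net", "org", "uk", "co.uk", "io", "github.io"}

def registrable_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one more label, or None."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try candidates from longest to shortest so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the hostname itself is a bare suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None  # no known suffix, e.g. an intranet name like "devbox12"

print(registrable_domain("www.techcrunch.com"))
print(registrable_domain("www.theregister.co.uk"))
print(registrable_domain("devbox12"))
```

With this sample set the calls print techcrunch.com, theregister.co.uk, and None. Returning None for an unknown suffix is a choice made for this sketch, so that intranet names are flagged rather than guessed at.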