Given a URL, how do I extract the registered domain using the Public Suffix List (list of effective TLDs, e.g. this list)?
For instance, considering a.bg is a valid public suffix:
http://www.test.start.a.bg/hello.html -> start.a.bg
http://test.start.a.bg/ -> start.a.bg
http://test.start.abc.bg/ -> abc.bg (.bg is the public suffix)
This cannot be done using simple string manipulation because the public suffix can consist of multiple levels depending on the TLD.
P.S. It doesn't matter how I read the list (database or flat file), but the list should be accessible locally so I'm not always dependent on external services.
You can use parse_url() to extract the hostname, then use the library provided by regdom to determine the registered domain name (dn + eTLD). For example:
require_once("effectiveTLDs.inc.php");
require_once("regDomain.inc.php");
$url = 'http://www.metu.edu.tr/dhasjkdas/sadsdds/sdda/sdads.html';
echo getRegisteredDomain(parse_url($url, PHP_URL_HOST));
That will print out metu.edu.tr.
Other examples I've tried:
http://www.xyz.start.bg/hello -> start.bg
http://www.start.a.bg/world -> start.a.bg (a.bg is a listed eTLD)
http://xyz.ma219.metu.edu.tr -> metu.edu.tr
http://www.google.com/search -> google.com
http://google.co.uk/search?asd -> google.co.uk
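If you are curious what such a lookup does underneath, here is a minimal sketch of the longest-matching-suffix idea. It is not the regdom implementation: the function name registeredDomainSketch and the SAMPLE_SUFFIXES array are made up for illustration, the array is a tiny hand-picked subset rather than the real list, and wildcard (*.) and exception (!) rules are deliberately left out.

```php
<?php
// Hypothetical, hand-picked subset of the Public Suffix List (illustration only).
// The real list is much larger and also contains wildcard and exception rules,
// which this sketch does not handle.
const SAMPLE_SUFFIXES = ['com', 'bg', 'a.bg', 'uk', 'co.uk', 'tr', 'edu.tr'];

function registeredDomainSketch(string $url, array $suffixes = SAMPLE_SUFFIXES): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no usable hostname in the URL
    }

    // Normalise: lowercase, drop a trailing dot, split into labels.
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $lookup = array_flip($suffixes); // suffix => index, for O(1) isset() checks
    $count  = count($labels);

    // Walk from the longest candidate suffix to the shortest,
    // so "a.bg" is found before "bg".
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($lookup[$candidate])) {
            // If the whole host is itself a public suffix, nothing is registered.
            return $i === 0 ? null : implode('.', array_slice($labels, $i - 1));
        }
    }

    // No rule matched: assume the last label is the suffix (the list's default rule).
    return $count >= 2 ? implode('.', array_slice($labels, -2)) : null;
}

echo registeredDomainSketch('http://www.test.start.a.bg/hello.html'), "\n";
echo registeredDomainSketch('http://test.start.abc.bg/'), "\n";
```

With the sample suffixes above, the two calls resolve to start.a.bg and abc.bg respectively. For real use, load the full list from a local file into the lookup array, or stick with the library, which covers the wildcard and exception rules too.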
UPDATE: These libraries have been moved to: https://github.com/leth/registered-domains-php