Extracting rootdomains from URL string in Google Sheets

snh_nl picture snh_nl · Aug 23, 2015 · Viewed 18k times · Source

Hi I am trying to extract the rootdomain from URL string in Google Sheets. I know how to get the domain and I have the formula to remove www. but now I realize it does not strip subdomain prefixes like 'mysite'.site.com; where mysite is not stripped from the domain name.

Question: How can I retrieve the domain.com rootdomain where the domain string contacts alphanumeric characters, then 1 dot, then alphanumeric characters (and nothing more)

Formula so far in Google Sheets:

=REGEXREPLACE(REGEXREPLACE(D3923;"(http(s)?://)?(www\.)?";"");"/.*";"")

Maybe this can be simplified ...

Test cases

    https://www.domain.com/ => domain.com
    https://domain.com/ => domain.com
    http://www.domain.nl/ => domain.com
    http://domain.de/ => domain.com
    http://www.domain.co.uk/ => domain.co.uk
    http://domain.co.au/ => domain.co.au
    sub.domain.org/ => sub.domain.com
    sub.domain.org => sub.domain.com
    domain.com => domain.com
    http://www.domain.nl?par=1  => domain.com
    https://www.domain.nl/test/?par=1  => domain.com
    http2://sub2.startpagina.nl/test/?par=1  => domain.com

enter image description here

Answer

snh_nl picture snh_nl · Aug 23, 2015

Currently using:

=trim(REGEXEXTRACT(REGEXREPLACE(REGEXREPLACE(A2;"https?://";"");"^(w{3}\.)?";"")&"/";"([^/?]+)"))

Seems to work fine

Updated:7-7-2016

(thanks for all the help!)