Find http:// and/or www. and strip them from a URL, leaving domain.com

Paul Tricklebank · Jan 31, 2013 · Viewed 23.6k times

I'm quite new to Python. I'm trying to parse a file of URLs to leave only the domain name.

Some of the URLs in my log file begin with http://, some begin with www., and some begin with both.

This is the part of my code which strips the http:// part. What do I need to add to it to look for both http:// and www. and remove both?

line = re.findall(r'(https?://\S+)', line)

Currently, when I run the code, only http:// is stripped. If I change the code to the following:

line = re.findall(r'(https?://www.\S+)', line)

Only domains starting with both are affected. I need the code to be more conditional. Thanks in advance.

Edit: here is my full code:

import re
import sys
from urlparse import urlparse  # Python 2 module

with open(sys.argv[1], "r") as f:
    for line in f:
        # Only URLs that include a scheme are matched here
        urls = re.findall(r'(https?://\S+)', line)
        if urls:
            parsed = urlparse(urls[0])
            print parsed.hostname

I mistagged my original post as regex. It does indeed use urlparse.

Answer

Markus Unterwaditzer · Jan 31, 2013

It might be overkill for this specific situation, but I'd generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

url = 'www.python.org'

# URLs must have a scheme
# www.python.org is an invalid URL
# http://www.python.org is valid

if not re.match(r'http(s?)\:', url):
    url = 'http://' + url

# url is now 'http://www.python.org'

parsed = urlsplit(url)

# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' (an empty string), since no path was given

host = parsed.netloc  # www.python.org

# Removing www.
# This is a bad idea, because www.python.org could 
# resolve to something different than python.org

if host.startswith('www.'):
    host = host[4:]
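
For reference, the steps above could be folded into one small helper along these lines. This is only a sketch for Python 3: the name bare_domain is made up for illustration, and it uses the hostname attribute of the urlsplit result instead of netloc, so any port is dropped and the host is lowercased. It also assumes that stripping www. is acceptable for your data, with the same caveat as above.

```python
import re
from urllib.parse import urlsplit


def bare_domain(url):
    """Return the host of `url` without a leading 'www.' (hypothetical helper)."""
    url = url.strip()
    # urlsplit only fills in the host when a scheme is present, so add one if missing.
    if not re.match(r'https?://', url, re.IGNORECASE):
        url = 'http://' + url
    # hostname is None when there is no host at all; fall back to an empty string.
    host = urlsplit(url).hostname or ''
    # Assumption: www.example.com and example.com refer to the same site.
    if host.startswith('www.'):
        host = host[4:]
    return host


for sample in ['http://www.python.org/about/', 'www.python.org', 'https://python.org']:
    print(bare_domain(sample))  # prints python.org each time
```

Applied to a log file, the same helper could be called on each URL pulled out of a line, so entries with http://, www., both, or neither all end up in the same form.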