I am using urllib to fetch the HTML of a website as a string, and I need to put each word in the HTML document into a list.
Here is the code I have so far. I keep getting an error. I have also copied the error below.
import urllib.request
url = input("Please enter a URL: ")
z=urllib.request.urlopen(url)
z=str(z.read())
removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")
words = removeSpecialChars.split()
print ("Words list: ", words[0:20])
Here is the error.
Please enter a URL: http://simleyfootball.com
Traceback (most recent call last):
File "C:\Users\jeremy.KLUG\My Documents\LiClipse Workspace\Python Project 2\Module2.py", line 7, in <module>
removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")
TypeError: replace() takes at least 2 arguments (1 given)
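The error occurs because replace is called on the str type itself rather than on your string z: the first argument is taken as the string to operate on, so only one real argument is left. Note also that replace swaps one exact substring, so it would not strip each of those characters individually. One way to do that is re.sub, which is my preferred approach.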
import re
my_str = "hey th~!ere"
# Remove every character that is not a letter, digit, space, newline, or period
my_new_string = re.sub(r'[^a-zA-Z0-9 \n\.]', '', my_str)
print(my_new_string)
Output:
hey there
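Applied to the code in the question, the same idea might look like the sketch below. The helper name split_words and the sample string are assumptions for illustration; the sample stands in for the downloaded page so the snippet runs without a network connection. Punctuation is replaced with a space rather than removed, so that words separated only by a tag or a hyphen do not run together.

```python
import re

def split_words(text):
    """Replace every character that is not a letter, digit, or whitespace
    with a space, then split the result on whitespace."""
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    return cleaned.split()

# Hypothetical sample standing in for the page content
sample = "<p>Hello, world! Go-team: 2014</p>"
words = split_words(sample)
print("Words list:", words[0:20])
# Words list: ['p', 'Hello', 'world', 'Go', 'team', '2014', 'p']
```

In the original script the input would be the page text instead of sample. Keep in mind that z.read() returns bytes, and str() on bytes yields the b'...' representation rather than the decoded text, so decoding it first (for example with .decode('utf-8'), assuming the page is UTF-8) is likely what you want.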
Another way is to use re.escape:
import string
import re
my_str = "hey th~!ere"
chars = re.escape(string.punctuation)
# Build a character class from all ASCII punctuation and strip those characters
print(re.sub(r'['+chars+']', '', my_str))
Output:
hey there
A small style tip: per PEP 8, variable names in Python should be lowercase with underscores, so it should be remove_special_chars
and not removeSpecialChars
Also, if you want to remove the spaces as well, just change [^a-zA-Z0-9 \n\.]
to [^a-zA-Z0-9\n\.]
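To see what the space inside the character class does, here is a quick comparison using the same sample string as above. The variable names with_space and without_space are just for illustration.

```python
import re

my_str = "hey th~!ere"

# Space listed in the negated class: spaces survive the substitution
with_space = re.sub(r'[^a-zA-Z0-9 \n\.]', '', my_str)

# Space not listed: spaces are stripped along with the punctuation
without_space = re.sub(r'[^a-zA-Z0-9\n\.]', '', my_str)

print(with_space)     # hey there
print(without_space)  # heythere
```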