This will surely be an easy one but it is really bugging me.
I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.
All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.
Every time I go to print out a variable that holds 'String' I get [u'String']
printed to the screen. Is there a simple way of getting this back into just ascii or should I write a regex to strip it?
[u'ABC']
would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.
I don't know exaxtly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your test is really only ASCII you would use this:
soup[0].encode("ascii")
However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.
soup[0].encode("latin-1")
soup[0].encode("utf-8")
Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:
soup[0].encode(soup.originalEncoding)