SGML Parser in Python

afg102 picture afg102 · Jan 8, 2011 · Viewed 9.8k times · Source

I am completely new to Python. I have the following code:

class ExtractTitle(sgmllib.SGMLParser):

def __init__(self, verbose=0):

   sgmllib.SGMLParser.__init__(self, verbose)

   self.title = self.data = None

def handle_data(self, data):

  if self.data is not None:
    self.data.append(data)

def start_title(self, attrs):
 self.data = []

def end_title(self):

  self.title = string.join(self.data, "")

raise FoundTitle # abort parsing!

which extracts the title element from SGML, however it only works for a single title. I know I have to overload the unknown_starttag and unknown_endtag in order to get all titles but I keep getting it wrong. Help me please!!!

Answer

Chris Morgan picture Chris Morgan · Jan 8, 2011

Beautiful Soup is one way you could parse it nicely (and it's the way I'd always do it, unless there was some extremely good reason not to do it that way, myself). It's a lot simpler and more readable than using SGMLParser.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<post id='100'> <title> new title </title> <text> <p> new text </p> </text> </post>''')
>>> soup('post')  # soup.findAll('post') is equivalent
[<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>]
>>> for post in soup('post'):
...     print post.findChild('text')
...
<text> <p> new text </p> </text>

Once you've got it at this stage, you can do various things with it, depending on how you want it.

>>> post = soup.find('post')
>>> post
<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>
>>> post_text = post.findChild('text')
>>> post_text
<text> <p> new text </p> </text>

You could be wanting to strip out HTML.

>>> post_text.text
u'new text'

Or perhaps look at the contents...

>>> post_text.renderContents()
' <p> new text </p> ']
>>> post_text.contents
[u' ', <p> new text </p>, u' ']

There are all sorts of things you could want to do. If you're more specific - in especial, providing real data - it helps.

When it comes to manipulating the tree, you can do that too.

>>> post
<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>
>>> post.title  # Just as good as post.findChild('title')
<title> new title </title>
>>> post.title.extract()  # Throws it out of the tree and returns it but we have no need for it
<title> new title </title>
>>> post  # title is gone!
<post id="100">  <text> <p> new text </p> </text> </post>
>>> post.findChild('text').replaceWithChildren()  # Thrown away the <text> wrapping
>>> post
<post id="100">   <p> new text </p>  </post>

And so, finally, you'd have something like this:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''
... <post id='100'> <title> new title 100 </title> <text> <p> new text 100 </p> </text> </post>
... <post id='101'> <title> new title 101 </title> <text> <p> new text 101 </p> </text> </post>
... <post id='102'> <title> new title 102 </title> <text> <p> new text 102 </p> </text> </post>
... ''')
>>> for post in soup('post'):
...     post.title.extract()
...     post.findChild('text').replaceWithChildren()
... 
<title> new title 100 </title>
<title> new title 101 </title>
<title> new title 102 </title>
>>> soup

<post id="100">   <p> new text 100 </p>  </post>
<post id="101">   <p> new text 101 </p>  </post>
<post id="102">   <p> new text 102 </p>  </post>