Depth First Traversal on BeautifulSoup Parse Tree

Ian Burris picture Ian Burris · Jan 27, 2011 · Viewed 11.3k times · Source

Is there a way to do a DFT on a BeautifulSoup parse tree? I'm trying to do something like starting at the root, usually , get all the child elements and then for each child element get their children, etc until I hit a terminal node at which point I'll build my way back up the tree. Problem is I can't seem to find a method that will allow me to do this. I found the findChildren method but that seems to just put the entire page in a list multiple times with each subsequent entry getting reduced. I might be able to use this to do a traversal however other than the last entry in the list it doesn't appear there is any way to identify entries as terminal nodes or not. Any ideas?

Answer

jfs picture jfs · Jan 27, 2011

mytag.find_all() already does that:

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on

from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup("""<!doctype html>
<div id=a>A
  <div id=1>A1</div>
  <div id=2>A2</div>
</div>
<div id=b>B
  <div id=I>BI</div>
  <div id=II>BII</div>
</div>
""")

for div in soup.find_all("div", recursive=True):
    print(div.get('id'))

Output

a
1
2
b
I
II

The output confirms that, it is a depth first traversal.


Old Beautiful Soup 3 answer:

recursiveChildGenerator() already does that:

soup = BeautifulSoup.BeautifulSoup(html)
for child in soup.recursiveChildGenerator():
     name = getattr(child, "name", None)
     if name is not None:
         print name
     elif not child.isspace(): # leaf node, don't print spaces
         print child

Output

For the html from @msalvadores's answer:

html
ul
li
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
li
Aliquam tincidunt mauris eu risus.
li
Vestibulum auctor dapibus neque.
html

NOTE: html is printed twice due to the example contains two opening <html> tags.