Is there a way to do a DFT on a BeautifulSoup parse tree? I'm trying to do something like starting at the root, usually , get all the child elements and then for each child element get their children, etc until I hit a terminal node at which point I'll build my way back up the tree. Problem is I can't seem to find a method that will allow me to do this. I found the findChildren method but that seems to just put the entire page in a list multiple times with each subsequent entry getting reduced. I might be able to use this to do a traversal however other than the last entry in the list it doesn't appear there is any way to identify entries as terminal nodes or not. Any ideas?
mytag.find_all()
already does that:
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on
from bs4 import BeautifulSoup # pip install beautifulsoup4
soup = BeautifulSoup("""<!doctype html>
<div id=a>A
<div id=1>A1</div>
<div id=2>A2</div>
</div>
<div id=b>B
<div id=I>BI</div>
<div id=II>BII</div>
</div>
""")
for div in soup.find_all("div", recursive=True):
print(div.get('id'))
a
1
2
b
I
II
The output confirms that, it is a depth first traversal.
Old Beautiful Soup 3 answer:
recursiveChildGenerator()
already does that:
soup = BeautifulSoup.BeautifulSoup(html)
for child in soup.recursiveChildGenerator():
name = getattr(child, "name", None)
if name is not None:
print name
elif not child.isspace(): # leaf node, don't print spaces
print child
For the html from @msalvadores's answer:
html
ul
li
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
li
Aliquam tincidunt mauris eu risus.
li
Vestibulum auctor dapibus neque.
html
NOTE: html
is printed twice due to the example contains two opening <html>
tags.