The webpage is something like this:
<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>
<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>
How can I find each section together with the articles that belong to it? That is, after finding an h2,
collect its next siblings until the next h2.
If the webpage were structured like this (which is the more common case):
<div>
<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>
<div>
<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>
I could write code like:
for section in soup.findAll('div'):
    ...
    for post in section.findAll('p'):
        ...
But what should I do with the first webpage if I want to get the same result?
I think you can do something like this:
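for section in soup.findAll('h2'):
    nextNode = section
    while True:
        # step to the following sibling of the current node
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            # no more siblings (None) or a node without a tag name
            tag_name = ""
        if tag_name == "p":
            print(nextNode.string)
        else:
            # anything other than a <p> ends the current section
            print("*****")
            break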
Given:
<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>
<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
Output:
article1
article2
article3
*****
article4
article5
article6
*****
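For reference, here is a minimal sketch of the same idea written against the newer bs4 package (an assumption; the code above uses the older findAll/nextSibling names). The helper name group_sections and the SAMPLE_HTML constant are made up for illustration. Unlike the loop above, this variant skips text nodes such as the newlines between tags and stops only at the next h2, so it may behave differently on pages that mix other tags in between.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample input mirroring the "Given" markup above (newlines between tags included).
SAMPLE_HTML = """<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>
<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
"""


def group_sections(html):
    """Return a list of (heading text, [paragraph texts]) pairs.

    Hypothetical helper: walks the siblings after each <h2> and collects
    <p> tags until the next <h2> is reached.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for heading in soup.find_all("h2"):
        articles = []
        for sibling in heading.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace and other text nodes
            if sibling.name == "h2":
                break  # the next section starts here
            if sibling.name == "p":
                articles.append(sibling.get_text())
        sections.append((heading.get_text(), articles))
    return sections


if __name__ == "__main__":
    for title, articles in group_sections(SAMPLE_HTML):
        print(title)
        for article in articles:
            print("  " + article)
```

Returning a list of pairs rather than a dict keeps the sections in document order and avoids collisions if two headings happen to share the same text.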