Using BeautifulSoup to search HTML for a string

kachilous · Jan 20, 2012 · Viewed 105.4k times

I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see whether the string 'Python' appears on the page http://python.org

When I used:

find_string = soup.body.findAll(text='Python')

find_string returned []

But when I used:

find_string = soup.body.findAll(text=re.compile('Python'), limit=1)

find_string returned [u'Python Jobs'] as expected

What is the difference between these two statements that makes the second one work when the page contains more than one instance of the word being searched for?

Answer

sgallen · Jan 20, 2012

The following line is looking for the exact NavigableString 'Python':

>>> soup.body.findAll(text='Python')
[]

Note that the following NavigableString is found:

>>> soup.body.findAll(text='Python Jobs') 
[u'Python Jobs']

Note that a regexp anchored to match the whole text node also finds nothing:

>>> import re
>>> soup.body.findAll(text=re.compile('^Python$'))
[]

So your regexp matches any NavigableString that contains 'Python', whereas the plain string argument only matches a NavigableString that is exactly 'Python'.
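
The difference can be reproduced offline with a small sketch. It makes several assumptions that are not in the original post: it uses the newer bs4 package rather than the BeautifulSoup 3 import the question appears to use, it calls find_all with string= (the current spelling of findAll and text=), and the HTML snippet is a made-up stand-in for python.org. The stand-in also contains one node that is exactly 'Python', so the exact-match case has something to return, unlike on the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the real python.org markup differs.
html = """
<html><body>
  <a href="/jobs">Python Jobs</a>
  <p>Welcome to Python programming.</p>
  <span>Python</span>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the entire text node.
exact = soup.body.find_all(string="Python")

# A compiled pattern is applied with search(), so containing the word is enough.
contains = soup.body.find_all(string=re.compile("Python"))

# Anchoring the pattern brings back whole-node matching.
anchored = soup.body.find_all(string=re.compile(r"^Python$"))

print(exact)
print(contains)
print(anchored)
```

If the goal is only to know whether the word occurs anywhere on the page, checking that the list returned for re.compile('Python') is non-empty should be enough.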