I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize, but then I saw that the website had a button that created content via JavaScript, so I decided to use Selenium.
Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?

In this particular case, I need to use Selenium to click the JavaScript button anyway, so is it better to parse with Selenium as well, or should I use both Selenium and BeautifulSoup?
Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use an HTTP library (like Requests or the built-in urllib.request) with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too). The advantages of not using Selenium needlessly:
- Bandwidth, and time to run your script. Using Selenium means fetching all the resources that would normally be fetched when you visit a page in a browser - stylesheets, scripts, images, and so on - which is probably unnecessary.
- Stability and ease of error recovery. A browser instance can hang or crash, and building the architecture to detect that and spin up a replacement is more irritating than setting up simple retry-on-exception logic when using requests.
- CPU and memory usage. Depending on the site you're crawling and how many spider threads you run in parallel, DOM layout and JavaScript execution can get expensive.

Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using cookielib/cookiejar (http.cookiejar in Python 3).
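To make that concrete, here is a minimal sketch of the static-page approach. The URL, tag, and class names are placeholders, and requests plus BeautifulSoup are assumed to be installed; a requests.Session also covers the cookie case just mentioned, since it stores whatever cookies the server sets and sends them back automatically.

```python
import requests
from bs4 import BeautifulSoup

# A Session persists cookies across requests, so a cookie-gated site
# doesn't by itself force you into Selenium.
session = requests.Session()

# Placeholder URL - substitute the static page you actually want to scrape.
response = session.get("https://example.com/articles")
response.raise_for_status()

# "lxml" is the fast parser mentioned above; "html.parser" works too
# if lxml isn't installed.
soup = BeautifulSoup(response.text, "lxml")

# Placeholder selector - pull out whatever elements you actually need.
for heading in soup.select("h2.title"):
    print(heading.get_text(strip=True))
```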
Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery. Usually one of these scenarios applies:

- The content is already baked into the JavaScript served with the page, and the script just templates it into the DOM. In that case, see whether you can pull the content you want straight out of the JavaScript, for example with a regex.
- The JavaScript is hitting a web API to load the content. In that case, identify the relevant API URLs and hit them yourself; this is often simpler and more direct than running the JavaScript and scraping the rendered page (see the sketch below).
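Here is a sketch of that second scenario, assuming you've found the endpoint by watching the browser's network tab while the page loads; the URL, query parameter, and field names are all hypothetical.

```python
import requests

# Hypothetical endpoint discovered in the browser's developer tools.
API_URL = "https://example.com/api/items"

response = requests.get(API_URL, params={"page": 1})
response.raise_for_status()

# Endpoints like this usually return JSON, which is far easier to
# work with than scraping the rendered HTML.
for item in response.json():
    print(item.get("name"), item.get("price"))
```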
If you do decide your situation merits using Selenium, use it in headless mode, which is supported by (at least) the Firefox and Chrome drivers. Web spidering doesn't ordinarily require actually graphically rendering the page, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.
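If you do end up needing the browser - as in your case, where a button has to be clicked - a common pattern is to let Selenium do only the browser work and hand the rendered HTML to BeautifulSoup for parsing. The sketch below assumes a recent Selenium 4 install (which can fetch a Chrome driver itself) and uses its current find_element(By...) style rather than the older find_element_by_xpath; the URL and selectors are placeholders.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")  # headless Chrome: no visible window

driver = webdriver.Chrome(options=options)
try:
    # Placeholder URL and button selector - replace with your actual page.
    driver.get("https://example.com/page-with-js-button")
    driver.find_element(By.CSS_SELECTOR, "button.load-more").click()
    # If the new content takes time to appear, add an explicit wait here.

    # Hand the rendered HTML to BeautifulSoup for the actual parsing.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for row in soup.select("div.result"):
        print(row.get_text(strip=True))
finally:
    driver.quit()
```

Whether you then parse with BeautifulSoup, lxml, or Selenium's own element-finding methods is mostly a matter of convenience once the page has rendered; either way, keeping the browser's involvement to a minimum keeps the crawl cheap and stable.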