Scrapy Body Text Only

mmrs151 · Mar 22, 2011 · Viewed 8.4k times

I am trying to scrape only the text from the body of a page using Python and Scrapy, but haven't had any luck yet.

I'm hoping someone can help me scrape all the text from the <body> tag.

Answer

Eli Bendersky · Mar 22, 2011

Scrapy uses XPath notation to extract parts of an HTML document. Have you tried using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector:

x.select("//body").extract()    # x is the selector for the response; returns the <body> markup as a list of strings

You can find more information about the selectors Scrapy provides in the Selectors section of the Scrapy documentation.