Scraping the content of all div tags with a specific class

Andrew Brēza · Jan 22, 2018

I'm scraping a website and want all the text that occurs in div tags of a specific class. In the following example, I want to extract everything that's in a div of class "a".

site <- "<div class='a'>Hello, world</div>
  <div class='b'>Good morning, world</div>
  <div class='a'>Good afternoon, world</div>"

My desired output is...

"Hello, world"
"Good afternoon, world"

The code below extracts the text from every div, but I can't figure out how to include only class="a".

library(tidyverse)
library(rvest)

site %>% 
  read_html() %>% 
  html_nodes("div") %>% 
  html_text()

# [1] "Hello, world"          "Good morning, world"   "Good afternoon, world"

With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").

Answer

neilfws · Jan 22, 2018

The CSS selector for div with class = "a" is div.a:

site %>% 
  read_html() %>% 
  html_nodes("div.a") %>% 
  html_text()
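
For reference, here is a self-contained sketch of the CSS-selector approach run on the sample from the question. The variable name `result` is only for illustration; the comment about `html_elements()` assumes rvest 1.0.0 or later, where it is the newer name for `html_nodes()`.

```r
library(rvest)

# Sample HTML copied from the question
site <- "<div class='a'>Hello, world</div>
  <div class='b'>Good morning, world</div>
  <div class='a'>Good afternoon, world</div>"

result <- site %>%
  read_html() %>%
  html_nodes("div.a") %>%   # in rvest >= 1.0.0, html_elements("div.a") is equivalent
  html_text()

result
# [1] "Hello, world"          "Good afternoon, world"
```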

Or you can use XPath, swapping the html_nodes() step in the pipeline for:

html_nodes(xpath = "//div[@class='a']")
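
One difference worth knowing: `@class='a'` compares the whole class attribute, while the CSS selector `div.a` matches any div that has `a` among its classes. The sketch below uses a hypothetical variation of the sample (a div with class "a highlight", which is not in the question) to show the difference, plus a common XPath 1.0 idiom for token matching.

```r
library(rvest)

# Hypothetical sample: the second div carries two classes
page <- read_html("<div class='a'>Hello, world</div>
  <div class='a highlight'>Good afternoon, world</div>
  <div class='b'>Good morning, world</div>")

# Exact match on the full attribute value: skips class='a highlight'
exact <- page %>%
  html_nodes(xpath = "//div[@class='a']") %>%
  html_text()

# Token match: treats class as a space-separated list, like div.a in CSS
token <- page %>%
  html_nodes(xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]") %>%
  html_text()
```

Here `exact` holds only "Hello, world", while `token` holds both class-"a" texts. If the pages you scrape may put several classes on one div, the CSS selector is usually the simpler choice.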