How to extract texts between <p> tags

rena-c · May 23, 2013 · Viewed 29.1k times

I want to extract the text inside the p and li tags of HTML pages, so that I can tokenize each page and build an inverted index for it in order to answer search queries.

How can I get the p tags using jsoup?

Elements e = doc.select(""); 

What string should be passed as that parameter?

Answer

MaVRoSCy · May 23, 2013

This will do the job:

Elements e = doc.select("p");

The jsoup Selector documentation lists all the selectors you can use.
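Since the question asks about li tags as well as p tags, here is a minimal sketch of selecting both in one call. It assumes jsoup is on the classpath; the class name ParagraphAndListText and the method extract are only illustrative names, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used here only for illustration.
public class ParagraphAndListText {

    // Returns the text of every <p> and <li> element found in the HTML.
    public static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // "p, li" is a selector group: it matches elements that are either p or li.
        for (Element e : doc.select("p, li")) {
            texts.add(e.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p>first paragraph</p><ul><li>item one</li><li>item two</li></ul>";
        System.out.println(extract(html));
    }
}
```

One thing to watch: if a p is nested inside an li, both elements match, so the same text shows up twice in the list.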

Suppose you have this HTML:

String html = "<p>some <strong>bold</strong> text</p>";

To get "some bold text" as the result, you can use:

Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String text = doc.body().text(); // some bold text

or

String text = p.text(); // some bold text

Suppose now you have the following, more complex HTML:

String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";

To get the text from the two p tags, you can do something like this:

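Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");

StringBuilder pConcatenated = new StringBuilder();
for (Element x : p) {
  // text() trims each paragraph, so add a space between consecutive ones
  if (pConcatenated.length() > 0) pConcatenated.append(' ');
  pConcatenated.append(x.text());
}

System.out.println(pConcatenated); // some text another p tag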
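The question mentions tokenizing the extracted text to build an inverted index. As a rough sketch of that next step, the following uses only the Java standard library (Java 9 or later) and takes text that has already been extracted, for example with text() as above. The class InvertedIndexSketch, its methods and the page ids are all hypothetical, and the tokenizer is deliberately naive (lowercase, split on anything that is not a letter or digit).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: maps each term to the ids of the pages containing it.
public class InvertedIndexSketch {

    private final Map<String, Set<String>> index = new LinkedHashMap<>();

    // Tokenizes already-extracted text and records every term for the given page.
    public void addPage(String pageId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(pageId);
        }
    }

    // Returns the ids of the pages containing the term, or an empty set.
    public Set<String> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addPage("page1", "some text another p tag");
        idx.addPage("page2", "some other text");
        System.out.println(idx.lookup("text"));    // [page1, page2]
        System.out.println(idx.lookup("another")); // [page1]
    }
}
```

A real search index would probably also want stop-word removal, stemming and term positions or frequencies; none of that is attempted here.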

You can find more info here also

Hope this helped