I want to extract the text placed in p and li tags of HTML pages, so that I can tokenize each page and build an inverted index for it in order to answer search queries.
How can I get the p tags using jsoup?
Elements e = doc.select("");
What string should be passed as that parameter?
This will do the job:
Elements e = doc.select("p");
Here is a list of all selectors you can use.
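Since the question asks for both p and li tags, here is a minimal sketch of how that could look, assuming jsoup is on the classpath. The class name and the extractTexts helper are made up for illustration; the comma in the selector means "match any of these".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParagraphAndListText {
    // Hypothetical helper: returns the text of each <p> and <li>, in document order.
    static List<String> extractTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // "p, li" selects every element that is either a p or an li
        for (Element el : doc.select("p, li")) {
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p>first paragraph</p><ul><li>item one</li><li>item two</li></ul>";
        System.out.println(extractTexts(html)); // [first paragraph, item one, item two]
    }
}
```

Note that if a p is nested inside an li, both elements match, so the paragraph text would appear twice in the result.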
Suppose you have this HTML:
String html = "<p>some <strong>bold</strong> text</p>";
To get "some bold text" as the result, you should use:
Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String text = doc.body().text(); // some bold text
or
String text = p.text(); // some bold text
Suppose now you have the following more complex HTML:
String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";
To get the text of the two p tags, you can do something like this:
Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");
String pConcatenated = "";
for (Element x : p) {
    // text() returns the element's text with whitespace normalized and trimmed
    pConcatenated += x.text();
}
System.out.println(pConcatenated); // some textanother p tag
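Because the loop concatenates without a separator, the last word of one paragraph runs into the first word of the next, which matters if the text is tokenized afterwards. A minimal sketch of one way around that, assuming a space is an acceptable separator: select the paragraphs with a descendant selector and join them with a StringJoiner. The class name and the paragraphsUnder helper are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.StringJoiner;

public class JoinParagraphs {
    // Hypothetical helper: joins the text of every <p> under the element with the given id.
    static String paragraphsUnder(String html, String id) {
        Document doc = Jsoup.parse(html);
        StringJoiner joined = new StringJoiner(" ");
        // "#id p" matches every p that is a descendant of the element with that id
        for (Element p : doc.select("#" + id + " p")) {
            joined.add(p.text());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";
        System.out.println(paragraphsUnder(html, "someid")); // some text another p tag
    }
}
```

The span is skipped because the selector only matches p elements inside the div.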
You can also find more info here.
Hope this helped