Get first lines of Wikipedia Article

parsing wikipedia wikipedia-api

theomega · Oct 14, 2009 · Viewed 13.8k times · Source

Answer

You don't need to.

The API's exintro parameter returns only the first (zeroth) section of the article.

Example: api.php?action=query&prop=extracts&exintro&explaintext&titles=Albert%20Einstein

There are other parameters, too:

exchars: Length of extracts in characters.
exsentences: Number of sentences to return.
exintro: Return only the zeroth section.
exsectionformat: What section heading format to use for plain-text extracts:
    wiki: e.g., == Wikitext ==
    plain: no special decoration
    raw: this extension's internal representation
exlimit: Maximum number of extracts to return. Because excerpt generation can be slow, the limit is capped at 20 for intro-only extracts and 1 for whole-page extracts.
explaintext: Return plain-text extracts.
excontinue: When more results are available, use this parameter to continue.

Source: https://www.mediawiki.org/wiki/Extension:MobileFrontend#prop.3Dextracts
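
Since the question mentions Java, here is a minimal sketch of how the request above could be built and sent. The class and method names (WikipediaIntro, buildIntroUrl, fetch) are made up for illustration, the endpoint is assumed to be the English Wikipedia's api.php, and format=json is an added assumption so that the response is machine-readable. It needs Java 11 or later for java.net.http.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class WikipediaIntro {
    // Assumption: English Wikipedia; other wikis expose api.php at a different host/path.
    static final String ENDPOINT = "https://en.wikipedia.org/w/api.php";

    // Builds the query URL for a plain-text intro extract.
    // sentences <= 0 means "do not send exsentences", i.e. return the whole intro.
    static String buildIntroUrl(String title, int sentences) {
        // URLEncoder produces form encoding (space becomes '+'), which is valid in a query string.
        String encodedTitle = URLEncoder.encode(title, StandardCharsets.UTF_8);
        StringBuilder url = new StringBuilder(ENDPOINT)
                .append("?action=query&prop=extracts&exintro&explaintext")
                .append("&format=json") // assumption: JSON output wanted
                .append("&titles=").append(encodedTitle);
        if (sentences > 0) {
            url.append("&exsentences=").append(sentences);
        }
        return url.toString();
    }

    // Sends a GET request and returns the raw response body (JSON text, not parsed here).
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // Placeholder User-Agent; replace with something identifying your client.
                .header("User-Agent", "intro-example/0.1 (illustrative)")
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetch(buildIntroUrl("Albert Einstein", 2)));
    }
}
```

The extract text still has to be pulled out of the returned JSON; that step is left out here because it depends on which JSON library you use.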

Question

I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter which) from it.

The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly for the print version), but how can I find the first lines that are actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.

For example: Albert Einstein on Wikipedia (print version). Look at the code: the first line of real text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wiki source, which starts with the same infobox and so on.

So how would you accomplish this task? The programming language is Java, but this shouldn't matter.

One solution that came to mind was to use an XPath query, but this query would be rather complicated if it had to handle all the edge cases. Update: It wasn't that complicated; see my solution below!

Thanks!
