I have a Wikipedia article and want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter which) from it.
The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly for the print version), but how can I find the first lines that are actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.
For example: Albert Einstein on Wikipedia (print version). Look at the code: the first line of real text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wikitext source, which starts with the same infobox and so on.
So how would you accomplish this task? The programming language is Java, but this shouldn't matter.
One solution that came to mind was to use an XPath query, but such a query would have to be rather complicated to handle all the edge cases. [update]It wasn't that complicated, see my solution below![/update]
Thanks!
You don't need to.
The API's exintro parameter returns only the first (zeroth) section of the article.
Example: api.php?action=query&prop=extracts&exintro&explaintext&titles=Albert%20Einstein
There are other parameters, too:
exchars
Length of extracts in characters.
exsentences
Number of sentences to return.
exintro
Return only zeroth section.
exsectionformat
What section heading format to use for plaintext extracts:
wiki: e.g., == Wikitext ==
plain: no special decoration
raw: this extension's internal representation
exlimit
Maximum number of extracts to return. Because excerpts generation can be slow, the limit is capped at 20 for intro-only extracts and 1 for whole-page extracts.
explaintext
Return plain-text extracts.
excontinue
When more results are available, use this parameter to continue.
Source: https://www.mediawiki.org/wiki/Extension:MobileFrontend#prop.3Dextracts
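Since you mentioned Java, here is a rough sketch of how the request from the example above could be built and sent. It is not a definitive implementation: the en.wikipedia.org endpoint, the format=json parameter, the User-Agent string, and the class and method names (IntroExtractUrl, buildIntroUrl, fetch) are my own assumptions for illustration, and it needs Java 11 or later for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class IntroExtractUrl {
    // Assumption: the English Wikipedia endpoint; other wikis have their own host.
    static final String ENDPOINT = "https://en.wikipedia.org/w/api.php";

    /** Builds a prop=extracts query for the plain-text intro of one article. */
    static String buildIntroUrl(String title, int sentences) {
        // URLEncoder emits form encoding ("+" for a space); switch to "%20"
        // to match the example URL. A literal "+" is already encoded as "%2B".
        String encodedTitle = URLEncoder.encode(title, StandardCharsets.UTF_8)
                .replace("+", "%20");
        StringBuilder url = new StringBuilder(ENDPOINT)
                .append("?action=query&prop=extracts")
                .append("&format=json")          // assumption: JSON output is wanted
                .append("&exintro&explaintext");
        if (sentences > 0) {
            // Optional: limit the intro to the first N sentences.
            url.append("&exsentences=").append(sentences);
        }
        return url.append("&titles=").append(encodedTitle).toString();
    }

    /** Sends the request and returns the raw response body (needs network access). */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "IntroExtractExample/0.1") // hypothetical client name
                .GET()
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        // Only prints the URL; pass it to fetch(...) to actually call the API.
        System.out.println(buildIntroUrl("Albert Einstein", 2));
    }
}
```

Passing 0 for sentences leaves exsentences out, so you get the whole zeroth section. The response body is JSON that you still have to parse with whatever library you prefer to pull out the extract text.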