Get first lines of Wikipedia Article

theomega picture theomega · Oct 14, 2009 · Viewed 13.8k times · Source

I got a Wikipedia-Article and I want to fetch the first z lines (or the first x chars, or the first y words, doesn't matter) from the article.

The problem: I can get either the source Wiki-Text (via API) or the parsed HTML (via direct HTTP-Request, eventually on the print-version) but how can I find the first lines displayed? Normaly the source (both html and wikitext) starts with the info-boxes and images and the first real text to display is somewhere down in the code.

For example: Albert Einstein on Wikipedia (print Version). Look in the code, the first real-text-line "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist." is not on the start. The same applies to the Wiki-Source, it starts with the same info-box and so on.

So how would you accomplish this task? Programming language is java, but this shouldn't matter.

A solution which came to my mind was to use an xpath query but this query would be rather complicated to handle all the border-cases. [update]It wasn't that complicated, see my solution below![/update]

Thanks!

Answer

octosquidopus picture octosquidopus · Nov 5, 2013

You don't need to.

The API's exintro parameter returns only the first (zeroth) section of the article.

Example: api.php?action=query&prop=extracts&exintro&explaintext&titles=Albert%20Einstein

There are other parameters, too:

  • exchars Length of extracts in characters.
  • exsentences Number of sentences to return.
  • exintro Return only zeroth section.
  • exsectionformat What section heading format to use for plaintext extracts:

    wiki — e.g., == Wikitext ==
    plain — no special decoration
    raw — this extension's internal representation
    
  • exlimit Maximum number of extracts to return. Because excerpts generation can be slow, the limit is capped at 20 for intro-only extracts and 1 for whole-page extracts.
  • explaintext Return plain-text extracts.
  • excontinue When more results are available, use this parameter to continue.

Source: https://www.mediawiki.org/wiki/Extension:MobileFrontend#prop.3Dextracts