Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.
For example, the category SQL dump has an entry for each category, but I can't find anything about how the categories relate to each other.
The other dump (latest-pages-articles) lists the parent categories of each page, but in no particular order. It just states all the parents.
I have seen wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists the category ID, not its name. Is there a way to get the name for each ID?
The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.
You're also going to need the page (not pages-articles) dump for the page ID to title mapping.
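
As a rough illustration of how the two dumps fit together, here is a minimal Python sketch. It assumes the classic MediaWiki schema, where categorylinks has the columns cl_from (ID of the member page), cl_to (title of the parent category, without the "Category:" prefix) and cl_type ('page', 'subcat' or 'file'), and where category pages live in namespace 14 of the page table. Newer dumps may lay these columns out differently, so check the CREATE TABLE statement at the top of your dump. The function name and the sample rows are made up for the example, and the rows are assumed to have been parsed out of the SQL dumps already.

```python
from collections import defaultdict

CATEGORY_NAMESPACE = 14  # namespace number MediaWiki uses for category pages


def build_category_hierarchy(page_rows, categorylinks_rows):
    """Return {parent category title: set of subcategory titles}.

    page_rows:          iterable of (page_id, page_namespace, page_title)
    categorylinks_rows: iterable of (cl_from, cl_to, cl_type)
    Both are assumed to be already extracted from the SQL dumps.
    """
    # Map page ID -> title, keeping only category pages.
    id_to_title = {
        page_id: title
        for page_id, namespace, title in page_rows
        if namespace == CATEGORY_NAMESPACE
    }

    children = defaultdict(set)
    for cl_from, cl_to, cl_type in categorylinks_rows:
        if cl_type != "subcat":
            continue  # skip ordinary articles and files
        child_title = id_to_title.get(cl_from)
        if child_title is not None:
            # cl_to is the parent category's title, cl_from is the child's page ID
            children[cl_to].add(child_title)
    return dict(children)


# Hypothetical sample rows, only to show the shape of the data.
sample_pages = [
    (1, 14, "Science"),
    (2, 14, "Physics"),
    (3, 14, "Chemistry"),
    (4, 0, "Isaac_Newton"),  # an article, not a category
]
sample_links = [
    (2, "Science", "subcat"),
    (3, "Science", "subcat"),
    (4, "Physics", "page"),
]

hierarchy = build_category_hierarchy(sample_pages, sample_links)
```

The same page ID to title mapping is what lets you turn an ID-only listing back into category names. Note that the result is a graph rather than a strict tree: a category can have several parents, and cycles are possible.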