Generating plain text from a Wikipedia database dump

Asim · Mar 31, 2014 · Viewed 8.7k times

I found a Python script (here: Wikipedia Extractor) that can generate plain text from an (English) Wikipedia database dump. When I run this command (as stated on the script's page):

$ python enwiki-latest-pages-articles.xml WikiExtractor.py -b 500K -o extracted

I get this error:

File "enwiki-latest-pages-articles.xml", line 1
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
^
SyntaxError: invalid syntax

I'm running the script with Python 2.7.6 under Cygwin on Windows 7.

I hope someone who has already used this script, or who has experience with Python, can help me resolve this error.

Thanks in advance!

Answer

alecxe · Mar 31, 2014

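The first argument to python should be the script name. In your command the XML dump comes first, so Python tries to execute it as source code and stops at line 1 with a SyntaxError.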

You probably need to swap the .xml and .py file names:

$ python WikiExtractor.py enwiki-latest-pages-articles.xml -b 500K -o extracted
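
If you want to reproduce the error without the multi-gigabyte dump, here is a minimal sketch. The helper name fails_as_python is made up for illustration, and the XML line is a shortened version of the one in the traceback. It asks Python to compile a string the same way the interpreter compiles the file it is given as a script; the filename passed to compile() is only a label used in error messages.

```python
def fails_as_python(source, filename):
    """Return True if `source` cannot be compiled as Python code.

    `filename` is only used as a label in the error message; no file is read.
    """
    try:
        compile(source, filename, "exec")
    except SyntaxError:
        return True
    return False


# Shortened first line of the dump (assumption: attributes trimmed for brevity).
xml_first_line = '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" version="0.8" xml:lang="en">'

# XML is not valid Python, so compiling it raises SyntaxError.
print(fails_as_python(xml_first_line, "enwiki-latest-pages-articles.xml"))  # True

# An ordinary Python statement compiles without error.
print(fails_as_python("print('hello')", "WikiExtractor.py"))  # False
```

So the traceback does not point to a problem in the dump or in WikiExtractor.py; it only means the wrong file was handed to the interpreter as the script.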