Where do I get it?
First, DON’T crawl the Wikipedia website. It would take ages, and if you’re still not convinced, Wikipedia has a specific section explaining why crawlers are a bad idea. Instead, Wikipedia lets you download its database dumps. So let’s get our hands dirty and download the data from the torrent link (grab the first highlighted link if you’re not sure). The XML database file is about 64 GB after decompression.
Working with Wikipedia XML dumps
The easiest way to read the data is to import the XML file into a database using the tools Wikipedia provides. Alternatively, if you want to read the data directly from XML, check out basic wikipedia parsing or Attardi’s wikiextractor.
Before we begin, let’s create a database for storing the dump. In this post I use MySQL, but PostgreSQL would be a better fit since the database is quite large.
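The SQL dump we’ll generate contains only INSERT statements, so the standard MediaWiki table definitions have to exist first. A minimal setup might look like this (the database name wikidb is my own choice, and tables.sql comes from the MediaWiki source tree):

```shell
# Create the database; utf8mb4 covers the full range of characters in articles
mysql -u root -p -e "CREATE DATABASE wikidb CHARACTER SET utf8mb4;"

# Load the standard MediaWiki table definitions so the dump has somewhere to go
mysql -u root -p wikidb < maintenance/tables.sql
```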
Next, we’re going to use the Java tool MWDumper to import the XML into the database.
MWDumper reads the XML file and streams out SQL INSERT statements in a pipeline fashion, so you can filter those statements to keep only the data you want. You can also check out MWDumper’s filter options.
In my case, I only want the content of English articles, so the full command I used to import the text table was:
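Something along these lines (the dump filename is whatever you downloaded, and the exact filters depend on which pages you want to keep):

```shell
# Keep only the latest revision of main-namespace (article) pages,
# convert them to SQL, and pipe the statements straight into MySQL
java -jar mwdumper.jar \
    --filter=namespace:0 \
    --filter=latest \
    --format=sql:1.5 \
    enwiki-latest-pages-articles.xml.bz2 \
  | mysql -u root -p wikidb
```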
Parsing Wiki text
Now that we have the data at hand, we can get to the fun part and parse the articles!
Here is example code for connecting to the database and reading the articles. Since MySQL doesn’t stream large result sets well, we have to query the data one partition at a time. PostgreSQL would be a better choice here too, and it has better Python documentation.
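A sketch of that idea, assuming the standard MediaWiki text table (columns old_id and old_text) and the PyMySQL driver; the connection parameters are placeholders you’ll need to replace:

```python
def batch_ranges(max_id, batch_size):
    """Yield (start, end) id ranges so we never pull the whole table at once."""
    start = 1
    while start <= max_id:
        yield start, min(start + batch_size - 1, max_id)
        start += batch_size

def read_articles(batch_size=1000):
    # Imported here so the helper above works without a DB driver installed
    import pymysql

    # Placeholder credentials -- replace with your own
    conn = pymysql.connect(host="localhost", user="root",
                           password="...", database="wikidb",
                           charset="utf8mb4")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT MAX(old_id) FROM text")
            (max_id,) = cur.fetchone()
            # Query one id range at a time instead of the whole table
            for lo, hi in batch_ranges(max_id, batch_size):
                cur.execute(
                    "SELECT old_id, old_text FROM text "
                    "WHERE old_id BETWEEN %s AND %s", (lo, hi))
                for old_id, old_text in cur.fetchall():
                    yield old_id, old_text
    finally:
        conn.close()
```

The id-range approach keeps memory bounded regardless of table size, which matters when the text table holds millions of rows.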
Wikipedia uses a special markup language called wikitext, which changes over time and is not backward compatible. There are a bunch of parsers for it, but wikitextparser seems to me the best one for Python. It supports easy parsing and extraction of templates, sections, lists, and tables. The example code shows how to extract tables and save them to CSV files.
The full code will be available on my GitHub soon. Happy training :-)