Seven and a half years of Evolution

To prepare for our next analysis, I parsed the Evolution page’s entire revision history for individual words added and removed. The first available revision is from December 3, 2001, which makes that just about seven and a half years’ worth of revisions.

Here’s the raw data file, 4.8 MB bzipped, expanding to 76.4 MB. Content format: UTC Timestamp, Revision Id, User, Add/AddStems/Del/DelStems, List of words…

The data includes both words and their stems. The stems are calculated using the Porter stemmer, without semantic context (background reading). Letter case has been preserved since I have no means of distinguishing proper nouns from sentence-initial capitalisation.

To get the list of words, I start with the article’s raw text, strip it of HTML tags, tokenise it into runs of alphanumeric characters to get a stream of words, and then diff that against the previous revision’s word stream (the same algorithm as diff -u on the command line). A word that merely moves will therefore show up as both added and deleted. The tokeniser isn’t perfect: the word “isn’t” is broken up into “isn” and “t”, since the apostrophe doesn’t count as alphanumeric. Suggestions on how to make a better one are appreciated.
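To make the pipeline above concrete, here is a minimal sketch in Python. It is not the code linked from this post: the function names, the regular expressions, and the keep_apostrophes option are all assumptions made for illustration. It also uses difflib.SequenceMatcher from the standard library, which is a different matching algorithm from the one behind diff -u, so the added and deleted lists may differ from the published data in some cases. The optional apostrophe-aware pattern is one possible answer to the tokeniser problem, not a tested fix.

```python
import re
from difflib import SequenceMatcher

# Crude HTML tag stripper (assumption: good enough for a sketch, not a real parser).
TAG_RE = re.compile(r"<[^>]+>")
# Runs of alphanumeric characters; underscores are excluded.
WORD_RE = re.compile(r"[^\W_]+")
# Hypothetical improvement: allow apostrophes *inside* a word, so "isn’t" stays whole.
WORD_APOS_RE = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def tokenise(text, keep_apostrophes=False):
    """Strip HTML tags, then return the list of words, case preserved."""
    text = TAG_RE.sub(" ", text)
    pattern = WORD_APOS_RE if keep_apostrophes else WORD_RE
    return pattern.findall(text)


def word_diff(old_words, new_words):
    """Return (added, deleted) word lists between two word streams."""
    added, deleted = [], []
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(old_words[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])
    return added, deleted


if __name__ == "__main__":
    old = tokenise("<p>Evolution isn’t a theory in crisis</p>")
    new = tokenise("<p>Evolution isn’t a hypothesis in crisis</p>")
    print(word_diff(old, new))
```

A word that moves from one position to another falls outside the matched run at both ends, so it is reported once as deleted and once as added, which matches the behaviour described above.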

Here’s the code if you’d like to try this yourself. You’ll need the other modules in the folder, the NLTK library, and the mwclient library.

Analysis to follow.

  • Yuvi, Jul 5, 2009 1:30:13 PM

    Awesome!

    And thanks for the BSD License :)

    • Kiran Jonnalagadda, Jul 6, 2009 1:34:48 AM

      All my code is BSD licensed. I’m not much of a believer in the GPL these days.

  • Ramkumar Ramachandra, Feb 20, 2010 9:47:03 AM

    Could you tell us what you inferred from the data? The raw data file by itself isn’t very useful.
