Sunday, July 5, 2009
Seven and a half years of Evolution
To prepare our next analysis, I parsed the Evolution page’s entire revision history for individual words added and removed. The first available revision is from December 3, 2001, which makes for just about seven and a half years’ worth of revisions.
Here’s the raw data file, 4.8 MB bzipped, expanding to 76.4 MB. Content format: UTC Timestamp, Revision Id, User, Add/AddStems/Del/DelStems, List of words…
The data includes both words and their stems. The stems are calculated using the Porter stemmer, without semantic context (background reading). Letter case has been preserved since I have no means to distinguish between proper nouns and sentence-beginning capitalisation. To get the list of words, I start with the article’s raw text, strip it of HTML tags, tokenise it into runs of alphanumeric characters to get a stream of words, and then diff that against the previous revision’s word stream (the same algorithm as diff -u on the command line). A word that merely moves will therefore show up as both added and deleted. The tokeniser isn’t perfect: the word “isn’t” is broken up into “isn” and “t” since the apostrophe doesn’t count as alphanumeric. Suggestions on how to make a better one are appreciated.
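For anyone who wants to see the shape of that pipeline without pulling down the full code, here is a minimal sketch using only the Python standard library. It is not the code linked from this post: the names tokenise, word_diff and the keep_apostrophes flag are made up for illustration, the tag stripping is a crude regular expression, and the diff uses difflib.SequenceMatcher, which is a different matching algorithm from the one in command-line diff and may pair words up differently on larger edits. Stemming is left out since it needs NLTK.

```python
import difflib
import re

# Crude HTML tag stripper (assumption: good enough for a sketch, not a parser).
TAG = re.compile(r"<[^>]+>")
# Runs of letters and digits; anything else acts as a separator.
ALNUM = re.compile(r"[^\W_]+")
# Variant that keeps an apostrophe (straight or curly) between alphanumerics,
# so a contraction stays in one piece.
ALNUM_APOS = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def tokenise(text, keep_apostrophes=False):
    """Strip tags and return the text as a list of words, case preserved."""
    text = TAG.sub(" ", text)
    pattern = ALNUM_APOS if keep_apostrophes else ALNUM
    return pattern.findall(text)


def word_diff(old_words, new_words):
    """Return (added, deleted) word lists between two word streams."""
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    added, deleted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    return added, deleted


if __name__ == "__main__":
    old = tokenise("<p>Evolution isn't random.</p>")
    new = tokenise("<p>Evolution is not entirely random.</p>")
    print(word_diff(old, new))
```

With the default setting, tokenise reproduces the problem described above: “isn’t” comes out as “isn” and “t”. Passing keep_apostrophes=True keeps it as a single word, at the cost of also keeping possessives such as “Darwin’s” intact, which may or may not be what you want before stemming.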
Here’s the code if you’d like to try this yourself. You’ll need the other modules in the folder, the NLTK library, and the mwclient library.
Analysis to follow.
Yuvi — Jul 5, 2009 1:30:13 PM
Awesome!
And thanks for the BSD License :)
Kiran Jonnalagadda — Jul 6, 2009 1:34:48 AM
All my code is BSD licensed. I’m not much of a believer in the GPL these days.
Ramkumar Ramachandra — Feb 20, 2010 9:47:03 AM
Could you tell us what you inferred from the data? The raw data file by itself isn’t very useful.
Kiran Jonnalagadda — Feb 24, 2010 9:16:19 AM
We didn’t get to that, Ram.