Tuesday, May 19, 2009
Analysing Wikipedia
Earlier this year I began a project with the Centre for Internet and Society analysing editing behaviour on Wikipedia. We’re hoping to find statistical patterns indicative of pack editing behaviour, and to build tools for editors to counter vandals. My collaborator Hans Varghese Mathews does the heavy statistical analysis. I do the code.
Here’s what we’ve got so far:
The blog
The project’s official blog is at CIS. Current posts include an introduction and a description of how Hans did his first statistical analysis (also available as a PDF rendered from the LaTeX original).
The code
For our first two attempts, we looked at (a) an article’s editors and all the other articles they edited in a given time period, and (b) the words they added to or removed from an article. The code’s written in Python using the mwclient library. Source code for the two analyses is available under the liberal New BSD license.
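To give a flavour of the two ideas, here is a minimal sketch. It is not the project's actual code: it skips mwclient entirely and assumes the edit history and revision texts have already been fetched. The function names (co_edited_articles, word_changes) and the simple lowercase word tokenizer are assumptions made for illustration only.

```python
import re
from collections import Counter


def co_edited_articles(edits, article):
    """Count how many of an article's editors also edited each other article.

    `edits` is an iterable of (editor, article_title) pairs, assumed to be
    already restricted to the time period of interest (hypothetical format).
    """
    pairs = set(edits)  # count each editor/article pair only once
    editors = {editor for editor, title in pairs if title == article}
    return Counter(
        title
        for editor, title in pairs
        if editor in editors and title != article
    )


def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_changes(old_text, new_text):
    """Return (added, removed) word counts between two revision texts.

    This compares bags of words, so it ignores word order and position.
    """
    old_counts = Counter(tokenize(old_text))
    new_counts = Counter(tokenize(new_text))
    added = new_counts - old_counts    # words more frequent in the new text
    removed = old_counts - new_counts  # words more frequent in the old text
    return added, removed
```

For example, given the pairs ("alice", "A"), ("alice", "B"), ("bob", "A"), ("bob", "B") and ("bob", "C"), calling co_edited_articles(edits, "A") counts two shared editors for "B" and one for "C". Comparing bags of words is the simplest possible choice; a real analysis might prefer a positional diff so that a moved sentence is not mistaken for unchanged text.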
Other work
There’s a ton of interesting work around Wikipedia, ranging from edit stream analysis and estimation of article quality to understanding Wikipedia’s influence on popular culture. Here’s my collection of references. We’d like to build on existing work rather than replicate it, so keeping track is important.
What’s next
I’ll describe how the existing two pieces of code work in the following posts, leading up to a third analysis. You can follow the posts in this blog either via the Research category or with the cis-wikipedia tag. Both categories and tags on my blog have their own feeds that you can subscribe to. You’ll see a link to the feed in the right-hand corner of your browser’s address bar.