Entries tagged “analysis”

Seven and a half years of Evolution

To prepare our next analysis, I parsed the Evolution page’s entire revision history for individual words added and removed. The first available revision is from December 3, 2001, which makes that just about seven and a half years’ worth of revisions.

Here’s the raw data file, 4.8 MB bzipped, expanding to 76.4 MB. Content format: UTC Timestamp, Revision Id, User, Add/AddStems/Del/DelStems, List of words…

The data includes both words and their stems. The stems are calculated using the Porter stemmer, without semantic context (background reading). Letter case has been preserved since I have no means to distinguish between proper nouns and sentence-beginning capitalisation. To get the list of words, I start with the article’s raw text, strip it of HTML tags, tokenise it into runs of alphanumeric characters to get a stream of words, and then diff that against the previous revision’s word stream (the same algorithm as diff -u on the command line). A displaced word will thereby show up as both added and deleted. The tokeniser isn’t perfect: the word “isn’t” will be broken up into “isn” and “t” since the apostrophe doesn’t count as alphanumeric. Suggestions on how to make a better one are appreciated.
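For anyone who wants to experiment without the full code, here is a minimal sketch of the tokenise-and-diff step using only the Python standard library. It is not the code linked from this post: the function names are made up for illustration, the tag stripping is a crude regular expression, stemming is left out, and difflib’s SequenceMatcher stands in for the diff -u algorithm, so its output may differ in detail. The second pattern is one possible way to keep “isn’t” in one piece.

```python
import re
from difflib import SequenceMatcher

# Runs of ASCII letters and digits; everything else is a separator.
WORD = re.compile(r"[A-Za-z0-9]+")
# Hypothetical alternative: allow apostrophes between alphanumeric runs.
WORD_WITH_APOSTROPHE = re.compile(r"[A-Za-z0-9]+(?:['’][A-Za-z0-9]+)*")
# Crude tag stripper, good enough for a sketch only.
TAG = re.compile(r"<[^>]+>")


def tokenise(text, pattern=WORD):
    """Strip HTML tags and return the stream of words, case preserved."""
    return pattern.findall(TAG.sub(" ", text))


def word_diff(old_words, new_words):
    """Return (added, deleted) word lists between two word streams."""
    added, deleted = [], []
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    return added, deleted


old = tokenise("The <b>quick</b> brown fox")
new = tokenise("The brown quick fox")
added, deleted = word_diff(old, new)
```

Here the displaced word “brown” ends up in both the added and the deleted list, matching the behaviour described above. Passing `WORD_WITH_APOSTROPHE` as the pattern gives “isn’t” as a single token instead of “isn” and “t”.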

Here’s the code if you’d like to try this yourself. You’ll need the other modules in the folder, the NLTK library, and the mwclient library.

Analysis to follow.

Editors and edits

Hans and I met up this evening to discuss the moving average data I had collected. “But what about editors?” he asked. So I extended it to get that too. Here’s the data. (Hat tip to Vaibhav Bhawsar, who also pointed that out.)

This chart is a visual mess. It’s also close to the limit of how much data I can pass to the Google Chart API, so I’ll need a better system in place for the next round, something that allows zooming in for closer analysis. For what it’s worth, here are the key things about this chart:

  1. The blue Moving Window line is now the sum of the preceding seven-day period, not the average.
  2. The dark gray Editors in Window line is the number of unique editors within each window.
  3. The y-axis labels are off by a little bit. I can’t figure out why they are not properly calibrated.
  4. Edit Count and Editor Count hug each other closely, but have clearly visible differences in the moving window.
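The two window lines described in the list above can be sketched roughly as follows. This is an illustration rather than the code behind the chart: the function name and the sample edits are made up, and I am assuming the seven-day window includes the current day, which the real data may or may not do.

```python
from collections import defaultdict
from datetime import date, timedelta


def window_counts(edits, window_days=7):
    """For each day from the first to the last edit, return
    (day, edits_in_window, editors_in_window) over the window_days
    ending on that day (inclusive, which is an assumption)."""
    by_day = defaultdict(list)
    for day, user in edits:
        by_day[day].append(user)
    if not by_day:
        return []

    rows = []
    day, last = min(by_day), max(by_day)
    while day <= last:
        users = []
        for offset in range(window_days):
            users.extend(by_day.get(day - timedelta(days=offset), []))
        # Sum of edits, and number of unique editors, in the window.
        rows.append((day, len(users), len(set(users))))
        day += timedelta(days=1)
    return rows


# Made-up sample data: (day of edit, editor name).
sample = [
    (date(2009, 6, 1), "A"),
    (date(2009, 6, 1), "B"),
    (date(2009, 6, 3), "A"),
    (date(2009, 6, 9), "C"),
]
rows = window_counts(sample)
```

Each row pairs a day with its edit count and unique editor count for the window, which is the shape of data the blue and dark gray lines are drawn from.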

Edit history of Evolution on Wikipedia

Of the three biggest blue peaks, the first, on June 11 (54 edits), and the third, on March 1 (59 edits), have a large number of editors (27 and 24 respectively), while the second, on August 25 (50 edits), has only 11 editors. You may recall from the last graph that August 25 was the day with the highest number of edits in the year.

Hans thinks that if we render a scatter graph plotting 1/edits-in-window for the x-axis and editors-in-window/edits-in-window for the y-axis, the first and third peaks will show up close to (0,1).
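As a quick check of what those coordinates look like, here is a small sketch that computes the scatter point for the three peaks, using the edit and editor counts quoted above. The function name is made up for illustration.

```python
def scatter_point(edits_in_window, editors_in_window):
    """Return (x, y) = (1/edits, editors/edits) for one window."""
    if edits_in_window <= 0:
        raise ValueError("window must contain at least one edit")
    return 1 / edits_in_window, editors_in_window / edits_in_window


# (edits in window, editors in window) for the three biggest peaks.
peaks = {
    "June 11": (54, 27),
    "August 25": (50, 11),
    "March 1": (59, 24),
}

points = {label: scatter_point(*counts) for label, counts in peaks.items()}
for label, (x, y) in points.items():
    print(f"{label}: x={x:.4f}, y={y:.4f}")
```

All three x values land near zero since every window has fifty or more edits. The y values come out at 0.5 for June 11, 0.22 for August 25 and about 0.41 for March 1, so the first and third peaks do sit well above the second. How the full scatter looks is something only the actual plot will show.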

Assuming this works as predicted, we’re close to building a first level user-facing analysis tool: give it a page and a date range, and it’ll tell you approximately when there was an edit war, for closer inspection using content analysis.

Analysing Wikipedia

Earlier this year I began a project with the Centre for Internet and Society analysing editing behaviour on Wikipedia. We’re hoping to find statistical patterns indicative of pack editing behaviour, and to build tools for editors to counter vandals. My collaborator Hans Varghese Mathews does the heavy statistical analysis. I do the code.

Here’s what we’ve got so far:

The blog

The project’s official blog is at CIS. Current posts include an introduction and a description of how Hans did his first statistical analysis (also available as a PDF rendering from the original in LaTeX).

The code

For our first two attempts, we looked at (a) an article’s editors and all other articles they edited in a given time period, and (b) the words they added or removed from an article. The code is written in Python using the mwclient library. The source code for the two analyses is available under the liberal New BSD license.

Other work

There’s a ton of interesting work around Wikipedia, ranging from edit stream analysis to estimation of article quality to understanding Wikipedia’s influence on popular culture. Here’s my collection of references. We’d like to build on existing work but not replicate it, so keeping track is important.

What’s next

I’ll describe how the existing two pieces of code work in the following posts, leading up to a third analysis. You can follow the posts in this blog either via the Research category or with the cis-wikipedia tag. Both categories and tags on my blog have their own feeds that you can subscribe to. You’ll see a link to the feed in the right-hand corner of your browser’s address bar.