Entries in the Category “Research”

Analysing Wikipedia: caching data

I haven’t posted about Wikipedia in a while. Hans went to Ladakh right after I returned, so we’re only now getting around to analysing the data we collected in July.

Our biggest hassle with doing any kind of analysis is with how long it takes to retrieve data. A full text analysis of a few hundred revisions of a large page could easily take an hour to pull. If that analysis doesn’t produce satisfying results, attempting a variation requires pulling that data all over again, because we have no cache.

I use the mwclient library, which provides a thin wrapper representing MediaWiki queries as lazy (?) Python sequences. Since this sequence could be cached, I’ve been considering strategies (some of this assumes familiarity with mwclient):

  1. Implement a simple Python dictionary cache around the mwclient API, saving the query→result mapping as a pickled dump and consult that before hitting the servers again. This is easy, but since the sequences are lazy, the data isn’t available for caching until the code tries to access it. The cache has to intervene then. All my analysis code must now be written for two API’s, mwclient’s and my cache’s.

  2. Alternatively, do the same thing but as a patch to mwclient’s code so there’s a single external API. This requires understanding how it works and maintaining patches against upstream changes, which takes time away from analysis.

  3. Do it outside. Setup Squid or another caching proxy to cache everything regardless of HTTP headers. Make queries through this. Easy to setup, but grossly inefficient. Proxy servers understand request→response mapping, not sequences. If I ask for a subset of an earlier sequence, it’ll treat it as a new request. Sequences require special treatment:

    • There could be newer edits on the site, making the sequence’s beginning and end markers stale.

    • A new query may ask for overlapping results (typically, a query from a fixed starting point to current time). The cache should be able to join sequences instead of duplicating data.

    • A query may ask for the same time range as an earlier query, but with additional properties (typically, the full text of each revision). These additional properties should be merged into the cache.

  4. Drop this approach altogether and get a static dump of the Wikipedia database. But a full text dump of the entire revision history of the English Wikipedia is 150 terabytes. The resource requirements will take us out of the realm of a hobbyist project.

Given that data retrieval time has become a serious hobble, it seems worth tackling this head-on. A custom cache API could:

  1. Be sequence aware. Treat each MediaWiki article as being a sequence of unknown start and end, of which fragments are available in the cache. Join sequence fragments as data gaps are filled in, leading to one single sequence for the page’s entire revision history.

  2. Store additional properties on each revision. MediaWiki does not store diffs between revisions, but the cache could, since much full text analysis is based around the changes introduced by each revision. Properties could also be flags marking pages as, for example, vandalism, or the following reversion.

  3. Based on the above, store alternate sequences and properties specific to them. For example, a revision sequence of an article that skips all vandal/reversion revisions and stores edit diffs without them. Without this, an editor whose sole contribution was to revert vandalism will come out appearing to have added a lot of new material.

A web service implementing this API will, over time, be able to respond to queries in near real time, making it possible to build a web interface where anyone can submit a query. The public web interface is one of our eventual goals for this project.

I’ll post updates as I work out the technical architecture for this API. I’m considering using one of the newfangled key/value pair databases but have no experience with them. Recommendations are welcome.

Editors and edits

Hans and I met up this evening to discuss the moving average data I had collected. “But what about editors?”, he asked. So I extended it to get that too. Here’s the data. (Hat tip to Vaibhav Bhawsar, who also pointed that out.)

This chart is a visual mess. It’s also close to the limits of how much data I can pass the Google Chart API, so I’ll need a better system in place for the next round, something that allows zooming in for closer analysis. For what it’s worth, here are the key things about this chart:

  1. The blue Moving Window line is now the sum of the preceding seven day period, not the average.
  2. The dark gray Editors in Window line is the number of unique editors within each window.
  3. The y-axis labels are off by a little bit. I can’t figure out why they are not properly calibrated.
  4. Edit Count and Editor Count hug each other closely, but have clearly visible differences in the moving window.

Edit history of Evolution on Wikipedia

If you look at the three biggest blue peaks, the first on June 11 (54) and third on March 1 (59) have a large number of editors (27 and 24), while the peak of August 25 (50) has only 11 editors. You may recall from the last graph that August 25 was the day of the highest number of edits in the year.

Hans thinks that if we render a scatter graph plotting 1/edits-in-window for the x-axis and editors-in-window/edits-in-window for the y-axis, the first and third peaks will show up close to (0,1).

Assuming this works as predicted, we’re close to building a first level user-facing analysis tool: give it a page and a date range, and it’ll tell you approximately when there was an edit war, for closer inspection using content analysis.

Detecting reverts in Wikipedia

Sirtaj Singh Kang asked on Twitter:

Are you able to detect reverts? That would be able to help show edit wars. Perhaps a graph of edit/revert avg?

Short answer: no and yes. Reverts are not marked in the database. They can be detected, but the effort is non-trivial.

MediaWiki has for some time shown a “rollback” link on the latest revision in the history, visible to logged in users. If you click it, MediaWiki will do the following:

  1. Make a new revision of the page copying the content of the revision before the last.
  2. Annotate this new revision as a minor change, with the comment “Reverted edits by (user) to last revision by (earlier user)”.

It’s easy to detect this text pattern and consider that a revert, but not everyone uses the rollback link. From the Evolution page’s history, here’s another revert string: “Reverting possible vandalism by When nine to version by Hubrid Noxx. False positive? Report it. Thanks, ClueBot. (677946) (Bot)”. This string comes from ClueBot. Other admins could use something else. Some may be operating entirely unassisted: if they see two or more incidents of vandalism (thereby making the rollback shortcut unusable), they’ll go back to the last clean edit and save it again. What’ll they put in the change comment field? I don’t know. I can’t guess reverts from human language.

By looking at the revision comments, one could:

  1. Get a reasonable picture of reverts using known automated tools,
  2. Completely miss reverts that didn’t use those tools, and
  3. Pick up false positives from vandals claiming to have performed a revert when they didn’t.

The last one makes this particularly vulnerable to gaming, should a tool built around this method become key to fighting vandalism.

There is another approach: by analysing the text of each revision. If a given revision differs from the earlier revision but matches the one before it (or two revisions earlier), you have a revert. Some forms of vandalism involve a complete replacement of the page’s contents. Others are just an additional phrase or link. Revert detection could be on the basis of either substantial similarity to an earlier revision, or of being exactly the same.

That’s the theory. In practice, pulling the full text of each revision from Wikipedia is painfully slow. I’ve tried this already, looking for what words were added and removed with each revision. Here’s the code. One month of revisions for Evolution takes about ten minutes on my connection. A longer analysis on multiple pages could easily overwhelm resources.

WikiMedia puts out database dumps for download, for exactly this sort of analysis. In this case, the dump we want, enwiki-latest-pages-meta-history.xml.bz2 from this folder, is a cool 147.7G, heavily compressed. Even if I had the bandwidth to download it in anywhere under a month, dealing with a dataset that large will take an entirely different order of code-chops. But I’d like to get there. :)

For now, the hit-or-miss fuzzy comment match is what we have to live with.

Needless to say, my use of the term “vandalism” paints this as far more black-and-white, us-versus-them than it really is. I use it only to simplify the analytical viewpoint.

Charting Wikipedia edits

Hans wanted to calculate a 7-day moving average of edits on any given article across a year. Here’s what it looks like for the Evolution page:

Evolution edit chart

Here’s the data for the chart and source code. Command line invocation:

python 3-moving-average-edits.py Evolution -s 2008-04-25 -e 2009-05-01 -o evolution-1yr.csv

Analysing Wikipedia

Earlier this year I began a project with the Centre for Internet and Society analysing editing behaviour on Wikipedia. We’re hoping to find statistical patterns indicative of pack editing behaviour, and to build tools for editors to counter vandals. My collaborator Hans Varghese Mathews does the heavy statistical analysis. I do the code.

Here’s what we’ve got so far:

The blog

The project’s official blog is at CIS. Current posts include an introduction and a description of how Hans did his first statistical analysis (also available as a PDF rendering from the original in Latex).

The code

For our first two attempts, we looked at (a) an article’s editors and all other articles they edited in a given time period, and (b) the words they added or removed from an article. The code’s written in Python using the mwclient library. Source code for the two analyses, under the liberal New BSD license.

Other work

There’s a ton of interesting work around Wikipedia ranging from edit stream analysis to estimation of article quality to understanding Wikipedia’s influence on popular culture. Here’s my collection of references. We’d like to build on existing work but not replicate it, so keeping track is important.

What’s next

I’ll describe how the existing two pieces of code work in the following posts, leading up to a third analysis. You can follow the posts in this blog either via the Research category or with the cis-wikipedia tag. Both categories and tags on my blog have their own feeds that you can subscribe to. You’ll see a link to the feed in the right-hand corner of your browser’s address bar.