Entries tagged “wikipedia”

Being an outsider

Last evening I sat across from a physicist and a mathematician and watched them discuss clusterings of Wikipedia editors based on edit behaviour. Snatches of familiar but meaningless phrases hit my ears. Markov chains. Undirected graphs. Distances. Eventually the physicist squealed in delight and said she had won a bet with the mathematician. I nodded. Then they said “computationally expensive” and I took my cue and pointed out that for an extended period of revision history, one could take a given revision and consider that editor’s other edits only within a small window rather than across the entire period. That would cut clutter from the dataset and allow long-term analysis. We only need to agree on what the window’s size should be. We could even come up with a way to identify a pair of editors responding to each other, as against working independently to contribute new material or clean up a page.

And thereby having said something intelligent, I sat back and watched their faces again, slipping back into incomprehension. We parted agreeing to keep in touch on the new ideas, but I’m at a loss to tell you exactly what the new ideas are. Their math makes no sense to me, for I’m an outsider: the chap butting his way into a discipline claiming to have some solutions, but with no understanding of the fundamentals.

The previous day I had a most fascinating conversation with one of the presenters at WikiWars, the significance of whose insight was again wasted on me. He talked of Edward Said and Satyajit Ray, of the latter’s biography on Wikipedia, the trouble with too many of the citations referring to a single biographer, and of how that could be understood in the context of Said’s work. He recommends Said’s Culture and Imperialism. I can feel the warmth from a dim bulb glowing somewhere.

He asked about me. I said I’ve spent the last few years in the rural development space. “Fooled around,” is more like it, for I went into the space armed with claims of pioneering web development experience and programming prowess, and found the most intense technical task they had was to install an operating system, open a web browser, point it at a government website, and explain to all parties concerned whose fault it was that the page wasn’t loading. Day-to-day life revolved around the size of the cash float, which investor was willing to fund it, scheduling meetings with the ISP for CEO-to-CEO face-offs on how a screenshot of our bandwidth consumption was insufficient, and visiting the very abrasive government bureaucrat to assure him that I did indeed have top-notch programmers working full time to bring him his daily report. Stick some Python in there to make it all better, will ya?

Which is why, when I met the geeky young man working towards a PhD in agriculture, I begged him to recommend a book that explained all this. There has to be some intelligence in this chaos, but I’m too much of an outsider to spot it.

I’m a programmer, I keep telling myself. I write code. Good code. Fast code. All these people waving their arms and speaking a strange dialect of English need me because, on the internet, code talks like nothing else. I can sit cluelessly around them, bewildered even, knowing that in the end someone will turn to me and ask if I can help.

Conversations move on. An hour later, at another location, the physicist says she’s working on a doctoral thesis. I say that nearly everyone in my life has a PhD or is working on one. I would have been too, if it wasn’t such a long, circuitous route. How am I going to justify trekking all the way through undergrad at this age just to get to the interesting bits? In academia, I’m the ultimate outsider. I’ve never been through any of their systems, turn up as this chap that no one is quite sure how to engage with, and yet have gained entry to more than one of their circuits and even published papers. The geek hat does carry one far.

The geek hat is also suspected. Bangalore’s ruined by the techies, they wail. I’ve been to endless meetings on problems that wouldn’t exist if they used Firefox instead of Internet Explorer, or something as trivial, except the Mozilla Foundation isn’t making an offer to fund a major e-governance project. I keep my mouth shut. People in the habit of routinely shooting at feet will eventually shoot their own, and then they won’t turn up at the next meeting. Suspicion of techies and the biases behind their ideas carries all the way into the realm of the bizarre. At a music concert one evening, this dear old lady, proud of her daughter who wrote for an advertising supplement, didn’t ask what I did. She didn’t want bad news. She simply said “don’t tell me you’re a techie.” A friend jumped to my defence, pointing to the camera and explaining that I was a photographer. I played along, for revealing that you’re a techie generally tends to make life more expensive in these parts, and I was foraying into yet another new discipline. A few years have passed and I’ve clicked much. Today I no longer wield a camera but still wear the geek hat.

At dinner last, the wikipedian from Taiwan made conversation. He had helped launch a minority language Wikipedia that the official system of language Wikipedias wouldn’t recognise and had successfully lobbied for its inclusion. He wanted to interview me for the wikipedians back home. As a local Wikipedia editor, how did I relate to the English language Wikipedia? But wait, me representing the local editors? With just a hundred odd edits on my account when the local chapter had editors with 50,000+ edits? I made the call to another (real) Wikipedian asking if he was in the neighbourhood. He suggested I go ahead anyway since I was a valid rep.

Later still, the Taiwanese wikipedian asked that fatal question: “So, what do you do?” I responded with the one-liner I reserve for such occasions. “I’m a programmer, I write code.” He pointed at my shirt. “You work for Yahoo?” No, I said, “that’s just a conference t-shirt.” I then attempted a weak explanation of my rural development stint.

The truth is, in the eleven plus years of my working life I’ve never worked at a software house, have never attended a computer class, and have no certifications. I wrote code through the ’90s, code and little else, telling everyone I was going to be a “software developer” when I grew up, and ultimately falling out of the academic system. But when it came to going to work, did I do the expected thing and join a software house? No, sir, I went into print publishing. What one does first sets the template, and this one sure did. I’ve put my foot into all manner of disciplines other than computer science, playing the saviour who produces the code, but bearing no certifications. I could afford it because I had put in my 10,000 hours already. After that much exposure, learning becomes automatic and incremental. I haven’t looked at a technology guide book in over a decade because I don’t need to. The book on my bedside today is on law. The one below it on film studies.

An increasingly ragged hat

My expeditions into new disciplines have gotten deeper and longer over the years, but they’ve also taken me farther away from the primary identity I’ve defined for myself. The last major piece of code I wrote was in 2002. Everything since has been relatively minor scripting. My open source code contribution track record is astonishingly sparse. I’ve gained proficiency at just one new programming language in the ’00s, down from five in the ’90s. I regularly encounter bewildering new technical constructs these days. It’s bad enough to feel like retirement.

I’m slowly, but surely, being ejected from the one discipline I considered myself an insider at. What’s one to do?

I suppose this is the part where life gets really interesting.

Analysing Wikipedia: caching data

I haven’t posted about Wikipedia in a while. Hans went to Ladakh right after I returned, so we’re only now getting around to analysing the data we collected in July.

Our biggest hassle with doing any kind of analysis is with how long it takes to retrieve data. A full text analysis of a few hundred revisions of a large page could easily take an hour to pull. If that analysis doesn’t produce satisfying results, attempting a variation requires pulling that data all over again, because we have no cache.

I use the mwclient library, which provides a thin wrapper representing MediaWiki queries as lazy (?) Python sequences. Since this sequence could be cached, I’ve been considering strategies (some of this assumes familiarity with mwclient):

  1. Implement a simple Python dictionary cache around the mwclient API, saving the query→result mapping as a pickled dump and consulting that before hitting the servers again. This is easy, but since the sequences are lazy, the data isn’t available for caching until the code tries to access it. The cache has to intervene then. All my analysis code must now be written for two APIs, mwclient’s and my cache’s.

  2. Alternatively, do the same thing but as a patch to mwclient’s code so there’s a single external API. This requires understanding how it works and maintaining patches against upstream changes, which takes time away from analysis.

  3. Do it outside. Set up Squid or another caching proxy to cache everything regardless of HTTP headers. Make queries through this. Easy to set up, but grossly inefficient. Proxy servers understand request→response mapping, not sequences. If I ask for a subset of an earlier sequence, it’ll treat it as a new request. Sequences require special treatment:

    • There could be newer edits on the site, making the sequence’s beginning and end markers stale.

    • A new query may ask for overlapping results (typically, a query from a fixed starting point to current time). The cache should be able to join sequences instead of duplicating data.

    • A query may ask for the same time range as an earlier query, but with additional properties (typically, the full text of each revision). These additional properties should be merged into the cache.

  4. Drop this approach altogether and get a static dump of the Wikipedia database. But a full text dump of the entire revision history of the English Wikipedia is 150 terabytes. The resource requirements will take us out of the realm of a hobbyist project.
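
As a rough illustration of the first strategy, here is a minimal sketch of a pickled dictionary cache. It does not touch mwclient itself: PickleCache, fetch_revisions and the cache file name are hypothetical, and the fetch function stands in for whatever code consumes the lazy sequence.

```python
import os
import pickle
import tempfile


class PickleCache:
    """Dictionary cache persisted as a pickle file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, 'rb') as f:
                self.data = pickle.load(f)

    def get(self, key, fetch):
        """Return the cached result for key, calling fetch() only on a miss."""
        if key not in self.data:
            # A lazy sequence has to be consumed before it can be pickled.
            self.data[key] = list(fetch())
            with open(self.path, 'wb') as f:
                pickle.dump(self.data, f)
        return self.data[key]


calls = []


def fetch_revisions():
    # Stand-in for a slow query, such as iterating over page.revisions().
    calls.append(1)
    return iter([{'revid': 1, 'user': 'A'}, {'revid': 2, 'user': 'B'}])


cache_path = os.path.join(tempfile.mkdtemp(), 'wiki-cache.pickle')
key = ('Evolution', '2009-05-19T00:00:00Z', '2009-05-19T23:59:59Z')
first = PickleCache(cache_path).get(key, fetch_revisions)
second = PickleCache(cache_path).get(key, fetch_revisions)  # read back from disk
```

The second lookup never calls fetch_revisions, but the key is the whole query, so a query for a subset or an overlapping range still misses. That is the same weakness the proxy approach has.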

Given that data retrieval time has become a serious hobble, it seems worth tackling this head-on. A custom cache API could:

  1. Be sequence aware. Treat each MediaWiki article as being a sequence of unknown start and end, of which fragments are available in the cache. Join sequence fragments as data gaps are filled in, leading to one single sequence for the page’s entire revision history.

  2. Store additional properties on each revision. MediaWiki does not store diffs between revisions, but the cache could, since much full text analysis is based around the changes introduced by each revision. Properties could also be flags marking revisions as, for example, vandalism, or the reversion that follows it.

  3. Based on the above, store alternate sequences and properties specific to them. For example, a revision sequence of an article that skips all vandal/reversion revisions and stores edit diffs without them. Without this, an editor whose sole contribution was to revert vandalism will come out appearing to have added a lot of new material.
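
To make the first two points concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the PageCache class, its method names, and the choice to describe each fragment as a (start, end) pair of UTC timestamp strings, which sort correctly as plain strings because they share one fixed format.

```python
class PageCache:
    """Sequence-aware cache for one page's revisions (hypothetical sketch)."""

    def __init__(self):
        self.revisions = {}  # revid -> merged dictionary of properties
        self.fragments = []  # sorted, non-overlapping (start, end) ranges held

    def add(self, start, end, revisions):
        """Record the result of a query covering start..end."""
        for rev in revisions:
            # Merge new properties (e.g. content) into what is already known.
            self.revisions.setdefault(rev['revid'], {}).update(rev)
        merged = []
        for s, e in sorted(self.fragments + [(start, end)]):
            if merged and s <= merged[-1][1]:
                # Overlaps or touches the previous fragment: join them.
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.fragments = merged

    def covers(self, start, end):
        """True if start..end lies wholly inside one cached fragment."""
        return any(s <= start and end <= e for s, e in self.fragments)


cache = PageCache()
cache.add('2009-05-01T00:00:00Z', '2009-05-10T00:00:00Z',
          [{'revid': 1, 'timestamp': '2009-05-09T10:00:00Z', 'user': 'A'}])
# A later, overlapping query that also asks for the full text.
cache.add('2009-05-08T00:00:00Z', '2009-05-20T00:00:00Z',
          [{'revid': 1, 'timestamp': '2009-05-09T10:00:00Z', 'user': 'A',
            'content': 'full text'},
           {'revid': 2, 'timestamp': '2009-05-15T09:00:00Z', 'user': 'B'}])
```

After the second call the two fragments have joined into one range from May 1 to May 20, and revision 1 carries both its user and its content. The sketch ignores staleness: a fragment that ended at “now” would need its end marker pulled back before it could be trusted.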

A web service implementing this API will, over time, be able to respond to queries in near real time, making it possible to build a web interface where anyone can submit a query. The public web interface is one of our eventual goals for this project.

I’ll post updates as I work out the technical architecture for this API. I’m considering using one of the newfangled key/value pair databases but have no experience with them. Recommendations are welcome.

Seven and a half years of Evolution

To prepare our next analysis, I parsed the Evolution page’s entire revision history for individual words added and removed. The first available revision is from December 3, 2001, making that just about seven and a half years’ worth of revisions.

Here’s the raw data file, 4.8 MB bzipped, expanding to 76.4 MB. Content format: UTC Timestamp, Revision Id, User, Add/AddStems/Del/DelStems, List of words…

The data includes both words and their stems. The stems are calculated using the Porter stemmer, without semantic context (background reading). Letter case has been preserved since I have no means to distinguish between proper nouns and sentence-beginning capitalisation. To get the list of words, I start with the article’s raw text, strip it of HTML tags, tokenise it by alphanumeric characters to get a stream of words, and then diff that against the previous revision’s word stream (the same algorithm as diff -u on the command line). A displaced word will thereby show up as both added and deleted. The tokeniser isn’t perfect: the word “isn’t” will be broken up into “isn” and “t” since the apostrophe doesn’t count as alphanumeric. Suggestions on how to make a better one are appreciated.
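
The tokenise-and-diff step can be sketched roughly as below. This is not the linked code: it skips HTML stripping and stemming, uses Python’s difflib.SequenceMatcher as a stand-in for the diff, and the names tokenise, word_changes and WITH_APOSTROPHES are made up for the example. The second pattern is one possible answer to the apostrophe problem, keeping an apostrophe when it sits between two alphanumeric runs.

```python
import re
from difflib import SequenceMatcher

# Runs of ASCII letters and digits, as described above.
SIMPLE = re.compile(r"[A-Za-z0-9]+")
# Variant that keeps a straight or curly apostrophe inside a word.
WITH_APOSTROPHES = re.compile("[A-Za-z0-9]+(?:['\u2019][A-Za-z0-9]+)*")


def tokenise(text, pattern=SIMPLE):
    """Split text into a list of words."""
    return pattern.findall(text)


def word_changes(old_words, new_words):
    """Return (added, removed) word lists between two word streams."""
    added, removed = [], []
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ('replace', 'delete'):
            removed.extend(old_words[i1:i2])
        if tag in ('replace', 'insert'):
            added.extend(new_words[j1:j2])
    return added, removed


old = tokenise("Evolution is change in the traits of organisms")
new = tokenise("Evolution is the change in inherited traits of organisms")
added, removed = word_changes(old, new)
# added == ['the', 'inherited'], removed == ['the']

split_up = tokenise("isn't")                 # ['isn', 't']
kept = tokenise("isn't", WITH_APOSTROPHES)   # ["isn't"]
```

The word “the” only moved, yet it appears in both lists, which is the displaced-word effect mentioned above.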

Here’s the code if you’d like to try this yourself. You’ll need the other modules in the folder, the NLTK library, and the mwclient library.

Analysis to follow.

Editors and edits

Hans and I met up this evening to discuss the moving average data I had collected. “But what about editors?”, he asked. So I extended it to get that too. Here’s the data. (Hat tip to Vaibhav Bhawsar, who also pointed that out.)

This chart is a visual mess. It’s also close to the limits of how much data I can pass to the Google Chart API, so I’ll need a better system in place for the next round, something that allows zooming in for closer analysis. For what it’s worth, here are the key things about this chart:

  1. The blue Moving Window line is now the sum of the preceding seven day period, not the average.
  2. The dark gray Editors in Window line is the number of unique editors within each window.
  3. The y-axis labels are off by a little bit. I can’t figure out why they are not properly calibrated.
  4. Edit Count and Editor Count hug each other closely, but have clearly visible differences in the moving window.
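
For reference, the two window figures can be computed along these lines. This is a sketch rather than the script behind the chart: window_stats is a hypothetical helper, the sample edits are invented, and treating the window as the seven days ending on (and including) the given day is my assumption.

```python
from datetime import date, timedelta


def window_stats(edits, day, days=7):
    """Return (edit count, unique editor count) for the window ending on day.

    edits is a list of (date, user) pairs.
    """
    first = day - timedelta(days=days - 1)
    in_window = [user for when, user in edits if first <= when <= day]
    return len(in_window), len(set(in_window))


# Invented sample data, not taken from the Evolution page.
edits = [
    (date(2008, 8, 10), 'C'),
    (date(2008, 8, 20), 'A'),
    (date(2008, 8, 24), 'A'),
    (date(2008, 8, 24), 'B'),
    (date(2008, 8, 25), 'A'),
    (date(2008, 8, 25), 'A'),
]
edit_count, editor_count = window_stats(edits, date(2008, 8, 25))
# edit_count == 5, editor_count == 2
```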

Edit history of Evolution on Wikipedia

If you look at the three biggest blue peaks, the first on June 11 (54) and third on March 1 (59) have a large number of editors (27 and 24), while the peak of August 25 (50) has only 11 editors. You may recall from the last graph that August 25 was the day of the highest number of edits in the year.

Hans thinks that if we render a scatter graph plotting 1/edits-in-window for the x-axis and editors-in-window/edits-in-window for the y-axis, the first and third peaks will show up close to (0,1).
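
As a quick check of the arithmetic, here are those coordinates for the three peaks, using only the counts quoted above. The function name scatter_point is made up for the example.

```python
# (edits in window, editors in window) for the three peaks quoted above.
peaks = {
    'June 11': (54, 27),
    'August 25': (50, 11),
    'March 1': (59, 24),
}


def scatter_point(edits, editors):
    """Return (1/edits, editors/edits) for one window."""
    return 1.0 / edits, float(editors) / edits


points = {day: scatter_point(*counts) for day, counts in peaks.items()}
for day, (x, y) in sorted(points.items()):
    print('%-10s x=%.4f y=%.4f' % (day, x, y))
```

On these numbers all three points sit near x = 0. June 11 and March 1 land at y = 0.5 and about 0.41, against 0.22 for August 25.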

Assuming this works as predicted, we’re close to building a first level user-facing analysis tool: give it a page and a date range, and it’ll tell you approximately when there was an edit war, for closer inspection using content analysis.

Detecting reverts in Wikipedia

Sirtaj Singh Kang asked on Twitter:

Are you able to detect reverts? That would be able to help show edit wars. Perhaps a graph of edit/revert avg?

Short answer: no and yes. Reverts are not marked in the database. They can be detected, but the effort is non-trivial.

MediaWiki has for some time shown a “rollback” link on the latest revision in the history, visible to logged in users. If you click it, MediaWiki will do the following:

  1. Make a new revision of the page copying the content of the revision before the last.
  2. Annotate this new revision as a minor change, with the comment “Reverted edits by (user) to last revision by (earlier user)”.

It’s easy to detect this text pattern and consider that a revert, but not everyone uses the rollback link. From the Evolution page’s history, here’s another revert string: “Reverting possible vandalism by When nine to version by Hubrid Noxx. False positive? Report it. Thanks, ClueBot. (677946) (Bot)”. This string comes from ClueBot. Other admins could use something else. Some may be operating entirely unassisted: if they see two or more incidents of vandalism (thereby making the rollback shortcut unusable), they’ll go back to the last clean edit and save it again. What’ll they put in the change comment field? I don’t know. I can’t guess reverts from human language.

By looking at the revision comments, one could:

  1. Get a reasonable picture of reverts using known automated tools,
  2. Completely miss reverts that didn’t use those tools, and
  3. Pick up false positives from vandals claiming to have performed a revert when they didn’t.

The last one makes this particularly vulnerable to gaming, should a tool built around this method become key to fighting vandalism.
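
A comment matcher for the two strings quoted above might look like this sketch. The patterns follow the wording as given in this post. Real comments may carry extra wiki markup around the user names, so treat both patterns and the match_revert helper as assumptions to be tuned against actual data.

```python
import re

REVERT_PATTERNS = [
    # MediaWiki's rollback link, as worded above.
    re.compile(r"^Reverted edits by (?P<reverted>.+?) "
               r"to last revision by (?P<restored>.+)$"),
    # ClueBot, as seen in the Evolution page's history.
    re.compile(r"^Reverting possible vandalism by (?P<reverted>.+?) "
               r"to version by (?P<restored>.+?)\. "),
]


def match_revert(comment):
    """Return (reverted user, restored user) if the comment claims a revert."""
    for pattern in REVERT_PATTERNS:
        m = pattern.match(comment)
        if m:
            return m.group('reverted'), m.group('restored')
    return None


cluebot = match_revert(
    "Reverting possible vandalism by When nine to version by Hubrid Noxx. "
    "False positive? Report it. Thanks, ClueBot. (677946) (Bot)")
rollback = match_revert("Reverted edits by Vandal to last revision by Alice")
ordinary = match_revert("fixed a typo")
```

The matcher only reads the comment, so it cannot tell a genuine rollback from a vandal typing the same sentence, which is the third problem in the list above.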

There is another approach: analysing the text of each revision. If a given revision differs from the one immediately before it but matches the one before that (or one a couple of revisions earlier still), you have a revert. Some forms of vandalism involve a complete replacement of the page’s contents. Others are just an additional phrase or link. Revert detection could be on the basis of either substantial similarity to an earlier revision, or of being exactly the same.
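
The exact-match variant can be sketched as follows, assuming the full text of each revision is already at hand in chronological order. The find_reverts helper and its lookback limit are hypothetical, and the sample texts are invented.

```python
import hashlib


def find_reverts(texts, lookback=5):
    """Return (index, restored index) pairs for exact-match reverts.

    A revision counts as a revert if it differs from the revision just
    before it but is identical to one from 2 to lookback revisions back.
    """
    digests = [hashlib.sha1(t.encode('utf-8')).hexdigest() for t in texts]
    reverts = []
    for i in range(2, len(digests)):
        if digests[i] == digests[i - 1]:
            continue  # nothing changed, so nothing was reverted
        for j in range(i - 2, max(i - lookback, 0) - 1, -1):
            if digests[i] == digests[j]:
                reverts.append((i, j))
                break
    return reverts


texts = [
    'clean',
    'clean plus edit',
    'VANDALISM',
    'clean plus edit',   # reverts one bad revision
    'spam one',
    'spam two',
    'clean plus edit',   # reverts two bad revisions at once
]
reverts = find_reverts(texts)
# reverts == [(3, 1), (6, 3)]
```

Comparing digests keeps memory use low when revisions are large. Detecting substantial similarity rather than identity would need a different comparison, which this sketch does not attempt.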

That’s the theory. In practice, pulling the full text of each revision from Wikipedia is painfully slow. I’ve tried this already, looking for what words were added and removed with each revision. Here’s the code. One month of revisions for Evolution takes about ten minutes on my connection. A longer analysis on multiple pages could easily overwhelm resources.

Wikimedia puts out database dumps for download, for exactly this sort of analysis. In this case, the dump we want, enwiki-latest-pages-meta-history.xml.bz2 from this folder, is a cool 147.7G, heavily compressed. Even if I had the bandwidth to download it in anything under a month, dealing with a dataset that large will take an entirely different order of code-chops. But I’d like to get there. :)

For now, the hit-or-miss fuzzy comment match is what we have to live with.

Needless to say, my use of the term “vandalism” paints this as far more black-and-white, us-versus-them than it really is. I use it only to simplify the analytical viewpoint.

Charting Wikipedia edits

Hans wanted to calculate a 7-day moving average of edits on any given article across a year. Here’s what it looks like for the Evolution page:

Evolution edit chart

Here’s the data for the chart and source code. Command line invocation:

python 3-moving-average-edits.py Evolution -s 2008-04-25 -e 2009-05-01 -o evolution-1yr.csv
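
The averaging step itself is small. Here is a generic sketch, not the linked script: it assumes the edits have already been reduced to one count per day, and the moving_average helper is hypothetical. The first six days are averaged over however many days are available so far, which is also an assumption.

```python
from collections import deque


def moving_average(daily_counts, window=7):
    """Return the trailing moving average for each day's edit count."""
    recent = deque(maxlen=window)  # drops the oldest day automatically
    averages = []
    for count in daily_counts:
        recent.append(count)
        averages.append(sum(recent) / float(len(recent)))
    return averages


averages = moving_average([7, 0, 0, 0, 0, 0, 0, 7])
# averages[0] == 7.0, averages[6] == 1.0, averages[7] == 1.0
```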

Querying Wikipedia with mwclient

mwclient is a library for accessing the MediaWiki API from Python. MediaWiki powers Wikipedia and a bunch of other wikis. In this quick guide, we’ll look at how we can use mwclient to query any MediaWiki-powered site for the information we want.

Installing mwclient

As of this writing, the 0.6.2 release of mwclient does not include an installer and isn’t available in the Python Package Index, so installation is a bit of a chore. Grab the latest release from the downloads page; it should uncompress to reveal an mwclient folder. Copy this folder to your Python’s site-packages folder. If you don’t know where that is, type this at the command line:

python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())"

The following locations are typical:

/usr/lib/python2.x/site-packages
/var/lib/python-support/python2.x
/Library/Python/2.x/site-packages

Launch Python and confirm it’s installed:

>>> import mwclient

If that didn’t raise any errors, congratulations! You’re all set to go.

Using mwclient

Here’s how you connect to Wikipedia and ask for revisions of the Wikipedia:Sandbox page:

>>> import mwclient
>>> from pprint import pprint
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Wikipedia:Sandbox']
>>> revisions = page.revisions()
>>> for counter in range(5):
...     rev = revisions.next()
...     pprint(rev)
... 
{'revid': 290932490,
 'timestamp': (2009, 5, 19, 12, 43, 13, 1, 139, -1),
 'user': 'Benlisquare'}
{'anon': '',
 'revid': 290930263,
 'timestamp': (2009, 5, 19, 12, 29, 23, 1, 139, -1),
 'user': '62.254.235.147'}
{'anon': '',
 'revid': 290930082,
 'timestamp': (2009, 5, 19, 12, 28, 16, 1, 139, -1),
 'user': '166.216.160.16'}
{'comment': 'Clearing the sandbox ([[WP:BOT|BOT]] EDIT)',
 'revid': 290927544,
 'timestamp': (2009, 5, 19, 12, 10, 6, 1, 139, -1),
 'user': 'SoxBot'}
{'anon': '',
 'revid': 290927187,
 'timestamp': (2009, 5, 19, 12, 7, 29, 1, 139, -1),
 'user': '62.254.235.147'}

Compare the output you get with the page’s revision history on Wikipedia. They should match.

Calling page.revisions() gives us a generator that returns revisions in reverse chronological order, with the most recent edit first. Each revision is a dictionary containing the keys you see above. The optional anon key indicates an anonymous edit; user then contains the editor’s IP address instead of user name. All keys and string values will be Unicode strings.

To get all edits between two dates in the forward direction, with the text content of each revision, do this:

>>> revisions = page.revisions(start='2009-05-19T00:00:00Z',
...                            end='2009-05-19T23:59:59Z',
...                            dir='newer',
...                            prop='ids|timestamp|flags|comment|user|content')

And here’s how to get all the edits of any given user. Let’s look at SoxBot from the revisions above:

>>> contribs = site.usercontributions(u'SoxBot')
>>> for counter in range(2):
...     rev = contribs.next()
...     pprint(rev)
... 
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 17244650,
 'revid': 290942689,
 'timestamp': (2009, 5, 19, 13, 44, 26, 1, 139, -1),
 'title': 'User talk:Twinzor',
 'top': '',
 'user': 'SoxBot'}
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 21352732,
 'revid': 290942678,
 'timestamp': (2009, 5, 19, 13, 44, 23, 1, 139, -1),
 'title': 'User talk:Turco85',
 'top': '',
 'user': 'SoxBot'}

Notes

  1. MediaWiki timestamp strings can be generated using "%Y-%m-%dT%H:%M:%SZ" as format string with Python’s datetime.strftime. All timestamps must be in UTC.

  2. You can pass a combination of parameters to page.revisions() to get revisions the way you want them. You can even skip the dates and call with startid or endid = any revision number (see revid in the output), to retrieve revisions before or after that one.

  3. To look at what parameters the page.revisions() and site.usercontributions() functions take, use Python’s built-in help browser:

    >>> help(page.revisions)
    >>> help(site.usercontributions)
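
As an illustration of the first note, this small sketch converts between Python datetime objects and MediaWiki timestamp strings using the format string given there. The helper names are made up, and the datetime values are assumed to be in UTC already.

```python
from datetime import datetime

MW_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def to_mw_timestamp(dt):
    """Format a datetime (assumed to be UTC) as a MediaWiki timestamp."""
    return dt.strftime(MW_FORMAT)


def from_mw_timestamp(text):
    """Parse a MediaWiki timestamp string into a naive datetime."""
    return datetime.strptime(text, MW_FORMAT)


start = to_mw_timestamp(datetime(2009, 5, 19, 0, 0, 0))
end = to_mw_timestamp(datetime(2009, 5, 19, 23, 59, 59))
# start == '2009-05-19T00:00:00Z', end == '2009-05-19T23:59:59Z'
```

These are the same two strings passed as start and end in the page.revisions() example earlier.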
    

Hope that’s enough to get you started. In subsequent posts I’ll explain how we can use this to analyse Wikipedia revision history.

Analysing Wikipedia

Earlier this year I began a project with the Centre for Internet and Society analysing editing behaviour on Wikipedia. We’re hoping to find statistical patterns indicative of pack editing behaviour, and to build tools for editors to counter vandals. My collaborator Hans Varghese Mathews does the heavy statistical analysis. I do the code.

Here’s what we’ve got so far:

The blog

The project’s official blog is at CIS. Current posts include an introduction and a description of how Hans did his first statistical analysis (also available as a PDF rendering from the original in LaTeX).

The code

For our first two attempts, we looked at (a) an article’s editors and all other articles they edited in a given time period, and (b) the words they added or removed from an article. The code’s written in Python using the mwclient library. Source code for the two analyses, under the liberal New BSD license.

Other work

There’s a ton of interesting work around Wikipedia ranging from edit stream analysis to estimation of article quality to understanding Wikipedia’s influence on popular culture. Here’s my collection of references. We’d like to build on existing work but not replicate it, so keeping track is important.

What’s next

I’ll describe how the existing two pieces of code work in the following posts, leading up to a third analysis. You can follow the posts in this blog either via the Research category or with the cis-wikipedia tag. Both categories and tags on my blog have their own feeds that you can subscribe to. You’ll see a link to the feed in the right-hand corner of your browser’s address bar.