Friday, May 22, 2009
Detecting reverts in Wikipedia
Sirtaj Singh Kang asked on Twitter:
Are you able to detect reverts? That would be able to help show edit wars. Perhaps a graph of edit/revert avg?
Short answer: no and yes. Reverts are not marked in the database. They can be detected, but the effort is non-trivial.
MediaWiki has for some time shown a “rollback” link on the latest revision in the history, visible to logged-in users who have rollback rights. If you click it, MediaWiki will do the following:
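- Make a new revision of the page, copying the content of the last revision made by someone other than the most recent editor (so all of that editor’s consecutive edits are undone in one go).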
- Annotate this new revision as a minor change, with the comment “Reverted edits by (user) to last revision by (earlier user)”.
It’s easy to detect this text pattern and consider that a revert, but not everyone uses the rollback link. From the Evolution page’s history, here’s another revert string: “Reverting possible vandalism by When nine to version by Hubrid Noxx. False positive? Report it. Thanks, ClueBot. (677946) (Bot)”. This string comes from ClueBot. Other admins could use something else. Some may be operating entirely unassisted: if they see two or more incidents of vandalism (thereby making the rollback shortcut unusable), they’ll go back to the last clean edit and save it again. What’ll they put in the change comment field? I don’t know. I can’t guess reverts from human language.
By looking at the revision comments, one could:
- Get a reasonable picture of reverts using known automated tools,
- Completely miss reverts that didn’t use those tools, and
- Pick up false positives from vandals claiming to have performed a revert when they didn’t.
The last one makes this particularly vulnerable to gaming, should a tool built around this method become key to fighting vandalism.
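As a rough sketch of the comment-matching idea, here is some Python that recognises the two comment strings quoted above. The patterns and the function name `match_revert_comment` are my own illustration, built only from those two examples. Real comments may carry wiki markup or different wording, and a username containing a full stop would be cut short by the ClueBot pattern.

```python
import re

# Illustrative patterns only, derived from the two comment strings quoted
# in this post. They are assumptions, not a list of what tools really emit.
REVERT_PATTERNS = [
    # Rollback link: "Reverted edits by X to last revision by Y"
    re.compile(
        r"^Reverted edits by (?P<reverted>.+?) "
        r"to last revision by (?P<restored>.+?)\.?$"
    ),
    # ClueBot: "Reverting possible vandalism by X to version by Y. ..."
    re.compile(
        r"^Reverting possible vandalism by (?P<reverted>.+?) "
        r"to version by (?P<restored>.+?)\."
    ),
]


def match_revert_comment(comment):
    """Return (reverted_user, restored_user) if the comment looks like a
    known automated revert, otherwise None."""
    for pattern in REVERT_PATTERNS:
        m = pattern.search(comment or "")
        if m:
            return m.group("reverted"), m.group("restored")
    return None
```

Anything that returns `None` here is simply "unknown", not "not a revert", and a match only tells you what the comment claims, which is exactly the gaming problem described above.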
There is another approach: analysing the text of each revision. If a given revision differs from the one immediately before it but matches an earlier one (say, two or three revisions back), you have a revert. Some forms of vandalism involve a complete replacement of the page’s contents. Others are just an additional phrase or link. Revert detection could be based either on an exact match with an earlier revision or on substantial similarity to one.
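A minimal sketch of the text-based approach, assuming the revision texts are already in hand as a chronological list of strings. The function names, the look-back `window` of 5 and the choice of SHA-1 hashing are my own assumptions for illustration. For the "substantial similarity" variant, the standard library's `difflib` gives a 0-to-1 ratio, though where to set the threshold is an open question.

```python
import difflib
import hashlib


def find_identity_reverts(revision_texts, window=5):
    """Return (index, restored_index) pairs where a revision's text is
    identical to an earlier, non-adjacent revision.

    revision_texts: revision texts in chronological order.
    window: how many revisions back to look (an arbitrary choice).
    """
    # Hash each text once so comparisons are cheap.
    digests = [
        hashlib.sha1(text.encode("utf-8")).hexdigest()
        for text in revision_texts
    ]
    reverts = []
    for i, digest in enumerate(digests):
        if i > 0 and digest == digests[i - 1]:
            continue  # unchanged from its predecessor: not a revert
        # Search from two revisions back, nearest first.
        for j in range(i - 2, max(i - window, 0) - 1, -1):
            if digests[j] == digest:
                reverts.append((i, j))
                break
    return reverts


def similarity(text_a, text_b):
    """Similarity ratio between two texts (1.0 means identical)."""
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()
```

For example, given the texts `["A", "A vandal", "A"]`, the third revision (index 2) is reported as restoring the first (index 0).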
That’s the theory. In practice, pulling the full text of each revision from Wikipedia is painfully slow. I’ve tried this already, looking for what words were added and removed with each revision. Here’s the code. One month of revisions for Evolution takes about ten minutes on my connection. A longer analysis on multiple pages could easily overwhelm resources.
Wikimedia puts out database dumps for download, for exactly this sort of analysis. In this case, the dump we want, enwiki-latest-pages-meta-history.xml.bz2 from this folder, is a cool 147.7 GB, heavily compressed. Even if I had the bandwidth to download it in under a month, dealing with a dataset that large would take an entirely different order of code-chops. But I’d like to get there. :)
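For what it’s worth, a dump like this doesn’t have to be decompressed or loaded in one go. Below is a sketch of reading it as a stream with Python’s standard library. I haven’t run it against the full dump. The function names are mine, and I’m assuming the usual export layout of `page` elements containing a `title` and a series of `revision` elements with `id`, `comment` and `text` children.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(dump_path):
    """Yield (page_title, revision_id, comment, text) tuples one at a
    time from a bz2-compressed XML dump."""
    title = None
    with bz2.open(dump_path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "revision":
                # Direct children only, so this is the revision's own id,
                # not the contributor's.
                fields = {local_name(c.tag): c.text for c in elem}
                yield (
                    title,
                    fields.get("id"),
                    fields.get("comment"),
                    fields.get("text") or "",
                )
                elem.clear()  # drop the revision text we are done with
            elif name == "page":
                elem.clear()
```

Clearing each element after use releases the bulky revision text, though the emptied elements still accumulate under the root, so a real run over the whole file would probably need more care than this.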
For now, the hit-or-miss fuzzy comment match is what we have to live with.
Needless to say, my use of the term “vandalism” paints this as far more black-and-white, us-versus-them than it really is. I use it only to simplify the analytical viewpoint.