Archive for May 2009

Two Bits in EPub format

Chris Kelty’s book, Two Bits: The Cultural Significance of Free Software, is available as a free download in PDF and HTML formats. Neither version is suitable for reading on a handheld or ebook reader, so I’ve made an EPub version. EPub is an XHTML-based, self-contained, single-file document format. Download here. This is how I made it from the HTML version:

Read on...

Hypothesis: edit wars have a lot of edits, but few editors

This is a post by guest blogger Guillaume Marceau from http://gmarceau.qc.ca/blog/

Chart of edit wars over the Wikipedia article on evolution

Click on the chart to see the full-sized version.

The essence of making charts is to make use of our eyes’ fantastic ability to compare amounts across a page. In fact, our eyes are so good at these kinds of comparison that a chart will often read better if two data sets are placed next to each other (aligned with a common axis) rather than overlaid. Edward Tufte popularized this idea in his book Envisioning Information by calling it a ‘small multiples’ design. This simple idea is surprisingly versatile. For instance, it can inform the design of a user interface.

A well-designed chart should read like a book so that it may become an integral part of the narrative in support of the argument. In the chart above, the labels are disposed in normal reading order, from left to right, top to bottom. As we read, we understand the nature of the computation that was applied, and the significance of the three humps in the last chart.

Changes committed to Zine upstream

I’ve pushed my patches to Zine to the upstream repository, making for a total of 45 changesets with 106 changes to 50 files (as per Mercurial). The main repository is here.

If you’ve been working with my repository though, you should stick with it because there is one fundamental incompatibility. My version stores the extra field on Post and User objects as JSON. The main repo stores it as a Python pickle. Armin and I have agreed that while JSON is the way to go forward, we should switch via a database migration framework, which Zine doesn’t have yet.

Until that framework is in place and the merge happens, I’ll continue to sync changesets across the two repositories.

Editors and edits

Hans and I met up this evening to discuss the moving average data I had collected. “But what about editors?”, he asked. So I extended it to get that too. Here’s the data. (Hat tip to Vaibhav Bhawsar, who also pointed that out.)

This chart is a visual mess. It’s also close to the limits of how much data I can pass to the Google Chart API, so I’ll need a better system in place for the next round, something that allows zooming in for closer analysis. For what it’s worth, here are the key things about this chart:

  1. The blue Moving Window line is now the sum of the preceding seven day period, not the average.
  2. The dark gray Editors in Window line is the number of unique editors within each window.
  3. The y-axis labels are off by a little bit. I can’t figure out why they are not properly calibrated.
  4. Edit Count and Editor Count hug each other closely, but have clearly visible differences in the moving window.
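
Here’s a rough sketch of how the two window lines could be computed from a list of edits. It is not the script behind the chart: the function name `window_stats` and the sample data are made up for illustration, and it assumes the seven-day window includes the day it ends on.

```python
from datetime import date, timedelta


def window_stats(edits, day, window_days=7):
    """Return (edit count, unique editor count) for the window ending on `day`.

    `edits` is a list of (date, user) pairs. The window covers `day` and the
    `window_days - 1` days before it (an assumption, see the text above).
    """
    start = day - timedelta(days=window_days - 1)
    in_window = [(d, user) for d, user in edits if start <= d <= day]
    editors = set(user for _, user in in_window)
    return len(in_window), len(editors)


# Made-up sample data: (date of edit, editor name)
sample_edits = [
    (date(2008, 8, 10), "carol"),   # falls outside the window used below
    (date(2008, 8, 20), "alice"),
    (date(2008, 8, 24), "bob"),
    (date(2008, 8, 25), "alice"),
    (date(2008, 8, 25), "alice"),
]

print(window_stats(sample_edits, date(2008, 8, 25)))
```

Many edits by few editors show up as a large first number next to a small second one.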

Edit history of Evolution on Wikipedia

If you look at the three biggest blue peaks, the first on June 11 (54) and third on March 1 (59) have a large number of editors (27 and 24), while the peak of August 25 (50) has only 11 editors. You may recall from the last graph that August 25 was the day of the highest number of edits in the year.

Hans thinks that if we render a scatter graph plotting 1/edits-in-window for the x-axis and editors-in-window/edits-in-window for the y-axis, the first and third peaks will show up close to (0,1).
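
To make the proposed axes concrete, this sketch computes the two coordinates for the three peaks quoted above. The helper name `scatter_point` is hypothetical; the edit and editor counts are the ones read off the chart.

```python
def scatter_point(edits_in_window, editors_in_window):
    """Return (x, y) where x = 1/edits and y = editors/edits for one window."""
    x = 1.0 / edits_in_window
    y = float(editors_in_window) / edits_in_window
    return x, y


# (edits in window, editors in window) for the three biggest peaks
peaks = [
    ("June 11", 54, 27),
    ("August 25", 50, 11),
    ("March 1", 59, 24),
]

for label, edits, editors in peaks:
    x, y = scatter_point(edits, editors)
    print("%-10s x=%.3f y=%.3f" % (label, x, y))
```

Of the three, the August 25 window has the lowest editors-to-edits ratio, which is the few-editors, many-edits pattern we are looking for.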

Assuming this works as predicted, we’re close to building a first level user-facing analysis tool: give it a page and a date range, and it’ll tell you approximately when there was an edit war, for closer inspection using content analysis.

Detecting reverts in Wikipedia

Sirtaj Singh Kang asked on Twitter:

Are you able to detect reverts? That would be able to help show edit wars. Perhaps a graph of edit/revert avg?

Short answer: no and yes. Reverts are not marked in the database. They can be detected, but the effort is non-trivial.

MediaWiki has for some time shown a “rollback” link on the latest revision in the history, visible to logged-in users. If you click it, MediaWiki will do the following:

  1. Make a new revision of the page copying the content of the revision before the last.
  2. Annotate this new revision as a minor change, with the comment “Reverted edits by (user) to last revision by (earlier user)”.

It’s easy to detect this text pattern and consider that a revert, but not everyone uses the rollback link. From the Evolution page’s history, here’s another revert string: “Reverting possible vandalism by When nine to version by Hubrid Noxx. False positive? Report it. Thanks, ClueBot. (677946) (Bot)”. This string comes from ClueBot. Other admins could use something else. Some may be operating entirely unassisted: if they see two or more incidents of vandalism (thereby making the rollback shortcut unusable), they’ll go back to the last clean edit and save it again. What’ll they put in the change comment field? I don’t know. I can’t guess reverts from human language.

By looking at the revision comments, one could:

  1. Get a reasonable picture of reverts using known automated tools,
  2. Completely miss reverts that didn’t use those tools, and
  3. Pick up false positives from vandals claiming to have performed a revert when they didn’t.

The last one makes this particularly vulnerable to gaming, should a tool built around this method become key to fighting vandalism.
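
A minimal sketch of the comment-matching approach, with all its limitations. The two patterns are assumptions modelled only on the two strings quoted above; real comments vary (wiki markup, other bots, other wording), and `match_revert` is a hypothetical helper name.

```python
import re

# Patterns modelled on the two revert comments quoted in this post.
# They are guesses at the general shape, not an exhaustive list.
REVERT_PATTERNS = [
    # The rollback link
    re.compile(r"^Reverted edits by (?P<reverted>.+?) "
               r"to last revision by (?P<restored>.+)$"),
    # ClueBot; the restored user name is assumed to end at the first full stop
    re.compile(r"^Reverting possible vandalism by (?P<reverted>.+?) "
               r"to version by (?P<restored>.+?)\."),
]


def match_revert(comment):
    """Return (reverted user, restored user) if the comment looks like a
    revert, else None."""
    for pattern in REVERT_PATTERNS:
        found = pattern.search(comment)
        if found:
            return found.group("reverted"), found.group("restored")
    return None


cluebot = ("Reverting possible vandalism by When nine to version by "
           "Hubrid Noxx. False positive? Report it. Thanks, ClueBot. "
           "(677946) (Bot)")
print(match_revert(cluebot))
print(match_revert("fixed a typo"))
```

Anything typed by hand into the comment field passes through or slips past this unchecked, which is exactly the gaming problem described above.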

There is another approach: analysing the text of each revision. If a given revision differs from the previous revision but matches the one before that (or one two revisions earlier), you have a revert. Some forms of vandalism involve a complete replacement of the page’s contents. Others are just an additional phrase or link. Revert detection could be on the basis of either substantial similarity to an earlier revision, or of being exactly the same.
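
The exact-match variant can be sketched by hashing each revision’s text and looking for a hash that has been seen before. This is an illustration rather than tested project code: `find_exact_reverts` is a hypothetical name, it matches against any earlier revision rather than only the last two or three, and it does nothing for the “substantial similarity” case.

```python
import hashlib


def find_exact_reverts(texts):
    """Given revision texts in chronological order, return a list of
    (index, index of the latest earlier revision with identical text).

    A revision identical to the one immediately before it is treated as a
    null edit, not a revert.
    """
    last_seen = {}          # digest -> index of the latest revision with it
    previous_digest = None
    reverts = []
    for index, text in enumerate(texts):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest != previous_digest and digest in last_seen:
            reverts.append((index, last_seen[digest]))
        last_seen[digest] = index
        previous_digest = digest
    return reverts


history = [
    "clean text",
    "clean text plus spam link",   # vandalism
    "clean text",                  # revert to revision 0
    "PAGE BLANKED",                # vandalism
    "PAGE BLANKED again",          # more vandalism
    "clean text",                  # revert to revision 2
]
print(find_exact_reverts(history))
```

Storing only digests keeps memory use small, but it still needs the full text of every revision to be fetched once.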

That’s the theory. In practice, pulling the full text of each revision from Wikipedia is painfully slow. I’ve tried this already, looking for what words were added and removed with each revision. Here’s the code. One month of revisions for Evolution takes about ten minutes on my connection. A longer analysis on multiple pages could easily overwhelm resources.

Wikimedia puts out database dumps for download, for exactly this sort of analysis. In this case, the dump we want, enwiki-latest-pages-meta-history.xml.bz2 from this folder, is a cool 147.7G, heavily compressed. Even if I had the bandwidth to download it in anywhere under a month, dealing with a dataset that large will take an entirely different order of code-chops. But I’d like to get there. :)

For now, the hit-or-miss fuzzy comment match is what we have to live with.

Needless to say, my use of the term “vandalism” paints this as far more black-and-white, us-versus-them than it really is. I use it only to simplify the analytical viewpoint.

Charting Wikipedia edits

Hans wanted to calculate a 7-day moving average of edits on any given article across a year. Here’s what it looks like for the Evolution page:

Evolution edit chart

Here’s the data for the chart and source code. Command line invocation:

python 3-moving-average-edits.py Evolution -s 2008-04-25 -e 2009-05-01 -o evolution-1yr.csv
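
For readers who want the gist without reading the script, here is a rough sketch of the calculation. It is not the contents of 3-moving-average-edits.py: the function name `moving_average` is made up, it works on a plain list of edit dates rather than querying Wikipedia, and it assumes a trailing window that counts missing days as zero edits.

```python
from collections import Counter
from datetime import date, timedelta


def moving_average(edit_dates, start, end, window_days=7):
    """Return a list of (day, edits that day, trailing average) rows for
    every day from `start` to `end` inclusive."""
    counts = Counter(edit_dates)   # missing days count as zero
    rows = []
    day = start
    while day <= end:
        total = sum(counts[day - timedelta(days=offset)]
                    for offset in range(window_days))
        rows.append((day, counts[day], total / float(window_days)))
        day += timedelta(days=1)
    return rows


# Made-up data: 7 edits on April 25 and 14 edits on April 27
sample_dates = [date(2009, 4, 25)] * 7 + [date(2009, 4, 27)] * 14

for day, edits, average in moving_average(sample_dates,
                                          date(2009, 4, 25),
                                          date(2009, 4, 27)):
    print("%s,%d,%.2f" % (day.isoformat(), edits, average))
```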

Map of scripts

The SIL home page has a hand-drawn map of scripts around the world. It’s interesting how South and Southeast Asia have the highest density of variations.

Scripts around the world

SIL hosts the Open Font License, a key component of getting to an era of high quality typography on the web using embedded fonts, without resorting to kludges like sIFR. More on that in a bit.

Querying Wikipedia with mwclient

mwclient is a library for accessing the MediaWiki API from Python. MediaWiki powers Wikipedia and a bunch of other wikis. In this quick guide, we’ll look at how we can use mwclient to query any MediaWiki-powered site for the information we want.

Installing mwclient

As of this writing, the 0.6.2 release of mwclient does not include an installer and isn’t available in the Python Package Index, so installation is a bit of a chore. Grab the latest release from the downloads page; it should uncompress to reveal an mwclient folder. Copy this folder to your Python’s site-packages folder. If you don’t know where that is, type this at the command line:

python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())"

The following locations are typical:

/usr/lib/python2.x/site-packages
/var/lib/python-support/python2.x
/Library/Python/2.x/site-packages

Launch Python and confirm it’s installed:

>>> import mwclient

If that didn’t raise any errors, congratulations! You’re all set to go.

Using mwclient

Here’s how you connect to Wikipedia and ask for revisions of the Wikipedia:Sandbox page:

>>> import mwclient
>>> from pprint import pprint
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Wikipedia:Sandbox']
>>> revisions = page.revisions()
>>> for counter in range(5):
...     rev = revisions.next()
...     pprint(rev)
... 
{'revid': 290932490,
 'timestamp': (2009, 5, 19, 12, 43, 13, 1, 139, -1),
 'user': 'Benlisquare'}
{'anon': '',
 'revid': 290930263,
 'timestamp': (2009, 5, 19, 12, 29, 23, 1, 139, -1),
 'user': '62.254.235.147'}
{'anon': '',
 'revid': 290930082,
 'timestamp': (2009, 5, 19, 12, 28, 16, 1, 139, -1),
 'user': '166.216.160.16'}
{'comment': 'Clearing the sandbox ([[WP:BOT|BOT]] EDIT)',
 'revid': 290927544,
 'timestamp': (2009, 5, 19, 12, 10, 6, 1, 139, -1),
 'user': 'SoxBot'}
{'anon': '',
 'revid': 290927187,
 'timestamp': (2009, 5, 19, 12, 7, 29, 1, 139, -1),
 'user': '62.254.235.147'}

Compare the output you get with the page’s revision history on Wikipedia. They should match.

Calling page.revisions() gives us a generator that returns revisions in reverse chronological order, with the most recent edit first. Each revision is a dictionary containing the keys you see above. The optional anon key indicates an anonymous edit; user then contains the editor’s IP address instead of user name. All keys and string values will be Unicode strings.
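
The timestamp tuples are standard Python time tuples, so the first six fields are year, month, day, hour, minute and second. Here is a small sketch that turns one of the dictionaries above into something friendlier. The helper name `summarise` is made up, and the sample is copied from the output shown earlier rather than fetched live. Note that `comment` also appears to be absent when empty, so the sketch treats it as optional.

```python
from datetime import datetime


def summarise(rev):
    """Normalise a revision dictionary of the shape shown above."""
    return {
        'revid': rev['revid'],
        'when': datetime(*rev['timestamp'][:6]),   # UTC
        'user': rev['user'],
        'anonymous': 'anon' in rev,                # key present only for IP edits
        'comment': rev.get('comment', ''),
    }


# Copied from the sample output above
sample = {'anon': '',
          'revid': 290930263,
          'timestamp': (2009, 5, 19, 12, 29, 23, 1, 139, -1),
          'user': '62.254.235.147'}

summary = summarise(sample)
print(summary['when'].isoformat())
print(summary['anonymous'])
```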

To get all edits between two dates in the forward direction, with the text content of each revision, do this:

>>> revisions = page.revisions(start='2009-05-19T00:00:00Z',
...                            end='2009-05-19T23:59:59Z',
...                            dir='newer',
...                            prop='ids|timestamp|flags|comment|user|content')
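
The start and end strings don’t have to be typed by hand. This sketch builds them for a whole day with datetime.strftime; the helper name `day_range` is made up, and it assumes the date you pass in is already in UTC.

```python
from datetime import datetime, timedelta

MW_FORMAT = "%Y-%m-%dT%H:%M:%SZ"   # MediaWiki timestamp format, always UTC


def day_range(day):
    """Return (start, end) timestamp strings covering one UTC day."""
    start = datetime(day.year, day.month, day.day)
    end = start + timedelta(days=1) - timedelta(seconds=1)
    return start.strftime(MW_FORMAT), end.strftime(MW_FORMAT)


start, end = day_range(datetime(2009, 5, 19))
print(start)
print(end)
```

The two strings can then be passed as the start and end arguments in the call above.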

And here’s how to get all the edits of any given user. Let’s look at SoxBot from the revisions above:

>>> contribs = site.usercontributions(u'SoxBot')
>>> for counter in range(2):
...     rev = contribs.next()
...     pprint(rev)
... 
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 17244650,
 'revid': 290942689,
 'timestamp': (2009, 5, 19, 13, 44, 26, 1, 139, -1),
 'title': 'User talk:Twinzor',
 'top': '',
 'user': 'SoxBot'}
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 21352732,
 'revid': 290942678,
 'timestamp': (2009, 5, 19, 13, 44, 23, 1, 139, -1),
 'title': 'User talk:Turco85',
 'top': '',
 'user': 'SoxBot'}

Notes

  1. MediaWiki timestamp strings can be generated using "%Y-%m-%dT%H:%M:%SZ" as format string with Python’s datetime.strftime. All timestamps must be in UTC.

  2. You can pass a combination of parameters to page.revisions() to get revisions the way you want them. You can even skip the dates and call with startid or endid = any revision number (see revid in the output), to retrieve revisions before or after that one.

  3. To look at what parameters the page.revisions() and site.usercontributions() functions take, use Python’s built-in help browser:

    >>> help(page.revisions)
    >>> help(site.usercontributions)
    

Hope that’s enough to get you started. In subsequent posts I’ll explain how we can use this to analyse Wikipedia revision history.

Analysing Wikipedia

Earlier this year I began a project with the Centre for Internet and Society analysing editing behaviour on Wikipedia. We’re hoping to find statistical patterns indicative of pack editing behaviour, and to build tools for editors to counter vandals. My collaborator Hans Varghese Mathews does the heavy statistical analysis. I do the code.

Here’s what we’ve got so far:

The blog

The project’s official blog is at CIS. Current posts include an introduction and a description of how Hans did his first statistical analysis (also available as a PDF rendering from the original in LaTeX).

The code

For our first two attempts, we looked at (a) an article’s editors and all other articles they edited in a given time period, and (b) the words they added or removed from an article. The code’s written in Python using the mwclient library. Source code for the two analyses, under the liberal New BSD license.

Other work

There’s a ton of interesting work around Wikipedia ranging from edit stream analysis to estimation of article quality to understanding Wikipedia’s influence on popular culture. Here’s my collection of references. We’d like to build on existing work but not replicate it, so keeping track is important.

What’s next

I’ll describe how the existing two pieces of code work in the following posts, leading up to a third analysis. You can follow the posts in this blog either via the Research category or with the cis-wikipedia tag. Both categories and tags on my blog have their own feeds that you can subscribe to. You’ll see a link to the feed in the right-hand corner of your browser’s address bar.

Housing and mobility

Isn’t it remarkable that you can lock up a house and return one or two weeks later, and find that everything is still in order? No rainwater leaking on to the floor, no bathroom taps running, no mold growing out of the fridge, and your broadband modem not blown up by lightning?

Leaving home with no one to look after it is as simple as clearing the fridge of perishable food, turning off the lights and locking the door. You don’t even need to pack food for the journey anymore.

Was this even possible two decades ago? Could you make a trip when you felt like it, without elaborate preparation? How much has this contributed to changing our notions of social mobility?

Remembering LiveJournal

Pradeep Gowda wrote to me on Twitter:

I was reading through your old journal entries circa 2000. Looks like you were a tweeter even back then .. ;)

Here’s a screenshot of my desktop from April 2000 (click for full size):

LoserJabber

See that text box with a submit button? That was LoserJabber (since renamed LogJam), the LiveJournal client for Linux. It was designed to sit in a corner of your screen so you could type into it every once in a while to describe what you were doing. That’s how everyone posted back then.

LiveJournal was the Twitter of ten years ago! Seriously, the number of things that came out of LiveJournal – the activity-oriented social graph, event sync, memcached, OpenID, threaded commenting, userpics – make it worthy of far more respect than it gets these days.

(Aside: yes, that’s GNOME 1.0 in that screenshot, and yes, it had brushed metal long before Mac OS X.)

On focus

Last week I met Abhimanyu Chirimar and in the course of a long, winding conversation, said that if there was anything I had learnt, it is that it is not possible to multi-task. One can do only one thing at a time well. He didn’t buy it but let it pass. Tonight, this quote:

The Technician is the doer.

“If you want it done right, do it yourself” is The Technician’s credo.

The Technician loves to tinker. Things are to be taken apart and put back together again. Things aren’t supposed to be dreamed about, they’re supposed to be done.

If The Entrepreneur lives in the future and The Manager lives in the past, The Technician lives in the present. He loves the feel of things and the fact that things can get done.

As long as The Technician is working, he is happy, but only on one thing at a time. He knows that two things can’t get done simultaneously; only a fool would try. So he works steadily and is happiest when he is in control of the work flow.

As a result, The Technician mistrusts those he works for, because they are always trying to get more work done than is either possible or necessary.

— Michael E. Gerber in The E-Myth Revisited, describing the small business owner’s three main personalities of Entrepreneur, Manager and Technician, and how they conflict with each other.

Abhi’s an entrepreneur and I’m a technician. Gerber’s book is for technicians who decide they want to be entrepreneurs, only to find it conflicts with their technician personality.

On markup

I was looking up markup syntaxes recently and realised that one of the reasons I find writing in HTML tedious – apart from the verbosity – is that annotations are loaded up front. Consider this example dummy link pointing back at this post. If I were writing in HTML, I’d write:

Consider this <a href="/2009/05/15/on-markup">example dummy link</a>
pointing back at this post.

Contrast with popular markup languages, say Markdown (in two variations):

1. Consider this [example dummy link](/2009/05/15/on-markup)
   pointing back at this post.

2. Consider this [example dummy link][link] pointing back at this post.

[link]: /2009/05/15/on-markup

Or reStructuredText (two variations):

1. Consider this `example dummy link </2009/05/15/on-markup>`__
   pointing back at this post.

2. Consider this `example dummy link`__ pointing back at this post.

__ /2009/05/15/on-markup

Notice how in HTML, the link target, which is an annotation on the text, appears before the text itself? It forces you to break the flow of thought and switch from writing mode to editing mode.

It gets worse when you have more content between the opening and closing.

Some background: I’ve spent a few weeks working on Zine, the blog engine powering this blog. Zine supports multiple markup formats, with the ability to add more via a plugin API. This post is written in Markdown, older posts are in reStructuredText, while the posts imported from LiveJournal use a parser I contributed that converts LJ’s markup to HTML. Because parsers cannot be trusted to be time-efficient, they are called once and the results cached in the database alongside the original text. The cached version is shown when a page is rendered.
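
The parse-once idea can be sketched in a few lines. This is not Zine’s code: the `Post` class, its `save` method and the toy parser are all hypothetical, and the `rendered` attribute stands in for the cached database column.

```python
class Post(object):
    """Toy post that renders its markup when saved, not when displayed."""

    def __init__(self, parser):
        self.parser = parser     # any callable turning markup into HTML
        self.source = None       # original text, kept for later editing
        self.rendered = None     # cached parser output

    def save(self, source):
        # The (possibly slow) parser runs once per save.
        self.source = source
        self.rendered = self.parser(source)

    def body(self):
        # Page views only ever read the cached result.
        return self.rendered


calls = []


def toy_parser(text):
    """Stand-in for a real markup parser; records each call."""
    calls.append(text)
    return "<p>%s</p>" % text


post = Post(toy_parser)
post.save("Hello")
for _ in range(3):
    html = post.body()
print(html)
print(len(calls))
```

Keeping the original text next to the cached HTML means a post can still be edited in the markup it was written in.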

Most blogs also have a feature to show only part of a post on the index page and the full text on the post’s own page. In WordPress, this happens when the user places a <!--more--> tag in their text, while LiveJournal uses <lj-cut text="optional cut text">. Zine needs a way to handle this across markup parsers. It does it by implementing a derivative of HTML called, appropriately, Zine Extensible Markup Language, which is basically the same thing as HTML but with a new <intro> tag and some other small features. If you’d like to introduce with a paragraph, cut, then write the rest, it’ll be something like this in ZEML:

<intro>
  <p>
    This is my introductory paragraph.
  </p>
</intro>
<!-- At this point, Zine will cut the post with a Read More link. -->
<p>
  And this is the rest of the post.
</p>

But what if you have a long intro leading to an enticing cut-line? You’ll get:

<intro text="and now, unveil the mystery...">
  <p>
    Blah, blah, blah and blah
  </p>
  <p>
    ...
    ...
    Even more blah
    ...
    ...
    Blah up to a compelling lead-in.
  </p>
</intro>

Whoops! The line that should have been last in your writing sequence moved all the way up to the top.* This is like top-posting in email, but worse: you write your own lines in reverse order.

Consensus in the CMS industry seems to be heading towards discouraging HTML markup and WYSIWYG editors, ushering users instead towards markup languages and limited-ability WYSIWYM editors. I can see why.

* Zine doesn’t actually support the text attribute on the intro tag, but I may add it.

Transaction alerting

After paying my utility bills this morning, I pulled out my phone, half-expecting an SMS alert confirming the transaction. None arrived. Of course, I told myself, I had paid in cash. There was nothing correlating the cash to my phone number. The bills were for a rented apartment; not in my name.

And yet, it was a bit disconcerting. Transaction alerts for card transactions, ATM withdrawals and mobile bill payments/top-ups are so commonplace now that it feels incomplete to not receive one.

Why is this so? After I had handed over cash at the payment centre, I had waited to collect a paper receipt. The receipt represented several things:

  1. Confirmation that the cash was going to the utility company and not the cashier’s pocket;
  2. To a younger, adolescent self, confirmation to a parent that I had completed the assigned task;
  3. A record for my accounts and for applicable reimbursements; and
  4. The sense that if a piece of paper with some numbers printed on it could be acceptable to all parties, then all is normal.

The SMS alert meets all but the third and adds one more representation:

  • Confirmation that the transaction loop is complete, that the payment has gone into the authoritative record, not just the cashier’s computer.

Something to think about when designing for user experience.

QotD 2

I’m on a TED Talks marathon tonight:

Now there’s another play history that I think is a work in progress. Those of you who remember Al Gore during the first term and then during his successful but unelected run for the presidency, may remember him as being kind of wooden and not entirely his own person, at least in public. And looking at his history, which is common in the press, it seems to me, at least, looking at it from a shrink’s point of view, that a lot of his life was programmed.

Summers were hard hard work in the sea(?) heat of the Tennessee summers; he had the expectations of his senatorial father and Washington, DC, and although he had certainly, I think, capacity for play because I do know something about that, he wasn’t as empowered, I think, as he now is, by paying attention to what is his own passion and his own inner drive which I think has its basis in all of us in our play history.

So I would encourage you on an individual level to do, is to explore backwards as far as you can go to the most clear, joyful, playful image that you have whether through a toy or on a birthday or on a vacation, and begin to build from the emotion of that into how that connects with your life now, and you’ll find you may change jobs, which has happened to a number of people when I had them do this, in order to be more empowered through their play, or you’ll be able to invent(?) your life by prioritising it and paying attention to it.

Stuart Brown on the importance of play (at about 17:00 min), describing Al Gore’s personality transformation as he moved on from politics (emphasis mine).

There’s always time to vote

From the European Parliament’s YouTube channel. Funny. Contrast with Jaagore’s campaign in India which sought to project shame on the apathetic voter:

QotD

There’s an old African proverb that says, “If you want to go quickly, go alone. If you want to go far, go together.” We need to go far, quickly.

— Al Gore, speaking on climate change at TED (emphasis mine).

Quickly or far, alone or together? That’s a decision conundrum that applies to so much in life.

New blog

Another year, another migration. Unlikely to be the last ever, but such is life. I’ve had a great time hacking on Zine the last few weeks, so this has been very worthwhile. I’ve backed up my LiveJournal here. I don’t plan to call it quits there, but given how often I write anywhere, that doesn’t mean much. The moblog came too, but because LiveJournal doesn’t yet support exporting comments in communities, it came sans the commentary.

Microwave for ice cream?

The user guide on this microwave oven suggests “reheating” to -10C for ice cream. Who’s this made for? Eskimos?
Image from phone camera.

Bangalore traffic alert - ha ha ha!

Sorry for the bad angle. The sun was directly behind this.
Image from phone camera.

Coconut bouncer

What a great idea! Falling coconuts are a serious hazard in residential areas. My bedroom window got smashed in earlier this year. This thing looks like a bouncer more than a catcher though. How does the coconut bouncing physics work out?
Image from phone camera.

Assumption Hospital?

How does one come up with a name like that?
Image from phone camera.

Thanks, appa!

Image from phone camera.