Archive for August 2009

Analysing Wikipedia: caching data

I haven’t posted about Wikipedia in a while. Hans went to Ladakh right after I returned, so we’re only now getting around to analysing the data we collected in July.

Our biggest hassle with any kind of analysis is how long it takes to retrieve data. Pulling the data for a full text analysis of a few hundred revisions of a large page could easily take an hour. If that analysis doesn’t produce satisfying results, attempting a variation requires pulling that data all over again, because we have no cache.

I use the mwclient library, which provides a thin wrapper representing MediaWiki queries as lazy (?) Python sequences. Since this sequence could be cached, I’ve been considering strategies (some of this assumes familiarity with mwclient):

  1. Implement a simple Python dictionary cache around the mwclient API, saving the query→result mapping as a pickled dump and consulting that before hitting the servers again. This is easy, but since the sequences are lazy, the data isn’t available for caching until the code tries to access it, so the cache has to intervene at that point. All my analysis code must now be written for two APIs, mwclient’s and my cache’s.

  2. Alternatively, do the same thing but as a patch to mwclient’s code so there’s a single external API. This requires understanding how it works and maintaining patches against upstream changes, which takes time away from analysis.

  3. Do it outside. Set up Squid or another caching proxy to cache everything regardless of HTTP headers, and make queries through it. This is easy to set up, but grossly inefficient. Proxy servers understand request→response mapping, not sequences. If I ask for a subset of an earlier sequence, the proxy will treat it as a new request. Sequences require special treatment:

    • There could be newer edits on the site, making the sequence’s beginning and end markers stale.

    • A new query may ask for overlapping results (typically, a query from a fixed starting point to current time). The cache should be able to join sequences instead of duplicating data.

    • A query may ask for the same time range as an earlier query, but with additional properties (typically, the full text of each revision). These additional properties should be merged into the cache.

  4. Drop this approach altogether and get a static dump of the Wikipedia database. But a full text dump of the entire revision history of the English Wikipedia is 150 terabytes. The resource requirements will take us out of the realm of a hobbyist project.
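A minimal sketch of strategy 1, the pickled dictionary cache. The `QueryCache` class and the `fetch` callable are hypothetical names for illustration. No mwclient calls are shown; the only assumption is that whatever `fetch` returns can be iterated and its items pickled:

```python
import os
import pickle


class QueryCache:
    """Dictionary cache mapping a query key to a fully materialised result list."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        # Reload an earlier dump, if one exists, before hitting any server.
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.data = pickle.load(f)

    def get(self, key, fetch):
        """Return the cached result for key, calling fetch() only on a miss."""
        if key not in self.data:
            # list() forces the lazy sequence, so there is something to store.
            self.data[key] = list(fetch())
            self.save()
        return self.data[key]

    def save(self):
        """Write the whole query -> result mapping out as a pickled dump."""
        with open(self.path, "wb") as f:
            pickle.dump(self.data, f)
```

The key can be any hashable description of the query, such as a tuple of page title and requested properties. Forcing the sequence with `list()` is where the cache has to intervene, and it is also why analysis code ends up calling `cache.get(...)` rather than the wrapped library directly.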

Given that data retrieval time has become a serious hobble, it seems worth tackling this head-on. A custom cache API could:

  1. Be sequence aware. Treat each MediaWiki article as being a sequence of unknown start and end, of which fragments are available in the cache. Join sequence fragments as data gaps are filled in, leading to one single sequence for the page’s entire revision history.

  2. Store additional properties on each revision. MediaWiki does not store diffs between revisions, but the cache could, since much full text analysis is based around the changes introduced by each revision. Properties could also be flags marking revisions as, for example, vandalism, or the following reversion.

  3. Based on the above, store alternate sequences and properties specific to them. For example, a revision sequence of an article that skips all vandal/reversion revisions and stores edit diffs without them. Without this, an editor whose sole contribution was to revert vandalism will come out appearing to have added a lot of new material.
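The first two points above can be sketched roughly as follows. This is an illustration, not a worked-out design: `RevisionCache` and its methods are hypothetical names, and it assumes that revision IDs increase over time within a page and that each fragment contains every revision of the page between its first and last ID.

```python
class RevisionCache:
    """Cache of one page's revision history, held as joined fragments."""

    def __init__(self):
        self.revisions = {}  # revid -> dict of properties
        self.fragments = []  # sorted, non-overlapping (first_revid, last_revid)

    def add_fragment(self, revisions):
        """Add a contiguous run of revisions, each a dict with a 'revid' key."""
        if not revisions:
            return
        for rev in revisions:
            # Merge rather than overwrite, so a later query that adds
            # properties (such as full text) enriches the existing entry.
            self.revisions.setdefault(rev["revid"], {}).update(rev)
        ids = [rev["revid"] for rev in revisions]
        self.fragments.append((min(ids), max(ids)))
        self.fragments = self._join(self.fragments)

    @staticmethod
    def _join(fragments):
        """Join fragments whose revid ranges intersect."""
        joined = []
        for first, last in sorted(fragments):
            if joined and first <= joined[-1][1]:
                joined[-1] = (joined[-1][0], max(joined[-1][1], last))
            else:
                joined.append((first, last))
        return joined

    def covered(self, first, last):
        """Return True if a single fragment spans the whole revid range."""
        return any(f <= first and last <= l for f, l in self.fragments)
```

A query would first check `covered()` and go to the server only for the missing range. Fragments that merely sit next to each other without sharing a revision are left unjoined, because the cache cannot tell whether an unseen revision lies between them.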

A web service implementing this API will, over time, be able to respond to queries in near real time, making it possible to build a web interface where anyone can submit a query. The public web interface is one of our eventual goals for this project.

I’ll post updates as I work out the technical architecture for this API. I’m considering using one of the newfangled key/value pair databases but have no experience with them. Recommendations are welcome.

Upper Dharamsala on a rainy day

I found myself in Mcleodganj last May, in the company of TB Dinesh and Guillaume Marceau. Dinesh wanted to pick up some luggage a friend had left behind in nearby Dharamkot, so off we went up the hill.

Man, was it hard! The incline could have killed me. I was out of breath and my feet ached. My camera bag felt like a huge burden. I had to stop at every turn to catch my breath and rest for minutes. When we finally reached Dharamkot, I refused to leave the tea shop for the next couple of hours. We hung around ordering several rounds of tea and snacks. Dinesh and Guillaume then wanted to walk further, so I reluctantly tagged along. The body ached but the mind couldn’t refuse the challenge. We walked all the way uphill, past prayer flags in the woods, past a shrine to the earlier Panchen Lama, the current Dalai Lama’s late teacher, up to the top, and down again through the Tibetan Children’s Village, along the water pipeline, back to Mcleodganj.

Photos: TB Dinesh and Guillaume Marceau; Sanjay's, Dharamkot; Hike in the Hills; Tibetan Prayer Flags; Tibetan Children's Village, Upper Dharamsala.

I had cramps the next day. When I returned to Bangalore and checked my weight, I was down two kilos. In a day’s walk.

And so, a year later and halfway through this year’s resolution to improve health, I had to check again. Was it really so bad, or was I just so out of shape? Has all the cycling in Bangalore and walking in Ladakh’s thin air helped at all?

It has: the walk this time felt like a casual stroll through the woods.

Photo: Upper Dharamsala on a rainy day, from atop the hill overlooking the Tibetan Children’s Village (off to the left).

Nikon D70 + kit on sale

I’m putting my Nikon D70 camera and entire kit up for sale. It’s served me well over the years and I’m now ready to move on. Here are the kit contents, what they cost me, and what they’ll cost you if you buy new or individually second hand from eBay (some prices are approximate guesses; all prices are in Indian Rupees).

Item                      Purchased  Original Price  Current New  Second Hand
Nikon D70 body            2004               47,500  NA (11,500)       11,500
Nikon 50mm f/1.8 D        2004                5,000        6,000        4,000
Sigma 18-50 f/2.8 EX DC   2007               21,000       23,000       15,000
Nikon IR remote ML-L3     2005                1,000          630          500
52mm circular polariser   2004                1,050        1,050          800
67mm circular polariser   2005                1,485        1,400        1,100
Sandisk 2GB CF card       2004                4,000          800          500
Sandisk 2GB Extreme III   2007                2,300        1,300          800
LowePro Photo Runner bag  2005                2,000        3,000        1,500
Total                                     Rs 85,335    Rs 48,680    Rs 35,700
My Offer                                                            Rs 25,000

Extras: The D70’s rechargeable battery and charger, a spare higher capacity battery, a lithium cell holder with three CR2 cells, 52mm rubber hood for the 50mm lens, 67mm petal hood and carrying case for the Sigma lens, spare 128 MB CF card and card holder, a compact CF card reader, USB cable, UV filters on both lenses, rubber blower and nylon brush for cleaning the CCD, and a lens filter holder with space for six that fits into the LowePro bag.

If you were to buy all this second hand one piece at a time, it would cost you ~Rs 35,000. My offer price: Rs 25,000.

I took the D70 to Nikon’s service centre late last year to fix accumulated wear and tear. They replaced its power switch, CF card slot and rear-side outer body at a cost of Rs 5,220. The camera has travelled only twice since. As a result, it is in much better shape than its age would suggest.

I’ve used this camera for practically every picture I’ve made in the last five years (barring July’s Manali and Ladakh trip). If you’ve liked my pictures in the past, this is the equipment that made them possible. Here’s a sample gallery.

Buyer must collect in person. I’m currently in Mcleodganj, Dharamsala, carrying everything except the polariser filters and will be in New Delhi later this week before returning to Bangalore. Questions? Leave me a comment.

Update: Sold!

QWERTY-be-gone

Much of the debate around modern mobile handsets centres on the text entry mechanism. If you’ve gotten used to a device with a QWERTY pad, will you be able to go back to T9? How can anyone touch type on a device with an on-screen keyboard? Will haptic feedback make them just as usable?

All these debates (except around T9) make one fundamental assumption: keypads can only use the QWERTY layout. This is where one must take exception. QWERTY is a 140 year old standard with a seemingly random layout of letters that were arranged to avoid mechanical jamming in the technology of the 1870s. Generations of typists have grown up with memories of their baffling first encounter with the layout, something they had to learn because that’s the way it’s always been done.

To shrink that same layout down to under three inches and call it state of the art is just bizarre. Keyboards are from an era where the technology of the day demanded a two dimensional layout. We’re no longer constrained by that technology. Look at your fingers, folks. Look at how amazingly dextrous each of them is, how capable of independent movement each is. Look at how you can hold a pen to paper and make coordinated muscle movements across fingers to write. Keyboards take no advantage of this ability. A keyboard is a flat, rectangular layout with a key for each symbol, where your only possible interaction with that key is to press it down.

Rather than shrinking that rectangular layout, why not change the possible interactions with each key? How about if you could both press down and up? What if you could record interaction with every joint in your finger, instead of just the tip?

Chorded keyboards and keyers have been around for decades, but have failed to gain mainstream acceptance because (a) there’s a learning curve, and unlike the learning curve of QWERTY, there’s no incentive to scale it, and (b) as a result, there aren’t enough users to establish a standard from among the competitors.

But for the first time in the history of digital text communication, more people now use a phone than a regular computer as their primary means of communication. The time is ripe for QWERTY’s mobile successor to be born.