Archive for 2009

A year in recap

Long time, no post. So much to say, but where’s the time to write with all this activity? Remind me to post on:

  • What it cost me to take a year off,
  • What I’ve been reading through these months,
  • What I did with the time and how I ended up doing each, and
  • What I’m up to now, back here in the land of the gainfully employed.
The view from my new office window.

Netbook theme for Ubuntu

Upgraded to Karmic last night. The refresh of the Human theme is quite nice, but the bright orange icons no longer work, so I made a quick remix. Download:

Both versions are designed for 1024×600 netbook screens. For best results, you should also install maximus and window-picker-applet, and set up a single panel at the top containing the applet.

Installation

Go to System → Preferences → Appearance and install from there, or better, extract the tarball to /usr/share/themes as root. The latter will get it to work for system applications too.

Unicode precomposition and decomposition

As a result of recent Mac troubles, I moved my iTunes library to a Linux file server and set up iTunes on my old TiBook to access the library over an AFP share using netatalk.

This worked unexpectedly well, until I noticed something very odd: I could no longer access any file whose name contained an accented character such as “é”. These files showed up in directory listings but were not readable. The filesystem complained that the file just did not exist. After a whole evening lost trying to find fault with everything from Mac OS X to netatalk, I found myself in unfamiliar Unicode territory:

It turns out there are two ways to represent certain accented characters such as “é” in Unicode: either as a single code point (U+00E9, “latin small letter e with acute”) or as a regular ASCII character “e” followed by a combining diacritical mark (U+0065, “latin small letter e” followed by U+0301, “combining acute accent”). The first form is known as “precomposed” and is the standard for filenames on Linux, while the latter “decomposed” form is standard on Mac OS X.

The Mac approach is unusual but has the advantage of making accent-insensitive search easier. A string search for “cafe” will also match “café” because the last character is really two; “cafeteria” can match “caféteria” if one simply strips out diacritical marks. Doing this with precomposed strings is much harder. (Thanks to @deepakg for identifying this.)
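Here is a minimal Python sketch of the two forms, using the standard unicodedata module. NFC and NFD are the Unicode normalisation forms that correspond to precomposed and decomposed; strip_accents is a hypothetical helper written for this illustration, not something from netatalk or Mac OS X.

```python
import unicodedata

precomposed = "caf\u00e9"   # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301, combining acute accent

# The two strings look the same on screen but are different sequences.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 4 5

# Normalising converts one form into the other.
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True


def strip_accents(text):
    """Hypothetical helper: decompose (NFD), then drop combining marks."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )


# A plain substring search already matches the decomposed form.
print("cafe" in decomposed)     # True
print("cafe" in precomposed)    # False
print(strip_accents("caf\u00e9teria"))   # cafeteria
```

The mismatch on the first comparison is the same one that made the files unreadable: the name looked right in a listing but was a different sequence of code points from the one being requested.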

Mac OS X enforces the decomposed form for filenames, but Linux doesn’t. On Linux, precomposed UTF-8 is expected but not enforced. The netatalk AFP server recognises this difference and transparently translates filenames between what it calls UTF8 and UTF8-MAC. This is where I ran into trouble. I had transferred my files using rsync and ended up with decomposed filenames on Linux. These showed up fine over AFP, but when Mac OS X attempted accessing them, netatalk did the transparent translation to precomposed names and could no longer find the files. The solution? Rename all files on the Linux side:

convmv -r -f utf-8 --nfd -t utf8 --nfc ./* --notest

And in future, when rsyncing files from Mac OS X to Linux, ask it to translate the filenames with this additional option (reversed for Linux to OSX):

rsync --iconv=UTF8-MAC,UTF8

Dead Mac

My Mac’s display died without warning one day last month. I was using it when blocks of randomly coloured pixels appeared on screen, obscuring the display. Rebooting didn’t fix it, nor did turning the power off to let it cool for several hours. One of the internal fans had flaked out earlier and was in the habit of refusing to spin up every once in a while, so I suspect a burnt chip from overheating. I can no longer see enough of the display to boot up and log in, but the machine continues to be fully functional when accessed over the network.


Being occupied with other things, I put the machine aside for a few weeks and finally opened it last weekend to take a look inside. The right side fan was dead. It appears to have lost its magnetic charge as a result of my previous attempts to clean it and no longer spins comfortably when flicked with the finger. I couldn’t tell what was wrong with the graphics, however, so I decided to call Apple Support. The machine is a little over three years old and well past any sort of warranty period.

Mac Guts

Apple’s phone support directed me to the Ample Imagine store at Forum Mall in Koramangala. They said they’d have a look and tell me if it was fixable, but would charge Rs 750 for the inspection. I agreed and left my Mac with them. They called back yesterday and said they’d have to replace the logic board at the cost of Rs 35,000. Given that this is nearly half the price of a new Mac, I decided to save my money and use the machine as a display-less network server. This should have been the end of the affair, until I went in today to pick it up and noticed the job sheet:

The engineer’s comments said he had tried resetting the PRAM and connected an external display, but since that didn’t fix it, he had decided it was a logic board problem and suggested I get a new one.

What? That’s it? A diagnosis costing Rs 827 (750 + taxes) without even opening the machine? For all I knew, some chip could have had its soldering melted and come loose because of the overheating. It could even be just a loose connector on the board. Who trains these guys?

Now, there’s something to be said about this particular model. My Mac is a first generation Intel and unlike all Apple laptops that came before and after it, this one is not user upgradeable. It can’t be opened without literally cracking open the case, a process which leaves visible scars in the front, below the trackpad. I went through this process two years ago when upgrading the hard disk and spent over an hour gently tugging and wriggling a screwdriver to pry it open. An Apple engineer not aware of this history should have called me to confirm he could do this because of the risk involved, but no one called. There’s no way anyone could have opened it and failed to record that in the job sheet. They quite certainly didn’t.

This incompetence is appalling. I feel like I’ve been scammed out of my money. These engineers seem to be trained to make diagnoses for machines within warranty, but not for anything requiring a real examination. Dear Apple: if you want to be a serious contender in India, you had better get your act together.

And for what it’s worth, I’m now on my own trying to get this fixed. I suppose I could start ordering parts off eBay and try my luck with guessing exactly what is broken, but it would help to have (a) real expert diagnosis and (b) a way to avoid wrangling with Indian customs when importing parts.

Do you know anyone I should be talking to? Or, know anyone with a dead MacBook Pro of the same period (Intel; pre-Unibody) who’d be willing to palm it off to me for spare parts?

I’m not as badly off as I could have been because I made a serious habit a couple of years ago of backing up everything, including having backup machines (currently an ASUS Eee PC 1005HA running Ubuntu Jaunty), but the machine’s absence is clearly felt, and I don’t have the budget for a new Mac until next year.

Update: As of January 2010, the Mac is working again. I got a new (used) logic board off eBay and opened the Mac to replace it, then figured I should test assembly on the old board first, just in case I damage anything. Surprise, surprise, the old board worked again! The new board however didn’t, so I sent it back and got a refund.

Disabling the alarm on APC UPSes

UPS Alarm
From the wonderful Fly, You Fools! webcomic by Saad Akhtar. Read the full strip.

You know what I mean. What were they thinking? Here’s a helpful explanation by an APC employee:

I understand your concern with not wanting to be woken up at 2am to be alerted that power has gone out in your residence. I use the software at home to disable the audible tone as well, however, I think taking a look at it from a different approach may be ideal. Is the UPS your source of power for your alarm clock in the morning? What would occur if you were to have to wake up at a specific time during the week, and your alarm clock, which is not powered by your UPS, powers off due to a blackout, even if it is momentary? I think it would be ideal in this scenario that the UPS wakes you to notify you of a power failure. That would allow you to possibly find an alternate source of power for the alarm clock, or, if power is to be restored within a reasonable period of time, to reset your clock so that you wake up on time.

Right. That’s why. That horrible shriek is meant to wake you up. If, like all real people, you have an alarm clock that runs on batteries and prefer a full night’s sleep, it turns out that you can disable it. This works on most common APC UPS models with the USB cable. Windows users should install APC’s PowerChute software. It apparently has an option somewhere to turn it off. On Linux, the apcupsd package will do it for you (make sure to plug in the USB cable first):

Read on...

What’s happening to our online communities?

Supriya Thanawala of the Hindustan Times wrote in asking if I had noticed how online community spaces over the years have grown to discourage pseudo-anonymous identities. I responded noting several trend lines:

  1. Internet adoption is growing, making governments increasingly conscious that this is a new space they ought to be governing. That’s where the cyber cells and ISP IP logging come from.

  2. Any medium where an individual can be reached with little effort will be misused. Postal mail has junk marketing, telephones have telemarketers, email has spam, each cheaper than the previous. As the medium grows and becomes a worthwhile channel for junk messages, service providers come under increasing pressure to keep it usable for normal users. They do this by either requiring some real life id (such as by your ISP) or by limiting your use of the service (such as mailing list providers that limit the number of people you can directly add to your new list).

  3. The web is a public medium. Anybody can see anything posted there. The web is also very large, so resource discovery, and not access, becomes of primary importance. Blogging became popular because of this curious nature of the web. A blog was both private because nobody would find it until they got referred to it somehow, and public because you could always share the link. Online spaces felt like intimate communities in the early days because there were so few people online and you either knew who they were, or guessing that became an interesting game. As that count grew, partitioning spaces became important. Today’s Facebook is more or less private. You decide who your friends are and only they can see what you write. The rest of the web can’t.

  4. Early blog+social networking spaces like LiveJournal and Friendster have been grappling with anonymity and fake identities for a long time. Here’s something I wrote a few years ago. Some have attempted banning them outright, while others have tolerated them but ended up with mixed results (see this for a particularly entertaining example – those profiles originated in a very non-funny flamefest elsewhere, after which their makers decided to keep them going for a while). Facebook has taken the more pragmatic approach, allowing for the creation of “pages” distinct from profiles that users can interact with.

  5. Facebook arrives at a time when the web is increasingly seen as having little direct revenue value. Money is made via advertising, not from users paying up (in contrast, LiveJournal was profitable for several years because users paid for accounts with extra features; Flickr runs on the same model). The Pages feature on Facebook is largely seen as a marketing vehicle for a film or a product that users pay for off the web. This brings in marketing language, sanitised humour when there is any (notice that TV sitcoms are never as funny as the spontaneous writing of the Aaj Sholay community), and a referral to everything by a real world name in a manner that respects trademarks and copyrights.

So where is all the anonymity and creativity going now? It exists as always; it’s just out seeking new corners for itself away from the public eye.

(I suspect some of this isn’t quite true anymore, but I haven’t been thinking about it. Your thoughts?)

Why so jobless?

Years ago, at my first job, I attended an annual day talk by the founder and chairman of the group of companies. The man had started from scratch and built a 200 crore conglomerate with a range of business interests. This was supposed to be a pep talk about what visions he had for us, the newest company in the group, but he couldn’t help starting with a little about himself. His greatest pride in life? That he had never had a job working for someone else.

Well, so much for holding on to mine. Related reading (via @thej).

Analysing Wikipedia: caching data

I haven’t posted about Wikipedia in a while. Hans went to Ladakh right after I returned, so we’re only now getting around to analysing the data we collected in July.

Our biggest hassle with doing any kind of analysis is with how long it takes to retrieve data. A full text analysis of a few hundred revisions of a large page could easily take an hour to pull. If that analysis doesn’t produce satisfying results, attempting a variation requires pulling that data all over again, because we have no cache.

I use the mwclient library, which provides a thin wrapper representing MediaWiki queries as lazy (?) Python sequences. Since this sequence could be cached, I’ve been considering strategies (some of this assumes familiarity with mwclient):

  1. Implement a simple Python dictionary cache around the mwclient API, saving the query→result mapping as a pickled dump and consulting that before hitting the servers again. This is easy, but since the sequences are lazy, the data isn’t available for caching until the code tries to access it. The cache has to intervene then. All my analysis code must now be written for two APIs, mwclient’s and my cache’s.

  2. Alternatively, do the same thing but as a patch to mwclient’s code so there’s a single external API. This requires understanding how it works and maintaining patches against upstream changes, which takes time away from analysis.

  3. Do it outside. Set up Squid or another caching proxy to cache everything regardless of HTTP headers. Make queries through this. Easy to set up, but grossly inefficient. Proxy servers understand request→response mapping, not sequences. If I ask for a subset of an earlier sequence, it’ll treat it as a new request. Sequences require special treatment:

    • There could be newer edits on the site, making the sequence’s beginning and end markers stale.

    • A new query may ask for overlapping results (typically, a query from a fixed starting point to current time). The cache should be able to join sequences instead of duplicating data.

    • A query may ask for the same time range as an earlier query, but with additional properties (typically, the full text of each revision). These additional properties should be merged into the cache.

  4. Drop this approach altogether and get a static dump of the Wikipedia database. But a full text dump of the entire revision history of the English Wikipedia is 150 terabytes. The resource requirements will take us out of the realm of a hobbyist project.
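To make the first strategy concrete, here is a rough sketch of a pickled dictionary cache. QueryCache and fake_revisions are hypothetical names made up for this illustration; fake_revisions stands in for a lazy mwclient query so that no real mwclient call is assumed.

```python
import os
import pickle
import tempfile


class QueryCache:
    """Hypothetical query -> result cache, persisted as a pickle file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.data = pickle.load(f)

    def get(self, key, fetch):
        """Return the cached result for key, calling fetch() on a miss."""
        if key not in self.data:
            # list() forces the lazy sequence so there is something to store.
            self.data[key] = list(fetch())
            self.save()
        return self.data[key]

    def save(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.data, f)


def fake_revisions():
    """Stand-in for a slow, lazy query (not a real mwclient call)."""
    yield {"revid": 1, "user": "A"}
    yield {"revid": 2, "user": "B"}


cache_path = os.path.join(tempfile.mkdtemp(), "revisions.pickle")
cache = QueryCache(cache_path)
first = cache.get(("Evolution", "revisions"), fake_revisions)   # runs the query
second = cache.get(("Evolution", "revisions"), fake_revisions)  # served from cache
print(first == second)
```

The list() call is where the laziness problem shows up: the cache has to consume the whole sequence before it can store anything, and the analysis code has to go through cache.get() rather than mwclient directly.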

Given that data retrieval time has become a serious hobble, it seems worth tackling this head-on. A custom cache API could:

  1. Be sequence aware. Treat each MediaWiki article as being a sequence of unknown start and end, of which fragments are available in the cache. Join sequence fragments as data gaps are filled in, leading to one single sequence for the page’s entire revision history.

  2. Store additional properties on each revision. MediaWiki does not store diffs between revisions, but the cache could, since much full text analysis is based around the changes introduced by each revision. Properties could also be flags marking revisions as, for example, vandalism, or the following reversion.

  3. Based on the above, store alternate sequences and properties specific to them. For example, a revision sequence of an article that skips all vandal/reversion revisions and stores edit diffs without them. Without this, an editor whose sole contribution was to revert vandalism will come out appearing to have added a lot of new material.
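A sequence-aware cache could join fragments along these lines. This is only a sketch under stated assumptions: each fragment is a hypothetical (start, end, revisions) tuple, where start and end are integer timestamps bounding the range a query covered and revisions maps a revision id to a dict of properties. None of these shapes come from mwclient or MediaWiki.

```python
def merge_fragments(fragments):
    """Join overlapping (start, end, revisions) fragments.

    Overlapping time ranges are combined into one fragment, and the
    properties of revisions seen in both are merged.
    """
    merged = []
    for start, end, revs in sorted(fragments, key=lambda f: f[0]):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous fragment: extend it and merge properties.
            prev_start, prev_end, prev_revs = merged[-1]
            for revid, props in revs.items():
                prev_revs.setdefault(revid, {}).update(props)
            merged[-1] = (prev_start, max(prev_end, end), prev_revs)
        else:
            # Copy so the caller's fragments are not modified.
            merged.append((start, end, {r: dict(p) for r, p in revs.items()}))
    return merged


fragments = [
    (20090101, 20090301, {101: {"user": "A"}, 102: {"user": "B"}}),
    (20090201, 20090401, {102: {"text": "full text"}, 103: {"user": "C"}}),
    (20090601, 20090701, {110: {"user": "D"}}),
]
result = merge_fragments(fragments)
print(len(result))          # 2
print(result[0][2][102])    # {'user': 'B', 'text': 'full text'}
```

The first two fragments overlap in time and collapse into one, with revision 102 picking up the full text from the later query. The third stays separate until a query fills the gap.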

A web service implementing this API will, over time, be able to respond to queries in near real time, making it possible to build a web interface where anyone can submit a query. The public web interface is one of our eventual goals for this project.

I’ll post updates as I work out the technical architecture for this API. I’m considering using one of the newfangled key/value pair databases but have no experience with them. Recommendations are welcome.

Upper Dharamsala on a rainy day

I found myself in Mcleodganj last May, in the company of TB Dinesh and Guillaume Marceau. Dinesh wanted to pick up some luggage a friend had left behind in nearby Dharamkot, so off we went up the hill.

Man, was it hard! The incline could have killed me. I was out of breath and my feet ached. My camera bag felt like a huge burden. I had to stop for breath at every turn and rest for minutes. When we finally reached Dharamkot, I refused to leave the tea shop for the next couple of hours. We hung around ordering several rounds of tea and snacks. Dinesh and Guillaume then wanted to walk further, so I reluctantly tagged along. The body ached but the mind couldn’t refuse the challenge. We walked all the way up hill, past prayer flags in the woods, past a shrine to the earlier Panchen Lama, the current Dalai Lama’s late teacher, up to the top, and down again through the Tibetan Children’s Village, along the water pipeline, back to Mcleodganj.

Photos: TB Dinesh and Guillaume Marceau; Sanjay's, Dharamkot; Hike in the Hills; Tibetan Prayer Flags; Tibetan Children's Village, Upper Dharamsala.

I had cramps the next day. When I returned to Bangalore and checked my weight, I was down two kilos. In a day’s walk.

And so, a year later and halfway through this year’s resolution to improve health, I had to check again. Was it really so bad, or was I just so out of shape? Has all the cycling in Bangalore and walking in Ladakh’s thin air helped at all?

It has: the walk this time felt like a casual stroll through the woods.

From atop the hill overlooking the Tibetan Children’s Village (off to the left).

Nikon D70 + kit on sale

I’m putting my Nikon D70 camera and entire kit on sale. It’s served me well over the years and I’m now ready to move on. Here are the kit contents, what they cost me, and what they’ll cost you if you buy new or individually second hand from eBay (some prices are approximate guesses; all prices are in Indian Rupees).

Item                      Purchased  Original Price  Current New  Second Hand
Nikon D70 body            2004       47,500          NA (11,500)  11,500
Nikon 50mm f/1.8 D        2004       5,000           6,000        4,000
Sigma 18-50 f/2.8 EX DC   2007       21,000          23,000       15,000
Nikon IR remote ML-L3     2005       1,000           630          500
52mm circular polariser   2004       1,050           1,050        800
67mm circular polariser   2005       1,485           1,400        1,100
Sandisk 2GB CF card       2004       4,000           800          500
Sandisk 2GB Extreme III   2007       2,300           1,300        800
LowePro Photo Runner bag  2005       2,000           3,000        1,500
Total                                Rs 85,335       Rs 48,680    Rs 35,700
My Offer                                                          Rs 25,000

Extras: The D70’s rechargeable battery and charger, a spare higher capacity battery, a lithium cell holder with three CR2 cells, 52mm rubber hood for the 50mm lens, 67mm petal hood and carrying case for the Sigma lens, spare 128 MB CF card and card holder, a compact CF card reader, USB cable, UV filters on both lenses, rubber blower and nylon brush for cleaning the CCD, and a lens filter holder with space for six that fits into the LowePro bag.

If you were to buy all this second hand one piece at a time, it would cost you ~Rs 35,000. My offer price: Rs 25,000.

I took the D70 to Nikon’s service centre late last year to fix accumulated wear and tear. They replaced its power switch, CF card slot and rear-side outer body at a cost of Rs 5,220. The camera has travelled only twice since. As a result, it is in much better shape than you would expect for its age.

I’ve used this camera for practically every picture I’ve made in the last five years (barring July’s Manali and Ladakh trip). If you’ve liked my pictures in the past, this is the equipment that made them possible. Here’s a sample gallery.

Buyer must collect in person. I’m currently in Mcleodganj, Dharamsala, carrying everything except the polariser filters and will be in New Delhi later this week before returning to Bangalore. Questions? Leave me a comment.

Update: Sold!

QWERTY-be-gone

Much of the debate around modern mobile handsets is around the text entry mechanism. If you’ve gotten used to a device with a QWERTY pad, will you be able to go back to T9? How can anyone touch type on a device with an on screen keyboard? Will haptic feedback make them just as usable?

All these debates (except around T9) make one fundamental assumption: keypads can only use the QWERTY layout. This is where one must take exception. QWERTY is a 140 year old standard with a seemingly random layout of letters that were arranged to avoid mechanical jamming in the technology of the 1870s. Generations of typists have grown up with memories of their baffling first encounter with the layout, something they had to learn because that’s the way it’s always been done.

To shrink that same layout down to under three inches and call it state of the art is just bizarre. Keyboards are from an era where the technology of the day demanded a two dimensional layout. We’re no longer constrained by that technology. Look at your fingers, folks. Look at how amazingly dextrous each of them is, how capable of independent movement each is. Look at how you can hold a pen to paper and make coordinated muscle movements across fingers to write. Keyboards take no advantage of this ability. A keyboard is a flat, rectangular layout with a key for each symbol, where your only possible interaction with that key is to press it down.

Rather than shrinking that rectangular layout, why not change the possible interactions with each key? How about if you could both press down and up? What if you could record interaction with every joint in your finger, instead of just the tip?

Chorded keyboards and keyers have been around for decades, but have failed to gain mainstream acceptance because (a) there’s a learning curve, and unlike the learning curve of QWERTY, there’s no incentive to scale it, and (b) as a result, there aren’t enough users to establish a standard from among the competitors.

But for the first time in the history of digital text communication, there are now more people who use their phone as their primary means of communication than a regular computer. The time is ripe for QWERTY’s mobile successor to be born.

Being offline

I spent most of July offline, travelling, for the most part in Ladakh. It’s hard to miss the internet in a place like this:

Contemplating the Zanskar
At the confluence of the Indus and the Zanskar.

The experience was so relieving that I’m considering spending a few more months doing this – travelling and staying offline.

Seven and a half years of Evolution

To prepare our next analysis, I parsed the Evolution page’s entire revision history for individual words added and removed. The first available revision is from December 3, 2001, making that just about seven and a half years’ worth of revisions.

Here’s the raw data file, 4.8 MB bzipped, expanding to 76.4 MB. Content format: UTC Timestamp, Revision Id, User, Add/AddStems/Del/DelStems, List of words…

The data includes both words and their stems. The stems are calculated using the Porter stemmer, without semantic context (background reading). Letter case has been preserved since I have no means to distinguish between proper nouns and sentence-beginning capitalisation. To get the list of words, I start with the article’s raw text, strip it of HTML tags, tokenise it by alphanumeric characters to get a stream of words, and then diff that against the previous revision’s word stream (the same algorithm as diff -u on the command line). A displaced word will thereby show up as both added and deleted. The tokeniser isn’t perfect: the word “isn’t” will be broken up into “isn” and “t” since the apostrophe doesn’t count as alphanumeric. Suggestions on how to make a better one appreciated.
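The tokenise-and-diff step can be sketched roughly as follows. This is not the code linked below. It assumes ASCII alphanumerics only, skips the HTML stripping and Porter stemming steps, and swaps in difflib.SequenceMatcher from the standard library, whose matching algorithm differs from the one behind diff -u. The names tokenise and word_changes are made up for this illustration.

```python
import difflib
import re


def tokenise(text):
    """Split text into runs of ASCII alphanumeric characters."""
    return re.findall(r"[A-Za-z0-9]+", text)


def word_changes(old_text, new_text):
    """Return (added, deleted) word lists between two revisions."""
    old, new = tokenise(old_text), tokenise(new_text)
    added, deleted = [], []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new[j1:j2])
    return added, deleted


print(tokenise("The tokeniser isn't perfect"))
# ['The', 'tokeniser', 'isn', 't', 'perfect']
print(word_changes("the cat sat", "the dog sat"))
# (['dog'], ['cat'])
```

The first output shows the apostrophe problem described above: “isn't” comes out as “isn” and “t”.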

Here’s the code if you’d like to try this yourself. You’ll need the other modules in the folder, the NLTK library, and the mwclient library.

Analysis to follow.

Vapour and vacuum

If you release a litre of water into the vacuum of outer space, what will happen to it?

It will vapourise instantly, just as a compressed aerosol does at Earth’s surface pressure, and in the process cool down far below freezing point. What happens to the molecules then?

Do they float away as free molecules, no longer ice? Does the crystalline structure of the ice hold them solid? Or if that is too late or not strong enough, does gravitational attraction pull them back together? Will Earth’s own gravitational pull be strong enough to bring them down?

Pictures from #socmob

I’ve posted some pictures from last month’s discussion on using social media for mobilisation, with Dina Mehta and Peter Griffin at CIS. Here’s the report and earlier Twitter feed.

Nothing significant; just some faces. Helping with attaching names to faces appreciated.

Updated CIS website

I spent the last two weeks cleaning up the website for the Centre for Internet and Society. Check it out and let me know what you think.

Book signing

“Do you read fiction?” I asked Manish.

“Huh?” he stammered. Only minutes before, I had asked if he could write Python code to generate the Fibonacci sequence, my standard test for recruits. He was trying to work that out and I was growing impatient.

“Um, yes…” he tried to answer, but I wasn’t listening. I said, “There’s a book reading at Crossword in about fifteen minutes. Let’s continue there.”

Amitav Ghosh was in town to promote his new book Sea of Poppies. I had been seeing his books on shelves for years, but hadn’t read any, being generally sceptical of Indian authors. Many years back, when each new book cost me months of savings and days of careful consideration, I had on occasion hazarded a technical book by an Indian author, and inevitably ended up bitter. For all their cover promises, the books were always fluff.

Amitav Ghosh is good, Zainab said. But Indian fiction in English? Admittedly, I hadn’t tried any. Couldn’t hurt to try, given I can afford to buy and not read a book these days.

And so that evening, I interrupted the interview and took the candidate to a book reading, asking him to think out the code and dictate it to me later. Ghosh read an excerpt from his book and discussed it with his host. I hadn’t been to a book reading before and didn’t know what to expect. When the discussions ceased and people queued up to get their books signed, I joined.

At my turn, I put two books down on the desk. Ghosh opened one and looked up expectantly, then said “Who’s it for?”

“Huh?”

Who’s it for? For myself? I was picking a copy for myself. Who could it be for?

“For Kiran,” I said.

Wait, that sounded wrong. Someone was missing. Someone who should have come first. “…and Zainab,” I hastily added. “For Kiran and Zainab,” he wrote.

And that was how I brought home my first author-signed copy and ended up apologising for it.

Chandrahas Choudhury was in town this evening for his new book Arzee the Dwarf. Zainab said to say hi. She knew him? Well yes, through the Mumbai blogger circuit. I joined the queue and, when my turn came, offered a reminder of our brief meeting in Manipal last year. “Of course,” he said. “Where’s Zainab? I’m going to write this out to her too.”

“To Kiran and Zainab,” he wrote.

This post intentionally left blank

There was going to be a post here, but my browser ate it up and I’m now too mad to be writing it again.

Some things have incredibly steep learning curves, but we struggle over them anyway, because on the other side of the curve we get our *-fu master black belt. We go through life collecting and exhibiting our belts. Every once in a while we come across someone with a belt that makes us envious, that won’t get off our minds, and yet, when it comes to facing that curve ourselves, it no longer seems worth it. Why is that?

Rank

“Wait here,” said Srinivas, and disappeared from view before I could turn around.

Behind me, vehicles honked as they approached the narrow intersection. I pushed the bike to the edge of the road, parked, and swung the backpack over to my back. Where had he gone? The building behind me looked busy. I walked over and looked up the steps into the corridor. No sign of him.

The guard rattled his cane and said “What do you want?” Something about his tone put me off. I hate it when people question the authority on which one exists as they do. I was standing on a public road where I had every right to stand. What was his problem? And where was Srinivas?

“This is a ladies hostel,” he said. “Go away from here.” I looked up again and noticed for the first time that every one of the persons entering and exiting the building was female. This was somehow supposed to be my fault? Who did he think I was, a college romeo? The backpack! Did he… oh dear… really think I was a student?

“I am thirty years old,” I wanted to say, “and married.” Why should I care that this is a ladies hostel? But damn it, he didn’t deserve to know that. What business was it of his? I had had my share of being lorded over by petty officials back in my school days. I was going to have none of it now. I was not going to be sorry for who I was just because some two bit minimum-wage guard had an inflated sense of his own importance.

Who did he think I was? My mother had been a founding principal of one of their schools, and had run it for ten years. I had grown up riding down this very road through their gates to pick her up every evening. I would park my bike in the staff parking area and walk into the principal’s office, unchecked. And now, I was the suspicious character? The gall of it!

I said nothing. How was I to compress all that into a single, coherent statement? One that said, in addition, that while I had nothing against him personally, he ought to know better than to insult someone with such impeccable credentials? That if he dared make a move, I was perfectly capable of pulling rank?

He continued glaring at me. I shrugged and walked back to the bike, pretending not to have noticed. Srinivas returned several minutes later and announced that there may be some houses in the next block. I wanted to tell him of what this place meant to me, nay, of what I meant to this place. The ego had to be soothed. But I said nothing, and we resumed our house search.

(Part of a writing practice series.)

Charting languages

Guillaume Marceau, who made a guest post here on how to make comparison charts, has an excellent demonstration of this technique over on his blog, charting performance against code verbosity in programming languages:

The speed, size and dependability of programming languages

If you drew the benchmark results on an XY chart you could name the four corners. The fast but verbose languages would cluster at the top left. Let’s call them system languages. The elegantly concise but sluggish languages would cluster at the bottom right. Let’s call them script languages. On the top right you would find the obsolete languages. That is, languages which have since been outclassed by newer languages, unless they offer some quirky attraction that is not captured by the data here. And finally, in the bottom left corner you would find probably nothing, since this is the space of the ideal language, the one which is at the same time fast and short and a joy to use.

Two Bits in EPub format

Chris Kelty’s book, Two Bits: The Cultural Significance of Free Software, is available as a free download in PDF and HTML formats. Neither version is suitable for reading on a handheld or ebook reader, so I’ve made an EPub version. EPub is an XHTML-based, self-contained, single-file document format. Download here. This is how I made it from the HTML version:

Read on...

Hypothesis: edit wars have a lot of edits, but few editors

This is a post by guest blogger Guillaume Marceau from http://gmarceau.qc.ca/blog/

Chart of edit wars over the Wikipedia article on evolution

The essence of making charts is to exploit our eyes’ fantastic ability to compare amounts across a page. In fact, our eyes are so good at this kind of comparison that a chart will often read better if two data sets are placed next to each other (aligned with a common axis) rather than overlaid. Edward Tufte popularized this idea in his book Envisioning Information by calling it a ‘small multiples’ design. This simple idea is surprisingly versatile. For instance, it can inform the design of a user interface.

A well-designed chart should read like a book so that it may become an integral part of the narrative in support of the argument. In the chart above, the labels are arranged in normal reading order, from left to right, top to bottom. As we read, we understand the nature of the computation that was applied, and the significance of the three humps in the last chart.

Changes committed to Zine upstream

I’ve pushed my patches to Zine to the upstream repository, making for a total of 45 changesets with 106 changes to 50 files (as per Mercurial). The main repository is here.

If you’ve been working with my repository though, you should stick with it because there is one fundamental incompatibility. My version stores the extra field on Post and User objects as JSON. The main repo stores it as a Python pickle. Armin and I have agreed that while JSON is the way forward, we should switch via a database migration framework, which Zine doesn’t have yet.

Until that framework is in place and the merge happens, I’ll continue to sync changesets across the two repositories.

Editors and edits

Hans and I met up this evening to discuss the moving average data I had collected. “But what about editors?” he asked. So I extended it to get that too. Here’s the data. (Hat tip to Vaibhav Bhawsar, who also pointed that out.)

This chart is a visual mess. It’s also close to the limit of how much data I can pass to the Google Chart API, so I’ll need a better system in place for the next round, something that allows zooming in for closer analysis. For what it’s worth, here are the key things about this chart:

  1. The blue Moving Window line is now the sum of the preceding seven-day period, not the average.
  2. The dark gray Editors in Window line is the number of unique editors within each window.
  3. The y-axis labels are off by a little bit. I can’t figure out why they are not properly calibrated.
  4. Edit Count and Editor Count hug each other closely, but have clearly visible differences in the moving window.
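The two window lines described in the list above can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the chart: the function name `window_stats`, the `(day, editor)` input shape, the sample data, and the choice to count the end day as part of the seven-day window are all assumptions.

```python
from datetime import date, timedelta


def window_stats(edits, end_day, days=7):
    """Return (edits_in_window, unique_editors_in_window).

    `edits` is an iterable of (day, editor) pairs. The window covers
    `days` calendar days ending on `end_day`, inclusive (an assumption).
    """
    start = end_day - timedelta(days=days - 1)
    editors = [editor for day, editor in edits if start <= day <= end_day]
    return len(editors), len(set(editors))


# Hypothetical sample data, not taken from the Evolution page history.
edits = [
    (date(2008, 8, 10), "carol"),  # falls outside the window below
    (date(2008, 8, 20), "alice"),
    (date(2008, 8, 24), "bob"),
    (date(2008, 8, 25), "alice"),
    (date(2008, 8, 25), "alice"),
]

print(window_stats(edits, date(2008, 8, 25)))  # (4, 2)
```

Four edits by two editors in one window is the kind of gap between the blue and dark gray lines that the chart is meant to expose.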

Edit history of Evolution on Wikipedia

If you look at the three biggest blue peaks, the first on June 11 (54 edits) and the third on March 1 (59) have a large number of editors (27 and 24), while the second, on August 25 (50), has only 11 editors. You may recall from the last graph that August 25 was the day of the highest number of edits in the year.

Hans thinks that if we render a scatter graph plotting 1/edits-in-window for the x-axis and editors-in-window/edits-in-window for the y-axis, the first and third peaks will show up close to (0,1).
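The coordinates Hans proposes are easy to compute for the three peaks quoted above. This is only a sketch of the arithmetic. The helper name `scatter_point` is hypothetical, and the input is just the three (edits, editors) pairs mentioned in this post, not the full data set.

```python
def scatter_point(edits_in_window, editors_in_window):
    """Map one window to the proposed scatter coordinates.

    x = 1 / edits-in-window
    y = editors-in-window / edits-in-window
    """
    return 1 / edits_in_window, editors_in_window / edits_in_window


# (edits, editors) for the three biggest peaks, as quoted in the text.
peaks = {
    "June 11": (54, 27),
    "August 25": (50, 11),
    "March 1": (59, 24),
}

for label, (edit_count, editor_count) in peaks.items():
    x, y = scatter_point(edit_count, editor_count)
    print(f"{label}: x={x:.3f}, y={y:.2f}")
```

A window with many edits always lands near x = 0. The y value is then the share of distinct editors per edit, so a busy window driven by a handful of people sits lower on the chart than one with broad participation.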

Assuming this works as predicted, we’re close to building a first-level user-facing analysis tool: give it a page and a date range, and it’ll tell you approximately when there was an edit war, for closer inspection using content analysis.

Detecting reverts in Wikipedia

Sirtaj Singh Kang asked on Twitter:

Are you able to detect reverts? That would be able to help show edit wars. Perhaps a graph of edit/revert avg?

Short answer: no and yes. Reverts are not marked in the database. They can be detected, but the effort is non-trivial.

MediaWiki has for some time shown a “rollback” link on the latest revision in the history, visible to logged-in users. If you click it, MediaWiki will do the following:

  1. Make a new revision of the page copying the content of the revision before the last.
  2. Annotate this new revision as a minor change, with the comment “Reverted edits by (user) to last revision by (earlier user)”.

It’s easy to detect this text pattern and consider that a revert, but not everyone uses the rollback link. From the Evolution page’s history, here’s another revert string: “Reverting possible vandalism by When nine to version by Hubrid Noxx. False positive? Report it. Thanks, ClueBot. (677946) (Bot)”. This string comes from ClueBot. Other admins could use something else. Some may be operating entirely unassisted: if they see two or more incidents of vandalism (thereby making the rollback shortcut unusable), they’ll go back to the last clean edit and save it again. What’ll they put in the change comment field? I don’t know. I can’t guess reverts from human language.

By looking at the revision comments, one could:

  1. Get a reasonable picture of reverts using known automated tools,
  2. Completely miss reverts that didn’t use those tools, and
  3. Pick up false positives from vandals claiming to have performed a revert when they didn’t.

The last one makes this particularly vulnerable to gaming, should a tool built around this method become key to fighting vandalism.
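The comment-matching approach can be sketched with two regular expressions, one per string quoted above. This is a rough illustration under stated assumptions: the patterns are inferred from those two examples only, the names `PATTERNS` and `detect_revert` are hypothetical, and real comments may carry extra markup that these patterns would not handle.

```python
import re

# Patterns inferred from the two comment strings quoted in this post.
PATTERNS = [
    ("rollback", re.compile(
        r"^Reverted edits by (?P<reverted>.+?) "
        r"to last revision by (?P<restored>.+?)\.?$")),
    ("cluebot", re.compile(
        r"^Reverting possible vandalism by (?P<reverted>.+?) "
        r"to version by (?P<restored>.+?)\. False positive\?")),
]


def detect_revert(comment):
    """Return (tool, reverted_user, restored_user), or None if no match."""
    for tool, pattern in PATTERNS:
        match = pattern.search(comment)
        if match:
            return tool, match.group("reverted"), match.group("restored")
    return None


cluebot_comment = (
    "Reverting possible vandalism by When nine to version by Hubrid Noxx. "
    "False positive? Report it. Thanks, ClueBot. (677946) (Bot)"
)
print(detect_revert(cluebot_comment))
print(detect_revert("fixed a typo"))  # None
```

Anything typed by hand falls through to `None`, and a vandal can type either string verbatim, which is exactly the pair of weaknesses listed above.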

There is another approach: analysing the text of each revision. If a given revision differs from the previous revision but matches the one before that (or one a revision or two further back), you have a revert. Some forms of vandalism involve a complete replacement of the page’s contents. Others are just an additional phrase or link. Revert detection could be on the basis of either substantial similarity to an earlier revision, or of being exactly the same.
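The exact-match variant of this idea can be sketched by hashing each revision and comparing digests. This is a minimal sketch, assuming the revision texts are already in memory in chronological order. The function name `find_reverts` and the `lookback` parameter are hypothetical, and the sample revisions are made up.

```python
import hashlib


def find_reverts(revisions, lookback=3):
    """Return (revert_index, restored_index) pairs.

    A revision counts as a revert when its text differs from the
    immediately preceding revision but is identical to one of the
    `lookback` revisions before that.
    """
    digests = [hashlib.sha1(text.encode("utf-8")).hexdigest()
               for text in revisions]
    reverts = []
    for i, digest in enumerate(digests):
        if i == 0 or digest == digests[i - 1]:
            continue  # first revision, or no change from the previous one
        # Walk backwards from two revisions ago, nearest first.
        for j in range(i - 2, max(i - 2 - lookback, -1), -1):
            if digests[j] == digest:
                reverts.append((i, j))
                break
    return reverts


# Hypothetical page history: two acts of vandalism, each undone.
history = ["A", "A spam", "A", "B", "B x", "B y", "B"]
print(find_reverts(history))  # [(2, 0), (6, 3)]
```

Comparing digests keeps memory use small once each text has been hashed. The "substantial similarity" variant would need the texts themselves, for instance with `difflib.SequenceMatcher(None, a, b).ratio()` and a threshold chosen by experiment.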

That’s the theory. In practice, pulling the full text of each revision from Wikipedia is painfully slow. I’ve tried this already, looking for what words were added and removed with each revision. Here’s the code. One month of revisions for Evolution takes about ten minutes on my connection. A longer analysis on multiple pages could easily overwhelm resources.

Wikimedia puts out database dumps for download, for exactly this sort of analysis. In this case, the dump we want, enwiki-latest-pages-meta-history.xml.bz2 from this folder, is a cool 147.7G, heavily compressed. Even if I had the bandwidth to download it in anywhere under a month, dealing with a dataset that large will take an entirely different order of code-chops. But I’d like to get there. :)
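Reading such a dump does not require decompressing it to disk first: Python can stream a bz2 file through an incremental XML parser. The following is a sketch only, and it does nothing about the download size or total processing time. It assumes the usual dump layout of `page` elements containing a `title` followed by `revision` elements. The function names, the sample document, and its namespace URI are illustrative.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, so any export schema version works."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream, wanted_title):
    """Yield (timestamp, comment, text) for each revision of one page."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            if title == wanted_title:
                fields = {local(c.tag): (c.text or "") for c in elem}
                yield (fields.get("timestamp", ""),
                       fields.get("comment", ""),
                       fields.get("text", ""))
            elem.clear()  # release revision text we no longer need
        elif name == "page":
            elem.clear()


# Tiny stand-in for the real dump; the namespace URI is illustrative.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Evolution</title>
    <revision><timestamp>2008-08-25T10:00:00Z</timestamp>
      <comment>hypothetical edit</comment><text>first</text></revision>
  </page>
  <page><title>Other</title>
    <revision><timestamp>2008-08-26T10:00:00Z</timestamp>
      <text>unrelated</text></revision>
  </page>
</mediawiki>"""

# bz2.open accepts a filename or an existing file object.
with bz2.open(io.BytesIO(bz2.compress(SAMPLE))) as dump:
    for timestamp, comment, text in iter_revisions(dump, "Evolution"):
        print(timestamp, comment, text)
```

With the real file, the `io.BytesIO(...)` argument would be replaced by the path to the downloaded dump. Cleared elements still accumulate as empty shells under the root, so a full run over the whole dump would need further pruning.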

For now, the hit-or-miss fuzzy comment match is what we have to live with.

Needless to say, my use of the term “vandalism” paints this as far more black-and-white, us-versus-them than it really is. I use it only to simplify the analytical viewpoint.