Entries in the Category “Geekery”

Optional unique fields with MongoDB

I’m trying out MongoDB for a new project and am having an interesting time wrapping my head around the way it works. I’ll post notes to this blog whenever there’s something particularly interesting. Today’s is one such.

MongoDB is a schema-free key:value store, where the value is a JSON document. It’s the leading contender in the NoSQL movement. JSON documents can be of arbitrary depth, just like XML, storing anything from a single snippet to an entire database. This flexibility makes me happy. Schemas that required tedious snowflake definitions in an RDBMS can be a single document in MongoDB.

Today’s eyebrow raiser, however, came from the way MongoDB handles indexing. Let me illustrate with a traditional SQL schema for user accounts. I want to handle three different scenarios for how user accounts come to exist. Users may sign up on the web, in which case they will have to pick a username. Users may be introduced via email (the way Posterous and TripIt work), in which case they have an email address but no username. Or users may log in the first time via OpenID, in which case neither username nor email address is known at the time of account creation. An account may exist with any one of these three identifiers, but each is expected to be unique across the system. In SQLAlchemy:

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Unicode, Integer
Base = declarative_base()

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    username = Column(Unicode(50), nullable=True, unique=True)
    email = Column(Unicode(50), nullable=True, unique=True)
    openid = Column(Unicode(250), nullable=True, unique=True)

I have three columns here, all of which are unique but optional. From SQLite’s documentation:

The UNIQUE constraint causes an unique index to be created on the specified columns. All NULL values are considered different from each other and from all other values for the purpose of determining uniqueness, hence a UNIQUE column may contain multiple entries with the value of NULL.
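To see that behaviour in action, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table mirrors the model above; the sample usernames are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user ("
    " id INTEGER PRIMARY KEY,"
    " username TEXT UNIQUE,"
    " email TEXT UNIQUE,"
    " openid TEXT UNIQUE)"
)

# Two accounts that both lack an email and an OpenID: the NULLs do not collide.
conn.execute("INSERT INTO user (username) VALUES (?)", ("alice",))
conn.execute("INSERT INTO user (username) VALUES (?)", ("bob",))

# A repeated non-NULL value is still rejected by the unique index.
try:
    conn.execute("INSERT INTO user (username) VALUES (?)", ("alice",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM user").fetchone()[0]
```

Both rows with NULL email and openid go in without complaint, while the second "alice" is refused.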

See? That’s straightforward. If the field contains a value, it has to be unique. If there’s no value (ie, null), that’s okay too. Job done. I figured it would work the same way with MongoDB, so I whipped this up with MongoEngine, a nice ORMish wrapper that allows defining basic validation on models:

from mongoengine import *

class User(Document):
    username = StringField(required=False, unique=True)
    email = StringField(required=False, unique=True)
    openid = StringField(required=False, unique=True)

This should work, right? It doesn’t. Behind the scenes, MongoEngine converts this model into three unique MongoDB indexes:

db.user.ensureIndex({username: 1}, {unique: true});
db.user.ensureIndex({email: 1}, {unique: true});
db.user.ensureIndex({openid: 1}, {unique: true});

MongoDB is schema-free, so these keys may or may not exist in the document, but here is what the documentation has to say about unique indexes:

When a document is saved to a collection with unique indexes, any missing indexed keys will be inserted with null values. Thus, it won’t be possible to insert multiple documents missing the same indexed key.

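In other words, MongoDB treats null as a value like any other when enforcing uniqueness, so only one document in the collection may be missing a given key. Bummer. I finally came up with this solution: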

from mongoengine import *

class User(Document):
    username = StringField(required=False)
    email = StringField(required=False)
    openid = StringField(required=False)

    meta = {
        'indexes': ['username', 'email', 'openid'],
        }

    def __repr__(self):
        return u"<User '%s'>" % (self.username or self.email or self.openid)

    def validate(self):
        super(User, self).validate()
        if not self.username and not self.email and not self.openid:
            raise ValidationError(u"One of 'username', 'email' or 'openid' must be provided.")

        # Check for uniqueness of username, email and openid.
        if self.username:
            existing = User.objects(username=self.username).first()
            if existing and existing.id != self.id:
                raise ValidationError(u"Username '%s' already in use." % self.username)
        if self.email:
            existing = User.objects(email=self.email).first()
            if existing and existing.id != self.id:
                raise ValidationError(u"Email '%s' already in use." % self.email)
        if self.openid:
            existing = User.objects(openid=self.openid).first()
            if existing and existing.id != self.id:
                raise ValidationError(u"OpenID '%s' already in use." % self.openid)

MongoEngine calls instance.validate() during a save(), so the constraints get applied. This isn’t particularly clean. The same check is repeated thrice, once for each field, and that’s up to three queries every time I want to save data. However, I don’t expect optional-but-unique to be a recurring pattern, and user accounts are infrequently edited, so this should be okay for now.
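If the repetition ever bothers me, the three checks could probably collapse into one loop over the field names. Here is a rough sketch of the idea in plain Python, independent of MongoEngine. The find_conflict helper, the lookup callable and the in-memory stored list are all stand-ins invented for this example. In the real model the lookup would presumably be something like User.objects(**{field: value}).first().

```python
UNIQUE_FIELDS = ("username", "email", "openid")


def find_conflict(record, lookup, fields=UNIQUE_FIELDS):
    """Return the first field whose value is already used by another record.

    `record` is a dict with an optional "id"; `lookup(field, value)` returns
    an existing record or None. Both are stand-ins for this sketch.
    """
    for field in fields:
        value = record.get(field)
        if not value:
            continue  # missing or empty: the field is optional, nothing to check
        existing = lookup(field, value)
        if existing is not None and existing.get("id") != record.get("id"):
            return field
    return None


# In-memory stand-in for the collection.
stored = [
    {"id": 1, "username": "alice"},
    {"id": 2, "email": "bob@example.com"},
]


def lookup(field, value):
    return next((r for r in stored if r.get(field) == value), None)


new_conflict = find_conflict({"id": 3, "username": "alice"}, lookup)
self_conflict = find_conflict({"id": 1, "username": "alice"}, lookup)
no_conflict = find_conflict({"id": 4, "openid": "https://example.com/carol"}, lookup)
```

This still costs one query per populated field, and like the original it is a check-then-save, so two simultaneous signups could in principle slip past it.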

Fontconfig and Chromium

Thanks to this explanation from Evan Martin, I’ve finally figured out how to get fonts looking right in Chromium on Linux. The crux of the problem is that there are two font configuration systems. UI widgets read settings from one, the panel at System → Preferences → Appearance, while the browser area reads from fontconfig.

I prefer the appearance of full sub-pixel hinting on Ubuntu, but this setting, selected from the preferences panel, wasn’t being propagated to fontconfig, so web pages continued to have blurry text. The fix? Here is my new ~/.fonts.conf:

<?xml version='1.0'?>
<!DOCTYPE fontconfig SYSTEM 'fonts.dtd'>
<fontconfig>
    <match target="font">
        <edit mode="assign" name="hintstyle">
            <const>hintfull</const>
        </edit>
        <edit mode="assign" name="rgba">
            <const>rgb</const>
        </edit>
    </match>
</fontconfig>

Evan proposes a test for what hintstyle fontconfig is using:

$ fc-match -v Arial | grep hintstyle
        hintstyle: 3(i)(w)

The hintstyle of “3” indicates full hinting. Ubuntu’s default, slight hinting, would be “1”.
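If you want to check this from a script, the number can be pulled out of the fc-match output and mapped back to fontconfig's constant names. A small sketch, where parse_hintstyle is a hypothetical helper of my own and the sample string is the output shown above:

```python
import re

# Fontconfig's hintstyle constants, in numeric order.
HINTSTYLES = {0: "hintnone", 1: "hintslight", 2: "hintmedium", 3: "hintfull"}


def parse_hintstyle(fc_match_output):
    """Return the hintstyle name found in `fc-match -v` output, or None."""
    match = re.search(r"hintstyle:\s*(\d+)", fc_match_output)
    if match is None:
        return None
    return HINTSTYLES.get(int(match.group(1)))


sample = "        hintstyle: 3(i)(w)"
result = parse_hintstyle(sample)
```

To run it against the live configuration, feed it the text from subprocess.check_output(["fc-match", "-v", "Arial"], text=True).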

Open source as infrastructure

On the last day of FOSS.in 2009, some of us gathered in the speakers’ hotel to hang on to that sense of wonder for just a bit longer. Ramkumar Ramachandra and I ended up discussing open source philosophy late into the night. Ram’s consolidated his thoughts from that evening into a pair of posts on open source as infrastructure and community and business interaction. Both were posted earlier this month, but I somehow missed them.

My own understanding of the infrastructure angle to open source comes from Doc Searls’s writing around 2001. Doc has a more recent write-up on understanding infrastructure (Apr 2008) that’s well worth reading.

Netbook theme for Ubuntu

Upgraded to Karmic last night. The refresh of the Human theme is quite nice, but the bright orange icons no longer work, so I made a quick remix. Download:

Both versions are designed for 1024×600 netbook screens. For best results, you should also install maximus and window-picker-applet, and set up a single panel at top containing the applet.

Installation

Go to System → Preferences → Appearance and install from there, or better, extract the tarball to /usr/share/themes as root. The latter will get it to work for system applications too.

Unicode precomposition and decomposition

As a result of recent Mac troubles, I moved my iTunes library to a Linux file server and set up iTunes on my old TiBook to access the library over an AFP share using netatalk.

This worked unexpectedly well, until I noticed something very odd: I could no longer access any file whose name contained an accented character such as “é”. These files showed up in directory listings but were not readable. The filesystem complained that the file just did not exist. After a whole evening lost trying to find fault with everything from Mac OS X to netatalk, I found myself in unfamiliar Unicode territory:

It turns out there are two ways to represent certain accented characters such as “é” in Unicode, either using unique code points (U+00E9, “latin small letter e with acute”) or using a regular ASCII character “e” with a combining diacritical mark (U+0065, “latin small letter e” followed by U+0301, “combining acute accent”). The first form is known as “precomposed” and is the standard for filenames on Linux, while the latter “decomposed” form is standard on Mac OS X.

The Mac approach is unusual but has the advantage of making accent insensitive search easier. A string search for “cafe” will also match “café” because the last character is really two; “cafeteria” can match for “caféteria” if one simply strips out diacritical marks. Doing this with precomposed strings is much harder. (Thanks to @deepakg for identifying this.)
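The two forms are easy to poke at with Python's unicodedata module. This is a minimal sketch in Python 3, where strings are Unicode by default. The strip_accents helper is my own naming, not a library function.

```python
import unicodedata

precomposed = "caf\u00e9"                               # é as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # e followed by U+0301


def strip_accents(text):
    """Decompose, then drop the combining marks."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))


# The two spellings look identical on screen but are different strings.
lengths = (len(precomposed), len(decomposed))
same_string = precomposed == decomposed
# A plain substring search for "cafe" only matches the decomposed form.
plain_search = ("cafe" in precomposed, "cafe" in decomposed)
stripped = strip_accents("caf\u00e9teria")
```

Normalising with "NFC" goes the other way, back to the precomposed form that Linux filenames normally use.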

Mac OS X enforces the decomposed form for filenames, but Linux doesn’t. On Linux, precomposed UTF-8 is expected but not enforced. The netatalk AFP server recognises this difference and transparently translates filenames between what it calls UTF8 and UTF8-MAC. This is where I ran into trouble. I had transferred my files using rsync and ended up with decomposed filenames on Linux. These showed up fine over AFP, but when Mac OS X attempted accessing them, netatalk did the transparent translation to precomposed names and could no longer find the files. The solution? Rename all files on the Linux side:

convmv -r -f utf8 -t utf8 --nfc --notest ./*

And in future, when rsyncing files from Mac OS X to Linux, ask it to translate the filenames with this additional option (reversed for Linux to Mac OS X):

rsync --iconv=UTF8-MAC,UTF8

Dead Mac

My Mac’s display died without warning one day last month. I was using it when blocks of randomly coloured pixels appeared on screen, obscuring the display. Rebooting didn’t fix it, nor did turning the power off to let it cool several hours. One of the internal fans had flaked out earlier and was in the habit of refusing to spin up every once in a while, so I suspect a burnt chip from overheating. I can no longer see enough of the display to boot up and login, but the machine continues to be fully functional when accessed over the network.

Being occupied with other things, I put the machine aside a few weeks and finally opened it last weekend to take a look inside. The right side fan was dead. It appears to have lost its magnetic charge as a result of my previous attempts to clean it and no longer spins comfortably when flicked with the finger. I couldn’t tell what was wrong with the graphics, however, so decided to call Apple Support. The machine is a little over three years old and well over any sort of warranty period.

Mac Guts

Apple’s phone support directed me to the Ample Imagine store at Forum Mall in Koramangala. They said they’d have a look and tell me if it was fixable, but would charge Rs 750 for the inspection. I agreed and left my Mac with them. They called back yesterday and said they’d have to replace the logic board at the cost of Rs 35,000. Given that this is nearly half the price of a new Mac, I decided to save my money and use the machine as a display-less network server. This should have been the end of the affair, until I went in today to pick it up and noticed the job sheet:

The engineer’s comments said he had tried resetting the PRAM and connected an external display, but since that didn’t fix it, he had decided it was a logic board problem and suggested I get a new one.

What? That’s it? A diagnosis costing Rs 827 (750 + taxes) without even opening the machine? For all I knew, some chip could have had its soldering melted and come loose because of the overheating. It could even be just a loose connector on the board. Who trains these guys?

Now, there’s something to be said about this particular model. My Mac is a first generation Intel and unlike all Apple laptops that came before and after it, this one is not user upgradeable. It can’t be opened without literally cracking open the case, a process which leaves visible scars in the front, below the trackpad. I went through this process two years ago when upgrading the hard disk and spent over an hour gently tugging and wriggling a screwdriver to pry it open. An Apple engineer not aware of this history should have called me to confirm he could do this because of the risk involved, but no one called. There’s no way anyone could have opened it and failed to record that in the job sheet. They quite certainly didn’t.

This incompetence is appalling. I feel like I’ve been scammed of my money. These engineers seem to be trained to make diagnoses for machines within warranty, but not for anything requiring a real examination. Dear Apple: if you want to be a serious contender in India, you had better get your act together.

And for what it’s worth, I’m now on my own trying to get this fixed. I suppose I could start ordering parts off eBay and try my luck with guessing exactly what is broken, but it would help to have (a) real expert diagnosis and (b) a way to avoid wrangling with Indian customs when importing parts.

Do you know anyone I should be talking to? Or, know anyone with a dead MacBook Pro of the same period (Intel; pre-Unibody) who’d be willing to palm it off to me for spare parts?

I’m not as badly off as I could have been because I made a serious habit a couple of years ago of backing up everything, including having backup machines (currently an ASUS Eee PC 1005HA running Ubuntu Jaunty), but the machine’s absence is clearly felt, and I don’t have the budget for a new Mac until next year.

Update: As of January 2010, the Mac is working again. I got a new (used) logic board off eBay and opened the Mac to replace it, then figured I should test assembly on the old board first, just in case I damage anything. Surprise, surprise, the old board worked again! The new board however didn’t, so I sent it back and got a refund.

Disabling the alarm on APC UPSes

UPS Alarm
From the wonderful Fly, You Fools! webcomic by Saad Akhtar. Read the full strip.

You know what I mean. What were they thinking? Here’s a helpful explanation by an APC employee:

I understand your concern with not wanting to be woken up at 2am to be alerted that power has gone out in your residence. I use the software at home to disable the audible tone as well, however, I think taking a look at it from a different approach may be ideal. Is the UPS your source of power for your alarm clock in the morning? What would occur if you were to have to wake up at a specific time during the week, and your alarm clock, which is not powered by your UPS, powers off due to a blackout, even if it is momentary? I think it would be ideal in this scenario that the UPS wakes you to notify you of a power failure. That would allow you to possibly find an alternate source of power for the alarm clock, or, if power is to be restored within a reasonable period of time, to reset your clock so that you wake up on time.

Right. That’s why. That horrible shriek is meant to wake you up. If, like all real people, you have an alarm clock that runs on batteries and prefer a full night’s sleep, it turns out that you can disable it. This works on most common APC UPS models with the USB cable. Windows users should install APC’s PowerChute software. It apparently has an option somewhere to turn it off. On Linux, the apcupsd package will do it for you (make sure to plug in the USB cable first):

Read on...

QWERTY-be-gone

Much of the debate around modern mobile handsets is around the text entry mechanism. If you’ve gotten used to a device with a QWERTY pad, will you be able to go back to T9? How can anyone touch type on a device with an on screen keyboard? Will haptic feedback make them just as usable?

All these debates (except around T9) make one fundamental assumption: keypads can only use the QWERTY layout. This is where one must take exception. QWERTY is a 140 year old standard with a seemingly random layout of letters that were arranged to avoid mechanical jamming in the technology of the 1870s. Generations of typists have grown up with memories of their baffling first encounter with the layout, something they had to learn because that’s the way it’s always been done.

To shrink that same layout down to under three inches and call it state of the art is just bizarre. Keyboards are from an era where the technology of the day demanded a two dimensional layout. We’re no longer constrained by that technology. Look at your fingers, folks. Look at how amazingly dextrous each of them is, how capable of independent movement each is. Look at how you can hold a pen to paper and make coordinated muscle movements across fingers to write. Keyboards take no advantage of this ability. A keyboard is a flat, rectangular layout with a key for each symbol, where your only possible interaction with that key is to press it down.

Rather than shrinking that rectangular layout, why not change the possible interactions with each key? How about if you could both press down and up? What if you could record interaction with every joint in your finger, instead of just the tip?

Chorded keyboards and keyers have been around for decades, but have failed to gain mainstream acceptance because (a) there’s a learning curve, and unlike the learning curve of QWERTY, there’s no incentive to scale it, and (b) as a result, there aren’t enough users to establish a standard from among the competitors.

But for the first time in the history of digital text communication, more people now use a phone than a regular computer as their primary means of communication. The time is ripe for QWERTY’s mobile successor to be born.

Seven and a half years of Evolution

To prepare our next analysis, I parsed the Evolution page’s entire revision history for individual words added and removed. The first available revision is from December 3, 2001, making that just about seven and a half years’ worth of revisions.

Here’s the raw data file, 4.8 MB bzipped, expanding to 76.4 MB. Content format: UTC Timestamp, Revision Id, User, Add/AddStems/Del/DelStems, List of words…

The data includes both words and their stems. The stems are calculated using the Porter stemmer, without semantic context (background reading). Letter case has been preserved since I have no means to distinguish between proper nouns and sentence-beginning capitalisation. To get the list of words, I start with the article’s raw text, strip it of HTML tags, tokenise it by alphanumeric characters to get a stream of words, and then diff that against the previous revision’s word stream (the same algorithm as diff -u on the command line). A displaced word will thereby show up as both added and deleted. The tokeniser isn’t perfect: the word “isn’t” will be broken up into “isn” and “t” since the apostrophe doesn’t count as alphanumeric. Suggestions on how to make a better one appreciated.
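For the curious, the tokenise-then-diff step looks roughly like this. It is a simplified sketch and not the linked code: it skips the HTML stripping and the stemming, and it swaps in Python's difflib.SequenceMatcher, which is not the same algorithm as command-line diff and may pair up words a little differently. The function names are made up for this example.

```python
import difflib
import re

# Runs of letters and digits; anything else (apostrophes included) is a separator.
WORD = re.compile(r"[^\W_]+")


def tokenise(text):
    return WORD.findall(text)


def word_changes(old_text, new_text):
    """Return (added, deleted) word lists between two revisions."""
    old_words, new_words = tokenise(old_text), tokenise(new_text)
    added, deleted = [], []
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    return added, deleted


added, deleted = word_changes("Evolution isn't random",
                              "Evolution is not random at all")
moved_added, moved_deleted = word_changes("a b c", "b c a")
```

The second call shows the displaced-word effect: "a" moves from the front to the back and is reported as both deleted and added.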

Here’s the code if you’d like to try this yourself. You’ll need the other modules in the folder, the NLTK library, and the mwclient library.

Analysis to follow.

Updated CIS website

I spent the last two weeks cleaning up the website for the Centre for Internet and Society. Check it out and let me know what you think.

Two Bits in EPub format

Chris Kelty’s book, Two Bits: The Cultural Significance of Free Software is available as a free download in PDF and HTML formats. Neither version is suitable for reading on a handheld or ebook reader, so I’ve made an EPub version. EPub is an XHTML-based self-contained single file document format. Download here. This is how I made it from the HTML version:

Read on...

Changes committed to Zine upstream

I’ve pushed my patches to Zine to the upstream repository, making for a total of 45 changesets with 106 changes to 50 files (as per Mercurial). The main repository is here.

If you’ve been working with my repository though, you should stick with it because there is one fundamental incompatibility. My version stores the extra field on Post and User objects as JSON. The main repo stores it as a Python pickle. Armin and I have agreed that while JSON is the way to go forward, we should switch via a database migration framework, which Zine doesn’t have yet.

Until that framework is in place and the merge happens, I’ll continue to sync changesets across the two repositories.

Querying Wikipedia with mwclient

mwclient is a library for accessing the MediaWiki API from Python. MediaWiki powers Wikipedia and a bunch of other wikis. In this quick guide, we’ll look at how we can use mwclient to query any MediaWiki-powered site for the information we want.

Installing mwclient

As of this writing, the 0.6.2 release of mwclient does not include an installer and isn’t available in the Python Package Index, so installation is a bit of a chore. Grab the latest release from the downloads page; it should uncompress to reveal an mwclient folder. Copy this folder to your Python’s site-packages folder. If you don’t know where that is, type this at the command line:

python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())"

The following locations are typical:

/usr/lib/python2.x/site-packages
/var/lib/python-support/python2.x
/Library/Python/2.x/site-packages

Launch Python and confirm it’s installed:

>>> import mwclient

If that didn’t raise any errors, congratulations! You’re all set to go.

Using mwclient

Here’s how you connect to Wikipedia and ask for revisions of the Wikipedia:Sandbox page:

>>> import mwclient
>>> from pprint import pprint
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Wikipedia:Sandbox']
>>> revisions = page.revisions()
>>> for counter in range(5):
...     rev = revisions.next()
...     pprint(rev)
... 
{'revid': 290932490,
 'timestamp': (2009, 5, 19, 12, 43, 13, 1, 139, -1),
 'user': 'Benlisquare'}
{'anon': '',
 'revid': 290930263,
 'timestamp': (2009, 5, 19, 12, 29, 23, 1, 139, -1),
 'user': '62.254.235.147'}
{'anon': '',
 'revid': 290930082,
 'timestamp': (2009, 5, 19, 12, 28, 16, 1, 139, -1),
 'user': '166.216.160.16'}
{'comment': 'Clearing the sandbox ([[WP:BOT|BOT]] EDIT)',
 'revid': 290927544,
 'timestamp': (2009, 5, 19, 12, 10, 6, 1, 139, -1),
 'user': 'SoxBot'}
{'anon': '',
 'revid': 290927187,
 'timestamp': (2009, 5, 19, 12, 7, 29, 1, 139, -1),
 'user': '62.254.235.147'}

Compare the output you get with the page’s revision history on Wikipedia. They should match.

Calling page.revisions() gives us a generator that returns revisions in reverse chronological order, with the most recent edit first. Each revision is a dictionary containing the keys you see above. The optional anon key indicates an anonymous edit; user then contains the editor’s IP address instead of user name. All keys and string values will be Unicode strings.

To get all edits between two dates in the forward direction, with the text content of each revision, do this:

>>> revisions = page.revisions(start='2009-05-19T00:00:00Z',
...                            end='2009-05-19T23:59:59Z',
...                            dir='newer',
...                            prop='ids|timestamp|flags|comment|user|content')

And here’s how to get all the edits of any given user. Let’s look at SoxBot from the revisions above:

>>> contribs = site.usercontributions(u'SoxBot')
>>> for counter in range(2):
...     rev = contribs.next()
...     pprint(rev)
... 
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 17244650,
 'revid': 290942689,
 'timestamp': (2009, 5, 19, 13, 44, 26, 1, 139, -1),
 'title': 'User talk:Twinzor',
 'top': '',
 'user': 'SoxBot'}
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 21352732,
 'revid': 290942678,
 'timestamp': (2009, 5, 19, 13, 44, 23, 1, 139, -1),
 'title': 'User talk:Turco85',
 'top': '',
 'user': 'SoxBot'}

Notes

  1. MediaWiki timestamp strings can be generated using "%Y-%m-%dT%H:%M:%SZ" as format string with Python’s datetime.strftime. All timestamps must be in UTC.

  2. You can pass a combination of parameters to page.revisions() to get revisions the way you want them. You can even skip the dates and call with startid or endid = any revision number (see revid in the output), to retrieve revisions before or after that one.

  3. To look at what parameters the page.revisions() and site.usercontributions() functions take, use Python’s built-in help browser:

    >>> help(page.revisions)
    >>> help(site.usercontributions)
    

Hope that’s enough to get you started. In subsequent posts I’ll explain how we can use this to analyse Wikipedia revision history.
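As a footnote to the first note above, here is one way the timestamp strings might be built. This is a sketch in Python 3 (the session above is Python 2), and mw_timestamp is a helper name I made up. Naive datetimes are assumed to already be in UTC.

```python
from datetime import datetime, timedelta, timezone

MW_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def mw_timestamp(dt):
    """Format a datetime as a MediaWiki timestamp string in UTC."""
    if dt.tzinfo is not None:
        # Timezone-aware values are converted to UTC first.
        dt = dt.astimezone(timezone.utc)
    return dt.strftime(MW_FORMAT)


start = mw_timestamp(datetime(2009, 5, 19, 0, 0, 0))
end = mw_timestamp(datetime(2009, 5, 19, 23, 59, 59))

# 05:30 in a UTC+05:30 zone is midnight UTC.
ist = timezone(timedelta(hours=5, minutes=30))
from_local = mw_timestamp(datetime(2009, 5, 19, 5, 30, tzinfo=ist))
```

The resulting strings can be passed as the start and end arguments to page.revisions(), as in the date-range example earlier.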

Remembering LiveJournal

Pradeep Gowda wrote to me on Twitter:

I was reading through your old journal entries circa 2000. Looks like you were a tweeter even back then .. ;)

Here’s a screenshot of my desktop from April 2000 (click for full size):

LoserJabber

See that text box with a submit button? That was LoserJabber (since renamed LogJam), the LiveJournal client for Linux. It was designed to sit in a corner of your screen so you could type into it every once in a while to describe what you were doing. That’s how everyone posted back then.

LiveJournal was the Twitter of ten years ago! Seriously, the number of things that came out of LiveJournal – the activity-oriented social graph, event sync, memcached, OpenID, threaded commenting, userpics – make it worthy of far more respect than it gets these days.

(Aside: yes, that’s GNOME 1.0 in that screenshot, and yes, it had brushed metal long before Mac OS X.)

On markup

I was looking up markup syntaxes recently and realised that one of the reasons I find writing in HTML tedious – apart from the verbosity – is that annotations are loaded up front. Consider this example dummy link pointing back at this post. If I were writing in HTML, I’d write:

Consider this <a href="/2009/05/15/on-markup">example dummy link</a>
pointing back at this post.

Contrast with popular markup languages, say Markdown (in two variations):

1. Consider this [example dummy link](/2009/05/15/on-markup)
   pointing back at this post.

2. Consider this [example dummy link][link] pointing back at this post.

[link]: /2009/05/15/on-markup

Or reStructuredText (two variations):

1. Consider this `example dummy link </2009/05/15/on-markup>`__
   pointing back at this post.

2. Consider this `example dummy link`__ pointing back at this post.

__ /2009/05/15/on-markup

Notice how in HTML, the link target, which is an annotation on the text, appears before the text itself? It forces you to break the flow of thought and switch from writing mode to editing mode.

It gets worse when you have more content between the opening and closing.

Some background: I’ve spent a few weeks working on Zine, the blog engine powering this blog. Zine supports multiple markup formats, with the ability to add more via a plugin API. This post is written in Markdown, older posts are in reStructuredText, while the posts imported from LiveJournal use a parser I contributed that converts LJ’s markup to HTML. Because parsers cannot be trusted to be time-efficient, they are called once and the results cached in the database alongside the original text. The cached version is shown when a page is rendered.
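The parse-once idea is simple enough to sketch. This is not Zine's actual code. The Post class, the toy_parser function and the parse_calls counter are all invented here to illustrate keeping the rendered HTML next to the source text.

```python
import html


def toy_parser(text):
    """Hypothetical stand-in for a markup parser: one <p> per blank-line block."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    return "".join("<p>%s</p>" % html.escape(b) for b in blocks)


class Post:
    """Keeps the original text and the cached HTML side by side."""

    def __init__(self, text, parser=toy_parser):
        self.parser = parser
        self.parse_calls = 0
        self.set_text(text)

    def set_text(self, text):
        # Parsing happens here, when the text changes, not when it is viewed.
        self.text = text
        self.rendered = self.parser(text)
        self.parse_calls += 1

    def render(self):
        return self.rendered


post = Post("Hello & welcome.\n\nSecond paragraph.")
first_view = post.render()
second_view = post.render()
```

However many times the page is viewed, the parser runs only when the text is saved.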

Most blogs also have a feature to show only part of a post on the index page and the full text on the post’s own page. In WordPress, this happens when the user places a <!--more--> tag in their text, while LiveJournal uses <lj-cut text="optional cut text">. Zine needs a way to handle this across markup parsers. It does it by implementing a derivative of HTML called, appropriately, Zine Extensible Markup Language, which is basically the same thing as HTML but with a new <intro> tag and some other small features. If you’d like to introduce with a paragraph, cut, then write the rest, it’ll be something like this in ZEML:

<intro>
  <p>
    This is my introductory paragraph.
  </p>
</intro>
<!-- At this point, Zine will cut the post with a Read More link. -->
<p>
  And this is the rest of the post.
</p>

But what if you have a long intro leading to an enticing cut-line? You’ll get:

<intro text="and now, unveil the mystery...">
  <p>
    Blah, blah, blah and blah
  </p>
  <p>
    ...
    ...
    Even more blah
    ...
    ...
    Blah up to a compelling lead-in.
  </p>
</intro>

Whoops! The line that should have been last in your writing sequence moved all the way up to the top.* This is like top-posting in email, but worse: you write your own lines in reverse order.

Consensus in the CMS industry seems to be heading towards discouraging HTML markup and WYSIWYG editors, ushering users instead towards markup languages and limited-ability WYSIWYM editors. I can see why.

* Zine doesn’t actually support the text attribute on the intro tag, but I may add it.

New blog

Another year, another migration. Unlikely to be the last ever, but such is life. I’ve had a great time hacking on Zine the last few weeks, so this has been very worthwhile. I’ve backed up my LiveJournal here. I don’t plan to call it quits there, but given how often I write anywhere, that doesn’t mean much. The moblog came too, but because LiveJournal doesn’t yet support exporting comments in communities, it came sans the commentary.