Entries tagged “python”

Optional unique fields with MongoDB

I’m trying out MongoDB for a new project and am having an interesting time wrapping my head around the way it works. I’ll post notes to this blog whenever there’s something particularly interesting. Today’s is one such.

MongoDB is a schema-free key:value store, where the value is a JSON document. It’s the leading contender in the NoSQL movement. JSON documents can be of arbitrary depth, just like XML, storing anything from a single snippet to an entire database. This flexibility makes me happy. Schemas that required tedious snowflake definitions in an RDBMS can be a single document in MongoDB.

Today’s eyebrow raiser, however, came from the way MongoDB handles indexing. Let me illustrate with a traditional SQL schema for user accounts. I want to handle three different scenarios for how user accounts come to exist. Users may sign up on the web, in which case they will have to pick a username. Users may be introduced via email (the way Posterous and TripIt work), in which case they have an email address but no username, or users may login the first time via OpenID, in which case neither username or email address is known at the time of account creation. An account may exist with any one of these three identifiers, but each is expected to be unique across the system. In SQLAlchemy:

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Unicode, Integer
Base = declarative_base()

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    username = Column(Unicode(50), nullable=True, unique=True)
    email = Column(Unicode(50), nullable=True, unique=True)
    openid = Column(Unicode(250), nullable=True, unique=True)

I have three columns here, all of which are unique but optional. From SQLite’s documentation:

The UNIQUE constraint causes an unique index to be created on the specified columns. All NULL values are considered different from each other and from all other values for the purpose of determining uniqueness, hence a UNIQUE column may contain multiple entries with the value of NULL.

See? That’s straightforward. If the field contains a value, it has to be unique. If there’s no value (ie, null), that’s okay too. Job done. I figured it would work the same way with MongoDB, so I whipped this up with MongoEngine, a nice ORMish wrapper that allows defining basic validation on models:

from mongoengine import *

class User(Document):
    username = StringField(required=False, unique=True)
    email = StringField(required=False, unique=True)
    openid = StringField(required=False, unique=True)

This should work, right? It doesn’t. Behind the scenes, MongoEngine converts this model into a MongoDB index:

db.user.ensureIndex({username: 1}, {unique: true});
db.user.ensureIndex({email: 1}, {unique: true});
db.user.ensureIndex({openid: 1}, {unique: true});

MongoDB is schema-free, so these keys may or may not exist in the document, but here is what the documentation has to say about unique indexes:

When a document is saved to a collection with unique indexes, any missing indexed keys will be inserted with null values. Thus, it won’t be possible to insert multiple documents missing the same indexed key.

In MongoDB, null is also considered a unique value. Bummer. I finally came up with this solution:

from mongoengine import *

class User(Document):
    username = StringField(required=False)
    email = StringField(required=False)
    openid = StringField(required=False)

    meta = {
        'indexes': ['username', 'email', 'openid'],
        }

    def __repr__(self):
        return u"<User '%s'>" % (self.username or self.email or self.openid)

    def validate(self):
        super(User, self).validate()
        if not self.username and not self.email and not self.openid:
            raise ValidationError(u"One of 'username', 'email' or 'openid' must be provided.")

        # Check for uniqueness of username, email and openid.
        if self.username:
            existing = User.objects(username=self.username).first()
            if existing and existing.id != self.id:
                raise ValidationError(u"Username '%s' already in use." % self.username)
        if self.email:
            existing = User.objects(email=self.email).first()
            if existing and existing.id != self.id:
                raise ValidationError(u"Email '%s' already in use." % self.email)
        if self.openid:
            existing = User.objects(openid=self.openid).first()
            if existing and existing.id != self.id:
                raise ValidationError(u"OpenID '%s' already in use." % self.openid)

MongoEngine calls instance.validate() during a save(), so the constraints get applied. This isn’t particularly clean. The same check is repeated thrice, once for each field, and that’s up to three queries every time I want to save data. However, I don’t expect optional-but-unique to be a recurring pattern, and user accounts are infrequently edited, so this should be okay for now.

Analysing Wikipedia: caching data

I haven’t posted about Wikipedia in a while. Hans went to Ladakh right after I returned, so we’re only now getting around to analysing the data we collected in July.

Our biggest hassle with doing any kind of analysis is with how long it takes to retrieve data. A full text analysis of a few hundred revisions of a large page could easily take an hour to pull. If that analysis doesn’t produce satisfying results, attempting a variation requires pulling that data all over again, because we have no cache.

I use the mwclient library, which provides a thin wrapper representing MediaWiki queries as lazy (?) Python sequences. Since this sequence could be cached, I’ve been considering strategies (some of this assumes familiarity with mwclient):

  1. Implement a simple Python dictionary cache around the mwclient API, saving the query→result mapping as a pickled dump and consult that before hitting the servers again. This is easy, but since the sequences are lazy, the data isn’t available for caching until the code tries to access it. The cache has to intervene then. All my analysis code must now be written for two API’s, mwclient’s and my cache’s.

  2. Alternatively, do the same thing but as a patch to mwclient’s code so there’s a single external API. This requires understanding how it works and maintaining patches against upstream changes, which takes time away from analysis.

  3. Do it outside. Setup Squid or another caching proxy to cache everything regardless of HTTP headers. Make queries through this. Easy to setup, but grossly inefficient. Proxy servers understand request→response mapping, not sequences. If I ask for a subset of an earlier sequence, it’ll treat it as a new request. Sequences require special treatment:

    • There could be newer edits on the site, making the sequence’s beginning and end markers stale.

    • A new query may ask for overlapping results (typically, a query from a fixed starting point to current time). The cache should be able to join sequences instead of duplicating data.

    • A query may ask for the same time range as an earlier query, but with additional properties (typically, the full text of each revision). These additional properties should be merged into the cache.

  4. Drop this approach altogether and get a static dump of the Wikipedia database. But a full text dump of the entire revision history of the English Wikipedia is 150 terabytes. The resource requirements will take us out of the realm of a hobbyist project.

Given that data retrieval time has become a serious hobble, it seems worth tackling this head-on. A custom cache API could:

  1. Be sequence aware. Treat each MediaWiki article as being a sequence of unknown start and end, of which fragments are available in the cache. Join sequence fragments as data gaps are filled in, leading to one single sequence for the page’s entire revision history.

  2. Store additional properties on each revision. MediaWiki does not store diffs between revisions, but the cache could, since much full text analysis is based around the changes introduced by each revision. Properties could also be flags marking pages as, for example, vandalism, or the following reversion.

  3. Based on the above, store alternate sequences and properties specific to them. For example, a revision sequence of an article that skips all vandal/reversion revisions and stores edit diffs without them. Without this, an editor whose sole contribution was to revert vandalism will come out appearing to have added a lot of new material.

A web service implementing this API will, over time, be able to respond to queries in near real time, making it possible to build a web interface where anyone can submit a query. The public web interface is one of our eventual goals for this project.

I’ll post updates as I work out the technical architecture for this API. I’m considering using one of the newfangled key/value pair databases but have no experience with them. Recommendations are welcome.

Querying Wikipedia with mwclient

mwclient is a library for accessing the MediaWiki API from Python. MediaWiki powers Wikipedia and a bunch of other wikis. In this quick guide, we’ll look at how we can use mwclient to query any MediaWiki-powered site for the information we want.

Installing mwclient

As of this writing, the 0.6.2 release of mwclient does not include an installer and isn’t available in the Python Package Index, so installation is a bit of a chore. Grab the latest release from the downloads page; it should uncompress to reveal an mwclient folder. Copy this folder to your Python’s site-packages folder. If you don’t know where that is, type this at the command line:

python -c "from distutils.sysconfig import get_python_lib; print get_python_lib()"

The following locations are typical:

/usr/lib/python2.x/site-packages
/var/lib/python-support/python2.x
/Library/Python/2.x/site-packages

Launch Python and confirm it’s installed:

>>> import mwclient

If that didn’t raise any errors, congratulations! You’re all set to go.

Using mwclient

Here’s how you connect to Wikipedia and ask for revisions of the Wikipedia:Sandbox page:

>>> import mwclient
>>> from pprint import pprint
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Wikipedia:Sandbox']
>>> revisions = page.revisions()
>>> for counter in range(5):
...     rev = revisions.next()
...     pprint(rev)
... 
{'revid': 290932490,
 'timestamp': (2009, 5, 19, 12, 43, 13, 1, 139, -1),
 'user': 'Benlisquare'}
{'anon': '',
 'revid': 290930263,
 'timestamp': (2009, 5, 19, 12, 29, 23, 1, 139, -1),
 'user': '62.254.235.147'}
{'anon': '',
 'revid': 290930082,
 'timestamp': (2009, 5, 19, 12, 28, 16, 1, 139, -1),
 'user': '166.216.160.16'}
{'comment': 'Clearing the sandbox ([[WP:BOT|BOT]] EDIT)',
 'revid': 290927544,
 'timestamp': (2009, 5, 19, 12, 10, 6, 1, 139, -1),
 'user': 'SoxBot'}
{'anon': '',
 'revid': 290927187,
 'timestamp': (2009, 5, 19, 12, 7, 29, 1, 139, -1),
 'user': '62.254.235.147'}

Compare the output you get with the page’s revision history on Wikipedia. They should match.

Calling page.revisions() gives us a generator that returns revisions in reverse chronological order, with the most recent edit first. Each revision is a dictionary containing the keys you see above. The optional anon key indicates an anonymous edit; user then contains the editor’s IP address instead of user name. All keys and string values will be Unicode strings.

To get all edits between two dates in the forward direction, with the text content of each revision, do this:

>>> revisions = page.revisions(start='2009-05-19T00:00:00Z',
...                            end='2009-05-19T23:59:59Z',
...                            dir='newer',
...                            prop='ids|timestamp|flags|comment|user|content')

And here’s how to get all the edits of any given user. Let’s look at SoxBot from the revisions above:

>>> contribs = site.usercontributions(u'SoxBot')
>>> for counter in range(2):
...     rev = contribs.next()
...     pprint(rev)
... 
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 17244650,
 'revid': 290942689,
 'timestamp': (2009, 5, 19, 13, 44, 26, 1, 139, -1),
 'title': 'User talk:Twinzor',
 'top': '',
 'user': 'SoxBot'}
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 21352732,
 'revid': 290942678,
 'timestamp': (2009, 5, 19, 13, 44, 23, 1, 139, -1),
 'title': 'User talk:Turco85',
 'top': '',
 'user': 'SoxBot'}

Notes

  1. MediaWiki timestamp strings can be generated using "%Y-%m-%dT%H:%M:%SZ" as format string with Python’s datetime.strftime. All timestamps must be in UTC.

  2. You can pass a combination of parameters to page.revisions() to get revisions the way you want them. You can even skip the dates and call with startid or endid = any revision number (see revid in the output), to retrieve revisions before or after that one.

  3. To look at what parameters the page.revisions() and site.usercontributions() functions take, use Python’s built-in help browser:

    >>> help(page.revisions)
    >>> help(site.usercontributions)
    

Hope that’s enough to get you started. In subsequent posts I’ll explain how we can use this to analyse Wikipedia revision history.

.Net, Mono, Gtk# and IronPython on Mac OS X

Today’s development platform test was on Mono and IronPython with Gtk#. I was testing the feasibility of achieving three goals:

  1. Support both Windows and Linux with a single codebase, using either Microsoft’s .Net Framework or Mono.
  2. Provide a smooth learning curve for my existing .Net dev team.
  3. Use Python for rapid development, without resulting in disjoint codebases.

My current requirement is for several hundred Windows desktops that I’d rather be using a more maintainable operating system on. I’m bound to Windows by third party software that I must support, but don’t want my team’s own work to become a barrier to migration in future.

Mono doesn’t support Windows.Forms very well, so if I am to adopt it, the first task will be to replace WinForms with Gtk# in existing apps. For the evaluation today, I tried the most basic: getting Mono and Gtk# working, and maybe writing a “Hello, World” Gtk# app in Python. My primary environment is a Mac, which made this evaluation more interesting. Could I actually do my development on OS X and deploy on Windows and Linux? The Mono distribution for Mac OS X includes IronPython but not Gtk# (Cocoa# is bundled instead). Gtk# requires the Gtk+ library, which is not a standard component on OS X, and hence Mono’s understandable omission.

Compiling Gtk# involved jumping a few hoops. Basically: symlink the Mono C# compiler /usr/bin/mcs to /usr/local/bin/csc.exe, because Gtk#’s configure script assumes a Microsoft .Net Framework environment, and then edit all Makefiles to set RUNTIME = mono. That compiled it. To install, I had to create gtk-sharp-2.0 under /Library/Frameworks/Mono.framework/Versions/1.2.1/lib/mono and throw in symlinks to all the DLLs from their source location in ../gac.

I then tried this simple Gtk# example in C#:

using Gtk;
using System;

class Hello {

    static void Main()
    {
        Application.Init ();

        Window window = new Window ("helloworld");
        window.Show();

        Application.Run ();

    }
}

But it wouldn’t compile:

$ mcs HelloGtk.cs -pkg:gtk-sharp-2.0
Package gtk-sharp-2.0 was not found in the pkg-config search path.
Perhaps you should add the directory containing `gtk-sharp-2.0.pc'
to the PKG_CONFIG_PATH environment variable
No package 'gtk-sharp-2.0' found
error CS8027: Error running pkg-config. Check the above output.

Turned out there were two copies of pkg-config, one from Fink and one from Mono. The solution was to set a common environment variable:

export PKG_CONFIG_PATH=/sw/lib/pkgconfig\
:/usr/lib/pkgconfig\
:/Library/Frameworks/Mono.framework/Versions/1.2.1/lib/pkgconfig

The code compiled but still wouldn’t run:

$ mono HelloGtk.exe 

Unhandled Exception: System.DllNotFoundException: libgtk-x11-2.0.0.dylib
  at (wrapper managed-to-native) Gtk.Application:gtk_init (int&,intptr&)
  at Gtk.Application.Init () [0x00000] 
  at Hello.Main () [0x00000] 

Mono couldn’t find the Gtk+ libraries in /sw/lib. Fixing this was uglier; it required messing with the DYLD_LIBRARY_PATH environment variable:

$ export DYLD_LIBRARY_PATH=/sw/lib\
:/Library/Frameworks/Mono.framework/Versions/1.2.1/lib

HelloGtk.exe would run after this, but if I tried anything else, it failed complaining about missing libraries. I couldn’t find a solution. Anyway, I moved on Gtk# from Python, translating the code above line for line. It wouldn’t work. No known module named “Gtk”. Much head-scratching and Googling later, I learnt that IronPython needs a special invocation for .Net assemblies (that’s .Net speak for “libraries”). Here’s the code:

import clr
clr.AddReference('gtk-sharp')

import Gtk

Gtk.Application.Init()
window = Gtk.Window("helloworld")
window.Show()
Gtk.Application.Run()

Kushal has a slightly more elaborate example.

Generating combinations in Python

I recently wrote some code that required iterating through all combinations of a list of lists. I couldn’t find an existing function to do this, so wrote one:

def listcombinations(listoflists, curlist=[], parents=[]):
    """
    Generator that yields all possible combinations from a list of lists.

    >>> a = [[1, 2], [3, 4], [5, 6], [7, 8]]
    >>> for c in listcombinations(a): print c
    ...
    [1, 3, 5, 7]
    [1, 3, 5, 8]
    [1, 3, 6, 7]
    [1, 3, 6, 8]
    [1, 4, 5, 7]
    [1, 4, 5, 8]
    [1, 4, 6, 7]
    [1, 4, 6, 8]
    [2, 3, 5, 7]
    [2, 3, 5, 8]
    [2, 3, 6, 7]
    [2, 3, 6, 8]
    [2, 4, 5, 7]
    [2, 4, 5, 8]
    [2, 4, 6, 7]
    [2, 4, 6, 8]
    """
    if curlist == []:
        curlist = listoflists[0]
        remlist = listoflists[1:]
    else:
        remlist = listoflists
    for item in curlist:
        if len(remlist) > 0:
            for c in listcombinations(remlist[1:], remlist[0], parents+[item]):
                yield c
        else:
            yield parents+[item]

In the code above, the function returns combinations in a sequence such that each consecutive combination differs by only one item. My use case required consecutive combinations to be as unlike each other as possible. After mulling over it a bit, I figured I’d just randomise the entire set. This presented a new problem. On a production run, it turned out my list of lists summed up to over three million combinations. I only needed about ten and, neither picking the first ten nor generating the full list and picking a random ten was feasible: insufficiently distinct or took too long, respectively.

The solution? A lazy list instead of a generator:

class Combinations:
    """
    Holds all possible combinations of a given list of lists. Behaves like
    a lazy list.

    >>> a = [[1, 2], [3, 4], [5, 6], [7, 8]]
    >>> for c in Combinations(a): print c
    ...
    [1, 3, 5, 7]
    [2, 3, 5, 7]
    [1, 4, 5, 7]
    [2, 4, 5, 7]
    [1, 3, 6, 7]
    [2, 3, 6, 7]
    [1, 4, 6, 7]
    [2, 4, 6, 7]
    [1, 3, 5, 8]
    [2, 3, 5, 8]
    [1, 4, 5, 8]
    [2, 4, 5, 8]
    [1, 3, 6, 8]
    [2, 3, 6, 8]
    [1, 4, 6, 8]
    [2, 4, 6, 8]
    >>> print Combinations(a)[12]
    [1, 3, 6, 8]
    """
    def _product(self, sequence):
        "Returns product of items in sequence."
        if not sequence: return 0
        return reduce(lambda x,y: x*y, sequence)

    def __init__(self, listoflists):
        self.length = self._product([len(col) for col in listoflists])
        self.listoflists = listoflists

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if index >= self.length:
            raise IndexError, "Index out of range."
        r = []
        p = 1
        for col in range(len(self.listoflists)):
            i = int(index/p) % len(self.listoflists[col])
            r.append(self.listoflists[col][i])
            p *= len(self.listoflists[col])
        return r

Here’s the code packaged as a Python module.