Friday, July 2, 2010
Optional unique fields with MongoDB
I’m trying out MongoDB for a new project and am having an interesting time wrapping my head around the way it works. I’ll post notes to this blog whenever there’s something particularly interesting. Today’s is one such.
MongoDB is a schema-free key:value store, where the value is a JSON document. It’s the leading contender in the NoSQL movement. JSON documents can be of arbitrary depth, just like XML, storing anything from a single snippet to an entire database. This flexibility makes me happy. Schemas that required tedious snowflake definitions in an RDBMS can be a single document in MongoDB.
Today’s eyebrow raiser, however, came from the way MongoDB handles indexing. Let me illustrate with a traditional SQL schema for user accounts. I want to handle three different scenarios for how user accounts come to exist. Users may sign up on the web, in which case they will have to pick a username. Users may be introduced via email (the way Posterous and TripIt work), in which case they have an email address but no username, or users may login the first time via OpenID, in which case neither username or email address is known at the time of account creation. An account may exist with any one of these three identifiers, but each is expected to be unique across the system. In SQLAlchemy:
from sqlalchemy.ext.declarative import declarative_base from sqlalchemy import Column, Unicode, Integer Base = declarative_base() class User(Base): __tablename__ = 'user' id = Column(Integer, primary_key=True) username = Column(Unicode(50), nullable=True, unique=True) email = Column(Unicode(50), nullable=True, unique=True) openid = Column(Unicode(250), nullable=True, unique=True)
I have three columns here, all of which are unique but optional. From SQLite’s documentation:
The UNIQUE constraint causes an unique index to be created on the specified columns. All NULL values are considered different from each other and from all other values for the purpose of determining uniqueness, hence a UNIQUE column may contain multiple entries with the value of NULL.
See? That’s straightforward. If the field contains a value, it has to be unique. If there’s no value (ie, null), that’s okay too. Job done. I figured it would work the same way with MongoDB, so I whipped this up with MongoEngine, a nice ORMish wrapper that allows defining basic validation on models:
from mongoengine import * class User(Document): username = StringField(required=False, unique=True) email = StringField(required=False, unique=True) openid = StringField(required=False, unique=True)
This should work, right? It doesn’t. Behind the scenes, MongoEngine converts this model into a MongoDB index:
db.user.ensureIndex({username: 1}, {unique: true}); db.user.ensureIndex({email: 1}, {unique: true}); db.user.ensureIndex({openid: 1}, {unique: true});
MongoDB is schema-free, so these keys may or may not exist in the document, but here is what the documentation has to say about unique indexes:
When a document is saved to a collection with unique indexes, any missing indexed keys will be inserted with null values. Thus, it won’t be possible to insert multiple documents missing the same indexed key.
In MongoDB, null is also considered a unique value. Bummer. I finally came up with this solution:
from mongoengine import * class User(Document): username = StringField(required=False) email = StringField(required=False) openid = StringField(required=False) meta = { 'indexes': ['username', 'email', 'openid'], } def __repr__(self): return u"<User '%s'>" % (self.username or self.email or self.openid) def validate(self): super(User, self).validate() if not self.username and not self.email and not self.openid: raise ValidationError(u"One of 'username', 'email' or 'openid' must be provided.") # Check for uniqueness of username, email and openid. if self.username: existing = User.objects(username=self.username).first() if existing and existing.id != self.id: raise ValidationError(u"Username '%s' already in use." % self.username) if self.email: existing = User.objects(email=self.email).first() if existing and existing.id != self.id: raise ValidationError(u"Email '%s' already in use." % self.email) if self.openid: existing = User.objects(openid=self.openid).first() if existing and existing.id != self.id: raise ValidationError(u"OpenID '%s' already in use." % self.openid)
MongoEngine calls instance.validate() during a save(), so the constraints get applied. This isn’t particularly clean. The same check is repeated thrice, once for each field, and that’s up to three queries every time I want to save data. However, I don’t expect optional-but-unique to be a recurring pattern, and user accounts are infrequently edited, so this should be okay for now.