Archive for October 2003

Me heart Bangalore traffic.

In Bangalore this week

I will be in Bangalore from Sunday morning to the next Monday afternoon. This time, I’m on holiday (theoretically) and so my schedule is a lot more relaxed. As always, call if you want to meet up. How about a LiveJournal meeting on Sunday evening?

Knowledge vs. Skill

I’m starting to realise that the trump card for any job seeker is not the ability to do something, but the knowledge of what needs to be done.

We techies are used to thinking of knowledge only in terms of technology: choosing components for a Web server today so that it will stand up to heavy loads one year later, standardising on XML today as an interchange meta-format because as a human readable text format it’s guaranteed to remain usable forever, or hosting a project at SourceForge instead of a more reliable personal server because SourceForge will make the project more attractive to other developers.

What we don’t usually realise is just how incredibly huge this knowledge business is. Market surveys, business forecasts, news media, entire industries dedicated to understanding the what rather than the how.

Is the quality of Indian writing improving?

14 of the 22 most recent books I’ve bought are either written by Indian authors or are about India. This is a very unusual statistic for me. Either the quality of Indian writing has improved significantly in the last few years, or my reading preferences have changed.

Public, private and secret speech

For those of you facing existential angst regarding keeping a journal online, here's a brilliant post by Danny O’Brien on the difference between public, private and secret speech, and how private speech is fast disappearing on the Web.
In the real world, we have conversations in public, in private, and in secret. All three are quite separate. The public is what we say to a crowd; the private is what we chatter amongst ourselves, when free from the demands of the crowd; and the secret is what we keep from everyone but our confidant. Secrecy implies intrigue, implies you have something to hide. Being private doesn't. You can have a private gathering, but it isn't necessarily a secret. All these conversations have different implications, different tones.

Most people have, in the back of their mind, the belief that what they say to their friends, they would be happy to say in public, in the same words. It isn't true, and if you don't believe me, tape-record yourself talking to your friends one day, and then upload it to your website for the world to hear.

This is the trap that makes fly-on-the-wall documentaries and reality TV so entertaining. It's why politicians are so weirdly mannered, and why everyone gets a bit freaked out when the videocamera looms at the wedding. It's what makes a particular kind of gossip - the "I can't believe he said that!" - so virulent. No matter how constant a person you are, no matter how unwavering your beliefs, something you say in the private register will sound horrific, dismissive, egotistical or trite when blazoned on the front page of the Daily Mirror. This is the context that we are quoted out of.

But in the real world, private conversations stay private. Not because everyone is sworn to secrecy, but because their expression is ephemeral and contained to an audience. There are few secrets in private conversations; but in transmitting the information contained in the conversation, the register is subtly changed. I say to a journalist, "Look, Dave, err, frankly the guy is a bit, you know. Sheesh. He's just not the sort of person that we'd ever approve of hiring.". The journalist, filtering, prints, "Sources are said to disapprove of the appointment.".

Secrets have another register. They are serious (even when they are funny secrets). We are both implicated when we share a secret. We hide it from the world. Secrets don't change register - when they are out, they preserve their damaging style.

On the net, you have public, or you have secrets. The private intermediate sphere, with its careful buffering, is shattered. E-mails are forwarded verbatim. IRC transcripts, with throwaway comments, are preserved forever. You talk to your friends online, you talk to the world.

This is, incidentally, why people hate blogs so much. My God, people say, how can Livejournallers be so self-obsessed? Oh, Christ, is Xeni talking about LA art again? Why won't they all shut up?

The answer why they won't shut up is - they're not talking to you. They're talking in the private register of blogs, that confidential style between secret-and-public. And you found them via Google. They're having a bad day. They're writing for friends who are interested in their hobbies and their life. Meanwhile, you're standing fifty yards away with a sneer, a telephoto lens and a directional microphone. Who's obsessed now?
Via Clay Shirky.

Restlessness

I’ve been considerably restless for several days now, but I can’t make out what I’m so restless about. It’s as if an avalanche of spaghetti thoughts has clogged the pipe. Bits and pieces are falling out, too badly fragmented to be put back together again. Longer, more interesting bits appear to be stuck inside, all of them trying to get out simultaneously, none of them making it through.

The pipe has only so much width. The pipe is not friction free. And no one’s around to help calm the rush and pull them out one at a time, still in coherent form.

I can’t tell what is bothering me, but I do know that this is neither burnout nor information overload. I’ve been both places before and this is very different. I like my work. I spend up to five hours every day reading books or news feeds. My RSS aggregator has 45 feeds that I have no trouble keeping up with. This isn’t burnout or information overload.

I’ve considered spending a week meditating at some remote beach, but the isolation may only cause more damage. And then I’ve considered packing a bag and hitting the road without an itinerary, recording experiences in as much detail as possible, in prose and visuals. The idea appeals. Maybe soon.

Music and Location

I’m listening to Len’s Steal My Sunshine. For some reason, this song reminds me of the crossing of Charles & Eager Street in Baltimore, particularly the view from the south-eastern corner.

It would have made sense if I listened to this track often when in Baltimore, but I didn't.

Metacrap: Why the Semantic Web will not happen

Cory Doctorow in 2001 wrote an excellent rebuttal to the Semantic Web dream, explaining why human nature (people lie, people are lazy, people are stupid) will ensure that reliable metadata, the basis of the Semantic Web, will not be created.

Cory ends his argument by pointing out that while explicit metadata will never be trustworthy, implicit metadata, such as what Google uses for PageRank, already is. It's no wonder then that Google ignores explicitly defined metadata in HTML pages (except for things like language settings that affect how a human sees the page).

Human Protein Reference Database

The Human Protein Reference Database has been unveiled. Here's the paper published in Genome Research, and an accompanying press release. Neither paper nor press release mentions it, but the source code to the site (not including templates) is open source and available under the LGPL license.

Congratulations to the teams at Pandey Lab and Institute of Bioinformatics, who worked their asses off for over a year to make this happen.

Two people not co-authors on the paper, but who also contributed, are Sandeep Chandur (irq2), who wrote the XML parsing routines for the initial data import into Zope, and Allan Stanley (killapop), who designed the user interface.

The Library model vs. the Web model for organising knowledge

In defining the user interface for my media server project at Synapse, I'm faced with a fundamental dichotomy: what I will call the Library model of organising knowledge vs. the Web model. To illustrate, a quote from Tim Berners-Lee, et al. in Scientific American, May 17, 2001:
Traditional knowledge-representation systems typically have been centralized, requiring everyone to share exactly the same definition of common concepts such as "parent" or "vehicle." But central control is stifling, and increasing the size and scope of such a system rapidly becomes unmanageable.

Moreover, these systems usually carefully limit the questions that can be asked so that the computer can answer reliably -- or answer at all. The problem is reminiscent of Gödel's theorem from mathematics: any system that is complex enough to be useful also encompasses unanswerable questions, much like sophisticated versions of the basic paradox "This sentence is false." To avoid such problems, traditional knowledge-representation systems generally each had their own narrow and idiosyncratic set of rules for making inferences about their data. For example, a genealogy system, acting on a database of family trees, might include the rule "a wife of an uncle is an aunt." Even if the data could be transferred from one system to another, the rules, existing in a completely different form, usually could not.

Semantic Web researchers, in contrast, accept that paradoxes and unanswerable questions are a price that must be paid to achieve versatility. We make the language for the rules as expressive as needed to allow the Web to reason as widely as desired. This philosophy is similar to that of the conventional Web: early in the Web's development, detractors pointed out that it could never be a well-organized library; without a central database and tree structure, one would never be sure of finding everything. They were right. But the expressive power of the system made vast amounts of information available, and search engines (which would have seemed quite impractical a decade ago) now produce remarkably complete indices of a lot of the material out there. The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web.

In the media server, the Library/Web dichotomy applies to how the user interface is organised.

Library: Centralised. High barrier to entry. An image cannot be put in the media server unless it’s first richly annotated. No annotations, no entry. Whatever is in the media server is guaranteed to be very well described.

Web: Distributed. Low barrier to entry. First put it in the media server, then annotate it when convenient. Since the quality of annotations is suspect, a search engine that ranks results by how well they are annotated is essential.
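To make the contrast concrete, here is a minimal sketch of the two entry policies. The class names, method names and file names are all hypothetical, and the ranking (count of matching annotations) is only an assumed stand-in for whatever a real search engine would do.

```python
class LibraryStore:
    """Library model: an item is rejected unless it arrives annotated."""

    def __init__(self):
        self.items = {}

    def add(self, name, annotations):
        if not annotations:
            raise ValueError("no annotations, no entry")
        self.items[name] = set(annotations)


class WebStore:
    """Web model: anything gets in; annotations can be added later."""

    def __init__(self):
        self.items = {}

    def add(self, name, annotations=()):
        self.items[name] = set(annotations)

    def annotate(self, name, *annotations):
        self.items[name].update(annotations)

    def search(self, *terms):
        """Rank items by how many search terms their annotations match."""
        wanted = set(terms)
        scored = [(len(wanted & tags), name) for name, tags in self.items.items()]
        # Best match first; ties broken alphabetically; unmatched items dropped.
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [name for score, name in ranked if score > 0]


web = WebStore()
web.add("beach1.jpg")                       # accepted with no annotations
web.add("beach2.jpg", ["beach", "sunset"])
web.annotate("beach1.jpg", "beach")         # annotated when convenient
results = web.search("beach", "sunset")
```

In this sketch the poorly annotated item is still findable, it simply ranks lower.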

Further, there are two psychological factors that affect everyone who contributes to the media server: Discipline and Enthusiasm.

Discipline determines how seriously I take my annotation work. The Library model enforces discipline. If I don’t annotate it, it doesn’t get in. The Web model considers annotation optional, so my existing discipline determines whether I will annotate or not.

Enthusiasm determines the comfort/interest level in submitting to the media server. For example, I take photographs all the time, of varying levels of quality. I annotate them for my own reference. I’ve never considered submitting them to any kind of stock photograph archive or photography contest because, depending on my mood, I either don’t think them good enough, or I don’t see the point in making such a submission. But this lack of interest doesn’t mean my photographs aren’t good enough.

In the Library model, enthusiasm is important for an item to be submitted. In the Web model, enthusiasm is irrelevant. I annotate it for my own reference, but the very act of annotation makes it available to anyone who comes looking for it.

So which of these models should the media server use? I prefer the Web model for these reasons:

1. The Library model makes annotations the barrier to entry, but annotations don’t speak for the quality of the content itself. The barrier here is one of convenience rather than fairness. It assumes that if you think the image is high quality, you will annotate it well.

2. Management lessons from over the ages have taught us that it is far easier to enforce discipline than to create enthusiasm. The Library model requires high enthusiasm to be effective; the Web model doesn’t. This makes more sense when you consider enthusiasm as the opposite of inhibition.

3. The Web model gives users the freedom to organise as they like. Most people already have some rudimentary organisation in a folder hierarchy on their machines. This is lost in a centralised Library model but preserved in the Web model. The user’s organisation may sometimes define relationships between items that cannot be reproduced within the limited parameters of the Library model. For example, if I went to the beach last week and took a bunch of photographs, I will presumably put them all in the same folder. I may have been fascinated by something I saw and taken multiple photographs of it. These, I will put in a separate sub-folder. To another person now looking at my collection, the presence of the sub-folder clearly indicates that its contents are somehow related to each other in a way that doesn't apply to the photographs in the parent folder. The nature of this relationship is not explicitly specified but can be understood by examining the photographs (which a computer cannot do). In the Library model with its restricted system of organisation, this relationship may be entirely lost.
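The folder example in the last point can be sketched in a few lines. This is only an illustration under assumed names: the function and the file paths are hypothetical, and grouping by parent folder is just one simple way to read the user's own organisation as an implied relationship.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def implicit_groups(paths):
    """Group file paths by parent folder; each folder implies a relationship."""
    groups = defaultdict(list)
    for p in paths:
        path = PurePosixPath(p)
        groups[str(path.parent)].append(path.name)
    return dict(groups)


# Hypothetical beach trip: the sub-folder holds several shots of one subject.
photos = [
    "beach/img1.jpg",
    "beach/img2.jpg",
    "beach/crab/img3.jpg",
    "beach/crab/img4.jpg",
]
groups = implicit_groups(photos)
```

The code can tell that the sub-folder's photographs belong together, but not why; that part still needs a person looking at them.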

Grokking resources and relationships

A project I’m currently working on, a media repository for internal use at Synapse, requires that files in the repository can be marked as related to each other in one of several ways. A few examples:

If I have a file in two different formats (say, TIFF and JPEG), both files should be marked as identical to each other.

If the image was modified in some way so that it is different from the original, but not an entirely new image, then the two files should be marked as variations of each other.

If a person’s photograph is used in an advertisement, then the files should be marked as one being used within the other. In this case, the relationship is asymmetric: “uses” at one end and “used in” at the other.
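One possible way to picture these examples is a store that records every relationship from both ends. This is a rough sketch, not the model the project uses: the class, the relation labels as dictionary keys, and the file names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical labels; each maps to the label seen from the other end.
INVERSE = {
    "identical": "identical",  # symmetric: same image, different format
    "variation": "variation",  # symmetric: modified, but not a new image
    "uses": "used in",         # asymmetric pair
    "used in": "uses",
}


class RelationStore:
    """Stores labelled links between files, always recording both ends."""

    def __init__(self):
        self._links = defaultdict(set)

    def relate(self, source, relation, target):
        if relation not in INVERSE:
            raise ValueError(f"unknown relation: {relation}")
        self._links[source].add((relation, target))
        self._links[target].add((INVERSE[relation], source))

    def related(self, name):
        """Return (relation, other file) pairs as seen from this file."""
        return sorted(self._links.get(name, ()))


store = RelationStore()
store.relate("portrait.tiff", "identical", "portrait.jpg")
store.relate("advert.psd", "uses", "portrait.tiff")
```

Asking the store about `portrait.tiff` then yields both the symmetric link to the JPEG and the "used in" end of the advertisement link.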

I’ve spent the last month trying to build a structurally sound model for storing properties and defining relationships, and I have to admit, the model remains just as hazy no matter how hard I try to define it.

I’ve lately taken to reading up on other attempts at classification and at defining relationships:

Iconclass is a subject-specific international classification system for iconographic research and the documentation of images. Here’s an explanation of how it works. Iconclass was designed for classifying western art and isn’t well tuned for an archive of modern media, but it still does an excellent job. Unfortunately, Iconclass is copyrighted and requires an expensive license to even use the classification system. Maybe it’s my open source evangelist background, but I fail to understand why a resource like this is not available for free.

ISO 2788, the ISO standard for a thesaurus, defines nine relationships between words: synonym, related term, alternate term, narrower term, broader term, narrower term instantive, narrower term partitive, broader term instantive, broader term partitive. (Once again, you have to pay for a copy of the specification.)

Resource Description Framework (RDF) is a comprehensive model for describing a resource, usually written in an XML-based syntax, but is unfortunately weighed down by its own complexity (RSS 1.0 vs. 2.0 being a prominent example). FOAF is an example of an RDF-based vocabulary for defining relationships between people, using a SHA-1 hash of the email address (as a mailto: URI) to identify a person.
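For the curious, FOAF's hashed identifier is the foaf:mbox_sha1sum property, which holds the SHA-1 digest of the person's mailto: URI. A minimal sketch of computing it follows; the function name and the example address are made up for illustration.

```python
import hashlib


def mbox_sha1sum(email):
    """Return the SHA-1 hex digest of the mailto: URI for an email address."""
    uri = "mailto:" + email
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


# Hypothetical address, used only to show the call.
digest = mbox_sha1sum("someone@example.org")
```

The point of the hash is that two descriptions of the same person can be matched up without either one publishing the address itself.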

The Semantic Web project, from what I understand of it so far, wants all resources to be identifiable with a URI, but is stuck with the more fundamental problem of defining what a resource is. Shelley Powers, author of O’Reilly’s Practical RDF, has an excellent explanation. Here’s another from Richard MacManus explaining why the Semantic Web is like Moby Dick.

My current reading: Practical RDF and Information Architecture for the World Wide Web, both from O'Reilly.