Tuesday, March 22, 2011
Spam-proofing email addresses
I built a job board for HasGeek earlier this month. Announcement here. Many job boards track responses to an application by asking candidates to upload a resume. I didn’t like the UX of this approach, so I used a free-form text box where employers can indicate how they want candidates to apply. Most provide an email address.
The trouble with email addresses is that leaving one in the open inevitably attracts spam. Most of us are used to obscuring it as kiran(at)hasgeek(dot)in or similar, but this is bad for usability. I wanted email addresses to appear as proper links (clicking on one should open a mail client) and yet be hidden from spam bots. Someone ran an 18-month study of nine obfuscation methods and found three that worked. I decided to apply two of them to the job board.
Consider this job listing. My email address appears at the bottom as a normal clickable link, but if you look at the page source, you’ll see this:
<a class="rot13" data-href="znvygb:xvena@unftrrx.va"><span class="y">kiran</span><span class="z">no</span><span class="y">@hasg</span><span class="z">spam</span><span class="y">eek.in</span></a>
This code uses two of the methods: garbage text that is hidden via CSS, and a ROT13 encoded link that is decoded by JavaScript.
The CSS
This declaration is included in the site’s stylesheets:
/* Spam protection */
.z {display: none;}
There is no declaration for .y. That’s a dummy.
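To see what the hidden junk buys you, compare what a scraper that strips tags without applying CSS recovers from the link text with what a browser shows. This is a minimal Python 3 sketch using the markup from the listing above. The naive_text and visible_text helpers are hypothetical stand-ins, not part of the job board or of any real scraper:

```python
import re

# Link text copied from the page source shown earlier in this post.
LINK_TEXT = (
    '<span class="y">kiran</span><span class="z">no</span>'
    '<span class="y">@hasg</span><span class="z">spam</span>'
    '<span class="y">eek.in</span>'
)

def naive_text(html):
    """Strip tags without applying CSS, as a simple scraper might."""
    return re.sub(r'<[^>]+>', '', html)

def visible_text(html, hidden_class='z'):
    """Drop the spans the stylesheet hides, then strip the remaining tags."""
    pattern = r'<span class="%s">.*?</span>' % re.escape(hidden_class)
    return naive_text(re.sub(pattern, '', html))

print(naive_text(LINK_TEXT))    # kiranno@hasgspameek.in
print(visible_text(LINK_TEXT))  # kiran@hasgeek.in
```

The scraper ends up with a syntactically valid but wrong address, while a browser that honours the stylesheet shows the real one.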
The JavaScript
This code (activated via jQuery) finds all ROT13 links and turns them into real links:
// ROT13 link handler
$(function() {
  $("a.rot13").each(function() {
    if ($(this).attr('data-href')) {
      var decoded = $(this).attr('data-href').replace(/[a-zA-Z]/g, function(c) {
        return String.fromCharCode((c<="Z"?90:122)>=(c=c.charCodeAt(0)+13)?c:c-26);
      });
      $(this).attr('href', decoded);
      $(this).removeAttr('data-href');
      $(this).removeClass('rot13');
    }
  });
});
The HTML uses data-href instead of href to avoid broken links when JavaScript is disabled: without the script, the anchor simply has no href rather than pointing at a ROT13-scrambled URL. data-href is a custom data attribute, a new feature in HTML5.
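The one-line callback in the JavaScript is dense, so here is the same character arithmetic spelled out as a Python 3 sketch. The function names are mine, and the cross-check against the standard library's rot13 codec is only there to confirm the port behaves the same way:

```python
import codecs

def rot13_char(c):
    """Rotate one ASCII letter by 13 places, mirroring the JavaScript callback."""
    limit = 90 if c <= 'Z' else 122   # character code of 'Z' or 'z'
    code = ord(c) + 13
    # Wrap around the alphabet when the shifted code runs past the limit.
    return chr(code if code <= limit else code - 26)

def rot13(text):
    """Apply ROT13 to ASCII letters and leave every other character alone."""
    return ''.join(
        rot13_char(c) if ('a' <= c <= 'z' or 'A' <= c <= 'Z') else c
        for c in text
    )

encoded = 'znvygb:xvena@unftrrx.va'
print(rot13(encoded))   # mailto:kiran@hasgeek.in
assert rot13(encoded) == codecs.decode(encoded, 'rot13')
```

Because ROT13 is its own inverse, the same function both encodes and decodes, which is why the server and the browser can each use their own implementation.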
Backend
The job board is built with the Flask microframework and Jinja2 templates. Here’s the template snippet:
<div class="section">
  <h2>Apply for this position</h2>
  <p>{{ post.how_to_apply|scrubemail(('z', 'y')) }}</p>
</div>
scrubemail is a Jinja2 filter, defined thus:
from flask import Flask, Markup, escape

app = Flask(__name__)

@app.template_filter('scrubemail')
def scrubemail_filter(data, css_junk=''):
    return Markup(scrubemail(unicode(escape(data)), rot13=True, css_junk=css_junk))
The post.how_to_apply|scrubemail(('z', 'y')) declaration in Jinja2 translates to scrubemail_filter(post.how_to_apply, ('z', 'y')) in Python. Markup is a string subclass, exposed by Flask, that marks text as HTML-safe. Calling escape returns a Markup instance, so we convert it back to unicode before passing it to the actual scrubemail function, which isn’t Markup-aware.
And finally, the scrubemail function:
import re

EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.I)

def scrubemail(data, rot13=False, css_junk=None):
    """
    Convert email addresses in text into HTML links,
    and optionally obfuscate them with ROT13 and empty CSS classes,
    to hide from spambots.

    >>> scrubemail(u"Send email to test@example.com and you are all set.")
    u'Send email to <a href="mailto:test@example.com">test@example.com</a> and you are all set.'
    >>> scrubemail(u"Send email to test@example.com and you are all set.", rot13=True)
    u'Send email to <a class="rot13" data-href="znvygb:grfg@rknzcyr.pbz">test@example.com</a> and you are all set.'
    >>> scrubemail(u"Send email to test@example.com and you are all set.", rot13=True, css_junk='z')
    u'Send email to <a class="rot13" data-href="znvygb:grfg@rknzcyr.pbz">test&#64;<span class="z">no</span>examp<span class="z">spam</span>le.com</a> and you are all set.'
    >>> scrubemail(u"Send email to test@example.com and you are all set.", rot13=False, css_junk=('z', 'y'))
    u'Send email to <a href="mailto:test@example.com"><span class="y">test&#64;</span><span class="z">no</span><span class="y">examp</span><span class="z">spam</span><span class="y">le.com</span></a> and you are all set.'
    """
    def convertemail(m):
        aclass = ' class="rot13"' if rot13 else ''
        email = m.group(0)
        link = 'mailto:' + email
        if rot13:
            # Python 2: the rot13 codec scrambles the mailto: link
            link = link.decode('rot13')
        if css_junk and len(email) > 3:
            # Split the address into three parts and put junk between them
            third = int(len(email) / 3)
            parts = (email[:third], email[third:third*2], email[third*2:])
            if isinstance(css_junk, (tuple, list)):
                css_dirty, css_clean = css_junk
                email = '<span class="%s">%s</span><span class="%s">no</span>'\
                    '<span class="%s">%s</span><span class="%s">spam</span>'\
                    '<span class="%s">%s</span>' % (
                    css_clean, parts[0], css_dirty,
                    css_clean, parts[1], css_dirty,
                    css_clean, parts[2])
            else:
                email = '%s<span class="%s">no</span>%s<span class="%s">spam</span>%s' % (
                    parts[0], css_junk, parts[1], css_junk, parts[2])
            # Write the visible @ as an HTML entity
            email = email.replace('@', '&#64;')
        # Only ROT13 links need data-href; the JavaScript turns it into href
        attr = 'data-href' if rot13 else 'href'
        return '<a%s %s="%s">%s</a>' % (aclass, attr, link, email)

    data = EMAIL_RE.sub(convertemail, data)
    return data
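The listing above is Python 2 (unicode literals, decode('rot13') on a string). For readers on Python 3, a rough port might look like the sketch below. The name scrubemail3 and the span helper are hypothetical, and three choices are my assumptions rather than the job board's code: codecs.encode stands in for the string decode call, a plain href is emitted when ROT13 is off, and the visible @ is written as the &#64; entity.

```python
import codecs
import re

EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.I)

def scrubemail3(data, rot13=False, css_junk=None):
    """Python 3 sketch of scrubemail (hypothetical name)."""
    def span(css_class, text):
        return '<span class="%s">%s</span>' % (css_class, text)

    def convertemail(m):
        email = m.group(0)
        link = 'mailto:' + email
        if rot13:
            # str has no decode() in Python 3, so go through codecs instead.
            attrs = ' class="rot13" data-href="%s"' % codecs.encode(link, 'rot13')
        else:
            # Assumption: with ROT13 off, emit a normal clickable href.
            attrs = ' href="%s"' % link
        if css_junk and len(email) > 3:
            third = len(email) // 3
            parts = [email[:third], email[third:third * 2], email[third * 2:]]
            if isinstance(css_junk, (tuple, list)):
                # Two classes: wrap the real fragments in the "clean" one.
                css_dirty, css_clean = css_junk
                parts = [span(css_clean, part) for part in parts]
            else:
                css_dirty = css_junk
            email = (parts[0] + span(css_dirty, 'no') + parts[1] +
                     span(css_dirty, 'spam') + parts[2])
            # Assumption: write the visible @ as an HTML entity.
            email = email.replace('@', '&#64;')
        return '<a%s>%s</a>' % (attrs, email)

    return EMAIL_RE.sub(convertemail, data)

print(scrubemail3('Mail test@example.com', rot13=True, css_junk=('z', 'y')))
```

As with the original, the input is expected to be HTML-escaped already, and the result should only be marked as safe markup after this function has run.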
How effective is this?
As you can see, extracting a real email address from the obfuscated version is trivial — if you know what you are looking for. It’ll take just five minutes to write a scraper that runs through the site extracting email addresses. This technique is useless on large sites that are targeted by spammers. For a small site like our job board, however, standard scrapers will fail to pick up anything.
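To make the "five minutes" claim concrete, this is roughly all a scraper written specifically for this site would need. It is a Python 3 sketch that assumes the exact markup described in this post. The extract_emails name and the regular expression are illustrative, not taken from any real tool:

```python
import codecs
import re

# Matches only the obfuscated links described in this post.
ROT13_LINK_RE = re.compile(r'<a class="rot13" data-href="([^"]+)"')

def extract_emails(html):
    """Decode ROT13 mailto links and return the bare addresses."""
    emails = []
    for encoded in ROT13_LINK_RE.findall(html):
        link = codecs.decode(encoded, 'rot13')
        if link.startswith('mailto:'):
            emails.append(link[len('mailto:'):])
    return emails

page = '<a class="rot13" data-href="znvygb:xvena@unftrrx.va">...</a>'
print(extract_emails(page))   # ['kiran@hasgeek.in']
```

The protection comes entirely from generic scrapers not bothering with site-specific rules like this one.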