Tuesday, March 22, 2011
Spam-proofing email addresses
I built a job board for HasGeek earlier this month. Announcement here. Many job boards track responses to an application by asking candidates to upload a resume. I didn’t like the UX of this approach, so I used a free-form text box where employers can indicate how they want candidates to apply. Most provide an email address.
The trouble with email addresses is that leaving one in the open inevitably attracts spam. Most of us are used to obscuring it as kiran(at)hasgeek(dot)in or similar, but this is bad for usability. I wanted email addresses to appear as proper links (clicking on one should open a mail client) and yet be hidden from spam bots. Someone had run a year-and-a-half-long study of nine different obfuscation methods and found three that worked. I decided to apply two of them to the job board.
Consider this job listing. My email address appears at the bottom as a normal clickable link, but if you look at the page source, you’ll see this:
<a class="rot13" data-href="znvygb:xvena@unftrrx.va"><span class="y">kiran</span><span class="z">no</span><span class="y">@hasg</span><span class="z">spam</span><span class="y">eek.in</span></a>
This code uses two of the methods: garbage text that is hidden via CSS, and a ROT13 encoded link that is decoded by JavaScript.
The CSS
This declaration is included in the site’s stylesheets:
/* Spam protection */
.z {display: none;}
There is no declaration for .y. That’s a dummy.
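To see what the hidden junk buys us, here is a minimal Python sketch that compares the text a tag-stripping scraper would extract with the text a browser displays once the .z rule is applied. The helper names strip_tags and visible_text are made up for this illustration, and regex-based tag handling is a simplification that only suits markup as regular as this.

```python
import re

# Markup copied from the page-source example earlier in this post.
SOURCE = ('<a class="rot13" data-href="znvygb:xvena@unftrrx.va">'
          '<span class="y">kiran</span><span class="z">no</span>'
          '<span class="y">@hasg</span><span class="z">spam</span>'
          '<span class="y">eek.in</span></a>')


def strip_tags(html):
    """Roughly what a naive scraper sees: every text node, hidden or not."""
    return re.sub(r'<[^>]+>', '', html)


def visible_text(html, hidden_class='z'):
    """Approximate the browser's view: drop the CSS-hidden spans first."""
    pattern = r'<span class="%s">.*?</span>' % re.escape(hidden_class)
    return strip_tags(re.sub(pattern, '', html))


print(strip_tags(SOURCE))    # kiranno@hasgspameek.in
print(visible_text(SOURCE))  # kiran@hasgeek.in
```

The scraper's version is still shaped like an email address, but it points at the wrong mailbox on the wrong domain.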
The JavaScript
This code (activated via jQuery) finds all ROT13 links and turns them into real links:
// ROT13 link handler
$(function() {
  $("a.rot13").each(function() {
    if ($(this).attr('data-href')) {
      var decoded = $(this).attr('data-href').replace(/[a-zA-Z]/g, function(c) {
        return String.fromCharCode((c<="Z"?90:122)>=(c=c.charCodeAt(0)+13)?c:c-26);
      });
      $(this).attr('href', decoded);
      $(this).removeAttr('data-href');
      $(this).removeClass('rot13');
    }
  });
});
The HTML uses data-href instead of href to avoid broken links when JavaScript is disabled. data-href is a custom data attribute, a new feature in HTML5.
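The character arithmetic in that handler is dense, so here is a rough Python transcription of the same idea, checked against the standard library's rot13 codec. The function names rot13_char and rot13 are my own for this sketch. It also shows that ROT13 is its own inverse, which is why the server can encode with the same operation the browser uses to decode.

```python
import codecs
import re


def rot13_char(c):
    """Rotate one ASCII letter by 13 places, wrapping within its own case."""
    limit = 90 if c <= 'Z' else 122   # ord('Z') for uppercase, ord('z') otherwise
    code = ord(c) + 13
    return chr(code if limit >= code else code - 26)


def rot13(text):
    """Apply ROT13 to letters only; digits and punctuation pass through."""
    return re.sub(r'[a-zA-Z]', lambda m: rot13_char(m.group(0)), text)


encoded = rot13('mailto:kiran@hasgeek.in')
print(encoded)         # znvygb:xvena@unftrrx.va
print(rot13(encoded))  # mailto:kiran@hasgeek.in

# The standard library codec gives the same result.
assert encoded == codecs.encode('mailto:kiran@hasgeek.in', 'rot13')
```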
Backend
The job board is built with the Flask microframework and Jinja2 templates. Here’s the template snippet:
<div class="section">
  <h2>Apply for this position</h2>
  <p>{{ post.how_to_apply|scrubemail(('z', 'y')) }}</p>
</div>
scrubemail is a Jinja2 filter, defined thus:
from flask import Flask, Markup, escape

app = Flask(__name__)

@app.template_filter('scrubemail')
def scrubemail_filter(data, css_junk=''):
    return Markup(scrubemail(unicode(escape(data)), rot13=True, css_junk=css_junk))
The post.how_to_apply|scrubemail(('z', 'y')) declaration in Jinja2 translates to scrubemail_filter(post.how_to_apply, ('z', 'y')) in Python. Markup is a string class, exposed by Flask, that marks text as HTML-safe. Calling escape returns a Markup instance, so we convert it back to unicode before passing it to the actual scrubemail function, which isn’t Markup-aware.
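The order of operations in the filter matters: escape the employer's text first, then add the links. Here is a small sketch of why, using the standard library's html.escape as a stand-in for Flask's escape and a hypothetical linkify function as a stand-in for scrubemail. Neither is the code the job board runs.

```python
import html
import re

EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.I)


def linkify(text):
    """Hypothetical stand-in for scrubemail: wrap each address in a mailto link."""
    return EMAIL_RE.sub(
        lambda m: '<a href="mailto:%s">%s</a>' % (m.group(0), m.group(0)), text)


user_text = 'Mail <b>us</b> at jobs@example.com'

# Escape first, then linkify: the employer's markup is neutralised,
# while the anchor tag we generate ourselves stays intact.
right = linkify(html.escape(user_text))
print(right)

# Linkify first, then escape: our own anchor tag gets escaped as well,
# so the page would display the tag as literal text.
wrong = html.escape(linkify(user_text))
print(wrong)
```

Because the result of the first ordering is known to be safe, wrapping it in Markup afterwards tells Jinja2 not to escape it a second time.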
And finally, the scrubemail function:
import re

EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.I)

def scrubemail(data, rot13=False, css_junk=None):
    """
    Convert email addresses in text into HTML links,
    and optionally obfuscate them with ROT13 and empty CSS classes,
    to hide from spambots.

    >>> scrubemail(u"Send email to test@example.com and you are all set.")
    u'Send email to <a href="mailto:test@example.com">test@example.com</a> and you are all set.'
    >>> scrubemail(u"Send email to test@example.com and you are all set.", rot13=True)
    u'Send email to <a class="rot13" data-href="znvygb:grfg@rknzcyr.pbz">test@example.com</a> and you are all set.'
    >>> scrubemail(u"Send email to test@example.com and you are all set.", rot13=True, css_junk='z')
    u'Send email to <a class="rot13" data-href="znvygb:grfg@rknzcyr.pbz">test@<span class="z">no</span>examp<span class="z">spam</span>le.com</a> and you are all set.'
    >>> scrubemail(u"Send email to test@example.com and you are all set.", rot13=False, css_junk=('z', 'y'))
    u'Send email to <span class="y">test@</span><span class="z">no</span><span class="y">examp</span><span class="z">spam</span><span class="y">le.com</span> and you are all set.'
    """
    def convertemail(m):
        aclass = ' class="rot13"' if rot13 else ''
        email = m.group(0)
        link = 'mailto:' + email
        if rot13:
            link = link.decode('rot13')
        if css_junk and len(email)>3:
            # Split the address into thirds and wedge hidden junk in between
            third = int(len(email) / 3)
            parts = (email[:third], email[third:third*2], email[third*2:])
            if isinstance(css_junk, (tuple, list)):
                css_dirty, css_clean = css_junk
                email = '<span class="%s">%s</span><span class="%s">no</span>'\
                    '<span class="%s">%s</span><span class="%s">spam</span>'\
                    '<span class="%s">%s</span>' % (
                    css_clean, parts[0], css_dirty,
                    css_clean, parts[1], css_dirty,
                    css_clean, parts[2])
            else:
                email = '%s<span class="%s">no</span>%s<span class="%s">spam</span>%s' % (
                    parts[0], css_junk, parts[1], css_junk, parts[2])
            email = email.replace('@', '&#64;')
        return '<a%s data-href="%s">%s</a>' % (aclass, link, email)
    data = EMAIL_RE.sub(convertemail, data)
    return data
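The function above is Python 2 code: it relies on unicode strings and on str.decode('rot13'), neither of which exists in Python 3. As a rough sketch only, here is how the core of it might look in Python 3. The names obfuscate_email and scrub are hypothetical, the sketch always applies both ROT13 and the two-class junk spans, and it leaves out the entity replacement for the @ sign.

```python
import codecs
import re

EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.I)


def obfuscate_email(email, css_dirty='z', css_clean='y'):
    """Build one anchor: ROT13-encoded link plus hidden junk between thirds."""
    link = codecs.encode('mailto:' + email, 'rot13')
    third = len(email) // 3
    parts = (email[:third], email[third:third * 2], email[third * 2:])
    junk = ('no', 'spam')
    pieces = []
    for i, part in enumerate(parts):
        pieces.append('<span class="%s">%s</span>' % (css_clean, part))
        if i < len(junk):
            # Junk text goes after the first and second thirds only.
            pieces.append('<span class="%s">%s</span>' % (css_dirty, junk[i]))
    return '<a class="rot13" data-href="%s">%s</a>' % (link, ''.join(pieces))


def scrub(text):
    """Replace every address found in already-escaped text."""
    return EMAIL_RE.sub(lambda m: obfuscate_email(m.group(0)), text)


print(scrub('Write to kiran@hasgeek.in to apply.'))
```

For kiran@hasgeek.in this reproduces the anchor shown in the page-source example near the top of the post.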
How effective is this?
As you can see, extracting a real email address from the obfuscated version is trivial — if you know what you are looking for. It’ll take just five minutes to write a scraper that runs through the site extracting email addresses. This technique is useless on large sites that are targeted by spammers. For a small site like our job board, however, standard scrapers will fail to pick up anything.
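To make both halves of that claim concrete, here is a sketch of two scrapers run against the markup from the example above. The generic one reuses the same email regex as scrubemail. The targeted one knows about the data-href trick. Both function names are invented for this illustration.

```python
import codecs
import re

EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.I)
DATA_HREF_RE = re.compile(r'data-href="([^"]+)"')

SOURCE = ('<a class="rot13" data-href="znvygb:xvena@unftrrx.va">'
          '<span class="y">kiran</span><span class="z">no</span>'
          '<span class="y">@hasg</span><span class="z">spam</span>'
          '<span class="y">eek.in</span></a>')


def naive_harvest(html):
    """Generic scraper: grab anything in the raw source that looks like an address."""
    return EMAIL_RE.findall(html)


def targeted_harvest(html):
    """Scraper written for this site: decode the ROT13 data-href values."""
    found = []
    for encoded in DATA_HREF_RE.findall(html):
        decoded = codecs.encode(encoded, 'rot13')  # ROT13 is symmetric
        if decoded.startswith('mailto:'):
            found.append(decoded[len('mailto:'):])
    return found


print(naive_harvest(SOURCE))     # ['xvena@unftrrx.va']
print(targeted_harvest(SOURCE))  # ['kiran@hasgeek.in']
```

With this particular regex the generic scraper does find something, but only the ROT13-scrambled string, which is not my address. The targeted version is about a dozen lines.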
Prateek Dayal, Mar 23, 2011 5:01:21 AM
Nice. Keep up the good work Jace. The job board looks awesome
Philip Tellis, Mar 23, 2011 9:18:35 AM
The study mentions replacing @ and . with entities, but have you considered replacing the entire email address (including mailto:) with html entities? That has the advantage that it will work even without JavaScript.
Kiran Jonnalagadda, Mar 28, 2011 2:46:42 PM
No, but I’d guess a bot searching for @ would know how to decode the rest too.
Kiran Jonnalagadda, Mar 28, 2011 2:47:24 PM
Err, that’s &#64;
Thejesh GN, Mar 27, 2011 11:30:32 PM
Useful tutorial.
But I hate the captcha.
Kiran Jonnalagadda, Mar 28, 2011 2:45:41 PM
Thanks. Wish there was a good alternative to the captcha. Maybe I should require authentication on comments henceforth.
Thejesh GN, Mar 28, 2011 5:13:08 PM
There are decent spam filters; you could probably use one of them.
Kiran Jonnalagadda, Mar 29, 2011 10:40:24 AM
Yes, but the Akismet plugin for Zine somehow refuses to work for me, and I’ve been too lazy to go digging through old code.