Javascript Obfuscation of a Mailto Link

October 19th, 2010

Programmers have their tales, much like sailors. One in particular concerns the ravening schools of email address scrapers that swim through the web, tearing each juicy morsel from the safety of its page, then conveying it down to the spam masters who dwell in the frigid, dead depths. So it is that no wise person puts a plain email address out on the web - unless of course they like being buried in spam. The oldest of my personal addresses, which have sat on webpages continuously for going on a decade now, are floating around the 99% spam mark. No joke, that.

One widely used method of defending against scrapers is not to use mailto links at all. Instead, print an obfuscated address in some format that a human can use to derive the real address - but which a scraper will not understand: "fake.address -at- exratione -dot- com", for example. This seems like giving up, however.
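For what it's worth, that text form is trivial to generate. A minimal sketch, assuming plain Javascript with no libraries; the function name obfuscateForHumans is my own invention for illustration:

```javascript
// Hypothetical helper: render an address in the "-at-" / "-dot-" text form.
function obfuscateForHumans(address) {
  var parts = address.split("@");
  if (parts.length !== 2) {
    throw new Error("expected exactly one @ in the address");
  }
  // Only the dots in the domain are rewritten; the local part is left alone.
  return parts[0] + " -at- " + parts[1].split(".").join(" -dot- ");
}

console.log(obfuscateForHumans("fake.address@exratione.com"));
```

The result is text only: there is nothing for a visitor to click on.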

To defend against scrapers while keeping a functional mailto link, you'd have to think a little about the capabilities of the scraper you want to fool.

Naive HTML scraper

A naive scraper grabs the page HTML and parses out email addresses. No processing, very simple, very fast. I don't doubt there are many, many of these things roaming around out there. It's easy to ensure that a naive scraper doesn't find your email address - just make sure it doesn't appear in the HTML. For example:

<div id="mailto">
<a href="mailto:this.will.be.replaced.by.javascript@exratione.com">Email Reason</a>
</div>
<script type="text/javascript">
$(document).ready(function() {
    // Replace the placeholder local part with the real one after page load.
    var link = $("#mailto a");
    var x = link.attr("href");
    link.attr("href", x.substr(0,7) + "reason" + x.substr(42));
});
</script>
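
To illustrate why this is enough, here is a rough sketch of what I imagine a naive scraper does: run a regular expression over the raw HTML and keep whatever matches. The pattern and the function name naiveScrape are assumptions of mine, not taken from any real scraper.

```javascript
// A deliberately simple pattern; real scrapers may well use something broader.
var EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical naive scraper: no DOM, no Javascript, just pattern matching.
function naiveScrape(html) {
  return html.match(EMAIL_PATTERN) || [];
}

// The markup as served, before any script has run.
var pageHtml =
  '<div id="mailto">' +
  '<a href="mailto:this.will.be.replaced.by.javascript@exratione.com">Email Reason</a>' +
  '</div>';

console.log(naiveScrape(pageHtml));
```

The scraper comes away with the placeholder address only, since the real one never appears in the page source.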

Javascript-enabled DOM scraper

A scraper with bigger teeth would process a page into the DOM, straightforwardly run all of the Javascript on the page, and then grep the final state of the DOM for email addresses. This sort of technology is necessary for a variety of entirely legitimate uses in this AJAX-fixated age - but you know what they say. Give a man a hammer and he'll figure out how to break windows with it.

This crude form of Javascript-enabled scraper can be fooled by requiring some sort of user input in order for the email address to be created in the DOM. For example:

<div id="mailto">
<a href="mailto:this.will.be.replaced.by.javascript@exratione.com">Email Reason</a>
</div>
<script type="text/javascript">
$(document).ready(function() {
   $("#mailto a").mouseover(function() {
      var x = $(this).attr("href");
      // Rewrite only once: the real link, "mailto:reason@exratione.com", is 27 characters long.
      if(x.length > 27) {
         $(this).attr("href", x.substr(0,7) + "reason" + x.substr(42));
      }
   });
});
</script>

DOM scraper with event processing

If you are the author of an email address scraper whose tool is already setting up a DOM tree and processing arbitrary Javascript, why not go one step further and trigger all the event-bound functions on the page as well? Traverse the DOM in order, and consider events in a user-like order on each element (i.e. don't mousedown before mouseover, that sort of thing). I would imagine that the programmers working on search engine support for AJAX already maintain far more sophisticated tools than this.
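
As a sketch of the "user-like order" idea, assuming the scraper already has a list of the event names bound to an element: sort them into a plausible sequence before firing them. The ordering list and the helper name orderEvents are my own assumptions for illustration.

```javascript
// Assumed user-like sequence: the pointer arrives, presses, releases, clicks.
var USER_LIKE_ORDER = ["mouseover", "mousemove", "mousedown", "mouseup", "click"];

// Hypothetical helper: return bound event names in the order a scraper
// might fire them. Unknown events go last, keeping their relative order.
function orderEvents(boundEvents) {
  function rank(name) {
    var i = USER_LIKE_ORDER.indexOf(name);
    return i === -1 ? USER_LIKE_ORDER.length : i;
  }
  return boundEvents
    .map(function (name, index) { return { name: name, index: index }; })
    .sort(function (a, b) {
      return rank(a.name) - rank(b.name) || a.index - b.index;
    })
    .map(function (entry) { return entry.name; });
}

console.log(orderEvents(["click", "mousedown", "mouseover"]));
```

A scraper would then dispatch each name on the element in turn, for instance with jQuery's .trigger(), and look at the DOM again afterwards.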

Even a simple event processing approach to an email address scraper would circumvent the jQuery snippet above, needless to say.

Economic considerations

The question here is whether I can quickly think of a way to distinguish between a comprehensive DOM-manipulating scraper and an actual human being. Anything more than a little thought is probably not worth it, and here is why: firstly there are a lot of unhidden email addresses out there to be scraped, and secondly there are more cost-effective means of harvesting email addresses than the use of sophisticated scrapers. Both are strong incentives for an address gatherer not to try too hard at spidering and scraping.

While I'm certain that sophisticated DOM and Javascript aware web spiders exist and are presently operating - at Google at the very least - the computational requirements mean that in comparison to simple scrapers they are slower and more expensive to operate per page examined. The important question is whether or not they remain cost effective in terms of email address discovery: does the additional effort pull in enough email addresses per unit time to make it worth it? That is interesting to speculate on, a line of thinking that touches on changes in hardware cost and the spreading use of DOM-manipulating Javascript on the web, amongst other items.
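
The cost-effectiveness question can be framed as a simple ratio: addresses gathered per unit of operating cost. A back-of-the-envelope sketch follows. The function names, the parameters, and above all the numbers in the example are invented purely for illustration; I have no real figures for any of them.

```javascript
// Addresses harvested per unit of operating cost. All inputs are hypothetical.
function addressesPerUnitCost(scraper) {
  return (scraper.pagesPerSecond * scraper.addressesPerPage) / scraper.costPerSecond;
}

// The sophisticated scraper is worth running only if its yield per unit
// cost is at least that of the simple one.
function worthUpgrading(simple, sophisticated) {
  return addressesPerUnitCost(sophisticated) >= addressesPerUnitCost(simple);
}

// Made-up numbers: the DOM scraper finds more addresses per page but
// gets through far fewer pages for the same cost.
var simpleScraper = { pagesPerSecond: 100, addressesPerPage: 1, costPerSecond: 1 };
var domScraper = { pagesPerSecond: 5, addressesPerPage: 2, costPerSecond: 1 };

console.log(worthUpgrading(simpleScraper, domScraper));
```

With these invented inputs the slower scraper loses; it would need either a much higher yield per page or much cheaper computation to come out ahead.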

Now consider that amongst the people who have your email address in their address book, half a dozen have probably already cheerfully uploaded your address to one or more grasping online services, or fallen victim to some other address book pillaging scam. Most of the people you send email to will have no incentive to keep your address private, and will hand it over to third parties without any conditions placed upon its use. From that starting point there are a hundred ways for an email address to make its way to the black hats and spam houses.

That said, if five minutes of thought and a few lines of code mean that the least sophisticated scrapers can't add my address to their stockpiles, at least I'll have the comfort of knowing that it was probably some other fool's fault once the spam starts to trickle in.