Cruising through my mailbox just now, I happened to glance at a piece of spam before deleting it, and did a double-take:
The reason I was startled is that the kawaii address (for feedback from my "Chibi Jesus" page) is one that I exempted from spam filtering about a year back. I chose an unused address at my domain, did not use it for any purpose or attach it to any outbound mail, and published it nowhere except for a single web page, where it was protected from address harvesting by the Javascript munging recommended by Project Honey Pot.
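For anyone who hasn't seen this kind of munging, here is a minimal sketch of the general idea. It is not Project Honey Pot's actual code; the function names and the example.org address are made up for illustration. The address is stored as character codes, so the literal string never appears in the page source, and a script rebuilds it in the visitor's browser.

```javascript
// Hypothetical helper names, for illustration only.

// Build-time step: turn an address into a list of character codes, so the
// literal "name@domain" string never appears in the page source.
function encodeAddress(address) {
  return Array.from(address, (ch) => ch.charCodeAt(0));
}

// Run-time step (in the visitor's browser): rebuild the address from the codes.
function decodeAddress(codes) {
  return String.fromCharCode(...codes);
}

// Produce the markup that a page script would write into the document.
function buildMailtoLink(codes, label) {
  const address = decodeAddress(codes);
  return '<a href="mailto:' + address + '">' + label + '</a>';
}

// Placeholder address; only the numeric codes would be embedded in the page.
const codes = encodeAddress("someone@example.org");
const link = buildMailtoLink(codes, "e-mail me");
```

On a real page the result of `buildMailtoLink` would be handed to something like `document.write` or assigned to an element's `innerHTML`. A harvester that only reads the raw HTML sees a list of numbers; anything that actually runs the script sees the address.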
Here, as of a month ago when it wasn't being spammed, was the only reference (WARNING: HIDEOUS MIDI MUSIC) to that address on the Web:
(An e-mail link here has been hidden in Javascript. If you have Javascript turned off, please use the contact form linked at the bottom of the page.)
Should be pretty freakin' bulletproof, right? After all, as Project Honey Pot noted with no apparent sense of irony, "It should be noted that both of these techniques are likely to remain sound for some time to come. Harvesters that interpret the Javascript on every page they encounter would face a substantial risk of getting stuck in infinite loops or crashing due to malformed Javascript. ... This is likely beyond the current computing power of a legitimate company like Google."
The problem is that, if a legitimate company like Google does apply the computing power to it, the spammers don't have to expend the effort: they merely have to crawl the Google results.
And, alarmingly, this seems to be what has started to happen.
At first I thought that the address had been either guessed or reposted somewhere, and I ran a Google search for kawaii@tomorrowlands.org to explore this. The only result to pop up was my own page, and the text summary of the page read:
The source code of the Google results page shows the address bare: "kawaii@tomorrowlands.org"
The source code of the Google-cached page is identical to mine (i.e. no raw address; the Javascript is preserved); the cache was taken May 12, 2009. It appears that the caching itself doesn't break the munging. There must be something about the excerpting process that does the trick.
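If you want to run the same comparison on your own pages, a rough self-audit could look like the sketch below. The regular expression is a deliberately loose "shaped like an e-mail address" pattern, not a full address parser, and the sample strings are invented stand-ins for a munged page source and a search-result summary.

```javascript
// Loose pattern for "things shaped like e-mail addresses". This is an
// assumption for illustration, not a complete RFC 5322 matcher.
const EMAIL_SHAPED = /[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;

// Return every address-shaped substring found in a blob of text.
function findBareAddresses(text) {
  return text.match(EMAIL_SHAPED) || [];
}

// Invented sample: page source where the address exists only as char codes.
const mungedSource =
  '<script>document.write(String.fromCharCode(' +
  '117,115,101,114,64,101,120,97,109,112,108,101,46,111,114,103));</script>';

// Invented sample: a results-page summary where the script output was rendered.
const renderedSnippet =
  "Send feedback to user@example.org or use the contact form.";

const inSource = findBareAddresses(mungedSource);
const inSnippet = findBareAddresses(renderedSnippet);
```

With these samples, `inSource` comes back empty while `inSnippet` contains the bare address, which mirrors what I saw: the cached page is clean and the excerpt is not.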
At first I couldn't believe my eyes. Was this coincidence? I went through my recently deleted e-mail and rechecked all of the spam headers.
I have a similar whitelisted-and-munged address that I use only for WikkaWiki announcements, which has only been posted on my own wiki, protected similarly. It has also started receiving spam, and investigation turned up the same results. At this point the evidence is pretty damning.
The first spam I still have for kawaii is dated June 8; it's likely that Google's behavior change dates from before then, and the spammers are only now beginning to take advantage of this new potential. The spam started slowly and is now up to several messages per day -- word is probably spreading amongst the bad guys.
So.
Webmasters: Time to re-spamproof your site. A damn useful tool has just dropped out of the toolbox.
PLEASE NOTE: I have disabled the e-mail address referred to by this post. To contact me regarding this post, please write to [the first three letters of this journal name] [the dash symbol, '-'] [mail] [at-sign] tomorrowlands [dot.] org, or leave a comment below.
UPDATE: Two pieces of additional information I'd like to pull out from comments:
1. Even though the sample search I provided was for the compromised e-mail address, the spammer does NOT need to previously know your e-mail address in order to Google it. They just have to search for things shaped like e-mail addresses and skim the cream of the results.
2. There is anecdotal evidence that pages which pull their decode function from a separate .js file have not been broken. (Yet.)
UPDATE 2: Welcome to /. readers!
More discussion in the Slashdot thread.