Yesterday, in preparation for deleting over 3 years of spam backlog, I did a bit of a study of my spam. Mostly this was directed at the efficacy of my spam filter, but I also looked a bit into the overall nature of my spam. I lamed out on the geek end of it and used searches in Thunderbird instead of doing any scripting, which probably made it take considerably longer.
I have a bit over 25 thousand spams in my spam folder, the vast majority of which (all but about 20) date from after my last purge in June 2004. This compares with about 2500 (one-tenth as many) non-spam emails in my inbox and archives (2 high-volume mailing lists not included, but minor mailing lists included). My system administrator has put quite a bit of effort over the years into combating spam, and for the last several years (at least since 2004) has been using the Bogofilter Bayesian spam filter. For those not familiar with Bogofilter, you "teach" it to recognize spam by telling it which mail is spam. Based on a statistical analysis of the difference between your spam and non-spam, it slowly takes over the spam-determining. You correct it when necessary, and theoretically the corrections become rarer and rarer over time. This is generally considered better than less flexible systems, because spammers are adaptable and, over time, adjust the format and content of their emails to get past those less flexible systems.
One further note: I also use procmail to filter out a small number of mailing lists and vendor emails, and these bypass Bogofilter. These add up to a relatively small number of emails, though.
So, here are some of my results.
Spam Volume
Volume has risen significantly since mid-2004, although it is very irregular. There do not appear to be any seasonal patterns. The monthly average has gone from around 300 in late '04 to over 1000 in '06 and '07. The most significant growth was in 2005, when the monthly volume grew by almost 50 emails each month on average; in the last year or so the growth has slowed to around 10 per month.
In 2004, less than 60% of my incoming mail was spam. In the last year, around 90% was spam.
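For anyone who would rather script the monthly tally than click through Thunderbird searches, here is a minimal Python sketch. It is not what I actually ran: the function names are made up for illustration, and it assumes you can pull the raw Date: headers out of your spam folder as strings.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def monthly_counts(date_headers):
    """Count messages per (year, month) from raw Date: header strings."""
    counts = Counter()
    for raw in date_headers:
        if not raw:
            continue  # no Date: header at all
        try:
            when = parsedate_to_datetime(str(raw))
        except (TypeError, ValueError):
            continue  # malformed dates are common in spam; skip them
        counts[(when.year, when.month)] += 1
    return counts


def average_monthly_growth(counts):
    """Average change in volume per step between the first and last month.

    Assumes every month in the range is present in `counts`; gaps would
    make the result overstate the per-month growth.
    """
    ordered = [counts[key] for key in sorted(counts)]
    if len(ordered) < 2:
        return 0.0
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)
```

If the spam folder is stored as an mbox file, something like `monthly_counts(msg["Date"] for msg in mailbox.mbox(path_to_spam_folder))` should feed it real data, where `path_to_spam_folder` is a placeholder for wherever your mail client keeps the file.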
Addressee
The vast majority of my spam (at least 89%) is addressed to my oldest domain, parts-unknown.com, which is also the domain I use in my return address. My taska.org domain, which has a small web presence, and which I use when I need to give out my email address verbally, accounts for at least 6% of my spam. (These numbers are probably higher because I did not investigate whether mails to "undisclosed recipients" were included.)
My system administrator has set up a special subsection of taska.org, "s.taska.org", for use in "high-spam-risk" situations, for example when we don't trust an unknown vendor not to sell our email addresses. Of the hundreds of addresses I have created this way, only one has slipped into the spam world: "orkut@s.taska.org", which has itself only been used for 5 spams.
Some 81% of my spam is addressed directly to "taska". 7% is addressed to "webmaster", which points to me. The rest are mostly doing strange things with their To: lines.
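The addressee breakdown above could be scripted along these lines. This is only a sketch under assumptions: the function names are invented for illustration, and it expects the raw To: header of each spam as a string. It counts each message at most once per domain and once per local part (the bit before the "@").

```python
from collections import Counter
from email.utils import getaddresses


def tally_recipients(to_headers):
    """Count messages by recipient domain and by local part.

    Each message counts at most once per distinct domain or local part,
    so the tallies read as "number of messages addressed to X".
    """
    total = 0
    domains = Counter()
    local_parts = Counter()
    for header in to_headers:
        total += 1
        # getaddresses copes with "Name <user@host>" and comma-separated lists;
        # entries without an '@' (e.g. undisclosed recipients) are dropped.
        parsed = getaddresses([header or ""])
        addresses = {addr.lower() for _name, addr in parsed if "@" in addr}
        domains.update({addr.rsplit("@", 1)[1] for addr in addresses})
        local_parts.update({addr.rsplit("@", 1)[0] for addr in addresses})
    return total, domains, local_parts


def percent(count, total):
    """Share of `total` as a percentage; 0.0 when there is no mail at all."""
    return 100.0 * count / total if total else 0.0
```

Because messages with no usable address still count toward the total, the percentages come out as "at least" figures, which matches how I have quoted them here.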
My conclusion is that the following things put an email address at risk for spam:
- Age and volume of use. Over time and use, there is an increasing chance that someone you have sent an email to will get an email-address-collecting spam-virus.
- Web exposure. If your domain has a web presence, and your email ID is short or guessable (let alone if you put your email address out on a public website), you will get more spam.
Things like putting your email into vendors' webforms seem to have less effect. ("Pseudospam", when you "accidentally" get signed up for email you don't want from vendors you've actually dealt with, is another story, which I'll discuss later.)
Efficacy of Bogofilter
Bogofilter has definitely improved over the years. It went from catching about 75% of my spam in 2004 to catching more than 90% in the last year. In the last year it also "caught" a small number of non-spam emails, incorrectly identifying them as spam. However, it may actually just be that Bogofilter is smarter than me, as we will see:
Bogofilter assigns every email a "bogosity" score between 0 and 1. After some experimentation, our Bogofilter has settled down to defining email with bogosity less than 0.45 to be "Ham", 0.99 and up as "Spam", and everything in between as "Unsure". In our system, Bogofilter automatically files away only the "Spam" emails; "Unsure" and "Ham" get sent to my Inbox. After tossing the "missed" spams into my "bogofilter teaching" folder, I typically sort my non-spam, non-list email into "Personal" (from friends and family) and "Vendor" (from people I do business with). Over the past few years, 3% of my mail has been personal, 5% has been from vendors, and over 90% has been spam.
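To make the three buckets concrete, here is a tiny Python sketch of the cutoffs described above. It only illustrates our thresholds (0.45 and 0.99); it is not Bogofilter's own code, and the function and parameter names are chosen just for this example.

```python
HAM_CUTOFF = 0.45   # below this, mail counts as "Ham"
SPAM_CUTOFF = 0.99  # at or above this, mail counts as "Spam"


def classify(bogosity, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a bogosity score in [0, 1] to "Ham", "Unsure", or "Spam"."""
    if not 0.0 <= bogosity <= 1.0:
        raise ValueError("bogosity must be between 0 and 1")
    if bogosity >= spam_cutoff:
        return "Spam"
    if bogosity < ham_cutoff:
        return "Ham"
    return "Unsure"


def lands_in_inbox(bogosity):
    """Only "Spam" is filed away; "Ham" and "Unsure" both reach the Inbox."""
    return classify(bogosity) != "Spam"
```

So a message scoring 0.97 would be "Unsure" and still show up in the Inbox, while one scoring 0.99 would be filed away.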
In 2006, 72% of my personal email was correctly identified by Bogofilter as "ham". 28% was marked "unsure", a total of 122 emails (no personal emails were incorrectly marked as spam).
Bogofilter did just as well with my spam. 91% was marked correctly. 9% was marked "unsure" (a total of 1224 emails), and only one was marked "ham".
Bogofilter has a much harder time with my "vendor" mail. Only 36% was marked as "ham"; 62% was marked "unsure", and 2% (12 emails) as spam.
The highest-bogosity "non-spam" mail has consistently been from vendors (like Yahoo and Lands' End) who signed me up for "opt-out" mailing lists without making that clear to me. In fact, all of the very "high bogosity" non-spam I have received falls into this category. Only once, in the very early days, did Bogofilter ever designate a "real" personal email as spam. (It was a party invite from someone I didn't know very well.) The only high-bogosity vendor mail that I wouldn't want to get lost seems to be from LinkedIn (Bogofilter seems to have a grudge against them).
This has inspired me to go through my highest-bogosity non-spam email every few months to weed out the unwanted pseudospam (which luckily usually has a working unsubscribe method) and the consistently high-bogosity, but wanted, vendors (which I can procmail filter).
Pseudospam, in general, is highly annoying, but I find that the vast majority of pseudospammers (my most common ones are online versions of paper catalogs and hotels) do have functional opt-out systems. Least likely to have opt-out systems are tiny vendors like the proprietors of one small beach hotel I visited, who probably use their Outlook address book to send out their regular mails. Annoying, but not annoying enough for me to send them an email demanding to be removed from their list :).
Overall, I'm very happy with the success of Bogofilter. I would probably be happy if Bogofilter tucked away all of my spams over 0.90 bogosity, instead of 0.99 (this would reduce the false negatives by about 500 a year, with less than a dozen, not very important, additional false positives), but it's a pretty tiny issue.
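The cutoff trade-off is easy to check with a script, assuming you have each message's bogosity score paired with your own spam/not-spam verdict. The sketch below is hypothetical (names invented, no real data included): it counts missed spam and wrongly filed wanted mail at a given cutoff, then compares two cutoffs.

```python
def filing_errors(scored_mail, spam_cutoff):
    """Return (missed_spam, filed_wanted) for one spam cutoff.

    `scored_mail` is a list of (bogosity, is_spam) pairs, where is_spam is
    the human verdict. Mail at or above the cutoff is filed away as spam.
    """
    missed_spam = sum(
        1 for score, is_spam in scored_mail if is_spam and score < spam_cutoff
    )
    filed_wanted = sum(
        1 for score, is_spam in scored_mail if not is_spam and score >= spam_cutoff
    )
    return missed_spam, filed_wanted


def compare_cutoffs(scored_mail, current=0.99, proposed=0.90):
    """Show what lowering the cutoff buys and what it costs."""
    cur_missed, cur_filed = filing_errors(scored_mail, current)
    new_missed, new_filed = filing_errors(scored_mail, proposed)
    return {
        "fewer_missed_spam": cur_missed - new_missed,
        "extra_wanted_mail_filed": new_filed - cur_filed,
    }
```

Run over a year of mail, `compare_cutoffs` would give the two numbers I estimated by hand above: how many fewer spams reach the Inbox, and how many more wanted mails get tucked away.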
Spam Topics
I did a small study of these. They're tough to study because spammers do their best to avoid keywords (it's too easy to filter by them). Some popular keywords:
Sex: 1302 hits
Job: 1222 hits
Stock: 785 hits
In the 400-500 range: account; drug; loan
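A scripted version of these keyword searches might look like the following. It is a rough sketch, not a reproduction of what Thunderbird does: it assumes a case-insensitive substring match, counts each message at most once per keyword, and takes the message text (subject, body, or both) as plain strings. The function name is made up.

```python
import re
from collections import Counter


def keyword_hits(texts, keywords):
    """Count how many texts contain each keyword (case-insensitive substring)."""
    patterns = {
        keyword: re.compile(re.escape(keyword), re.IGNORECASE)
        for keyword in keywords
    }
    hits = Counter()
    for text in texts:
        for keyword, pattern in patterns.items():
            if pattern.search(text):
                hits[keyword] += 1  # one hit per message, however often it repeats
    return hits
```

Note that a substring match will also catch words like "jobs" or "stockings", and will miss deliberate misspellings, which is exactly the trick spammers use to dodge keyword filters.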
Alright, I think that's plenty! It probably took me about 10 hours to do the analysis, and I have other things to do. I hope my system administrator is pleased with the removal of 25,000 emails from my mailbox.