SpamAssassin is an Inadequate Technology

SpamAssassin is great. It stamps out many of the efforts of the criminal and the abusively greedy to waste your time, overload mail systems, and otherwise degrade the utility of email, one of the most successful communication infrastructures ever developed. The difference between what lands in your inbox with and without SpamAssassin is night and day. And yet, and yet...

Criminals and the Abusively Greedy Define Allowed Discourse

SpamAssassin is ubiquitous, a testament to just how useful it is for the average individual. You wouldn't consider setting up a mail server without it, unless it is a part of a much larger anti-spam system, and that system probably incorporates SpamAssassin or something so similar as to make no practical difference. The community and software of spam suppression react to whatever it is that criminals and abusively greedy marketers are putting into their emails, and that ultimately filters down into SpamAssassin rules. Unfortunately these rules can't tell the difference between legitimate discussion and malicious content. They are just rules.

SpamAssassin rules (and their very similar equivalents in other systems, all of which react to the same abusive patterns of behavior) in effect define the speech that is disallowed for everyone. The outcome is that, thanks to the hard work of well-meaning people intent on defending themselves and others, it is criminals and abusively greedy individuals who set the boundaries for disallowed speech online. This is philosophically problematic, but that's somewhat beside the point when it comes to practical outcomes.

Examples From the Field

For much of the past fifteen years I have run a modestly sized mailing list that covers a range of topics impacted by the evolution of spam and anti-spam initiatives: aging, biotechnology and medical research, biotechnology investment, and similar fields. I follow best practices for list management and provide a good service for people interested in these topics. I have to be very careful with content, as does everyone these days: if any email I send out to subscribers fails to have a near-zero SpamAssassin score, then lasting deliverability issues will result with the large email providers. (I've found Mail Tester to be a great service, but if you have the time and interest it is also possible to pull together local scripts to test content against SpamAssassin prior to putting mail into the queue.) Even a single SpamAssassin score point resulting from content can be enough to cause issues with the larger anti-spam systems.
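As a rough illustration of the local-script approach, the following Python sketch pipes a draft message through the spamassassin command line tool in test mode and reads the score back from the X-Spam-Status header. It assumes SpamAssassin is installed locally and on the PATH, and that its output carries an X-Spam-Status header in the usual "score=... required=..." form. The function names and the 0.5 cutoff are illustrative placeholders, not anything standard, and a local score is only a rough proxy for what the large providers will do.

```python
import re
import subprocess
from email import message_from_string

# Matches the score and threshold in a header value such as
# "No, score=1.2 required=5.0 tests=HTML_MESSAGE autolearn=no"
_STATUS_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)")


def parse_spam_status(header_value):
    """Return (score, required) from an X-Spam-Status header value."""
    # Headers may be folded across lines, so collapse whitespace first.
    flat = " ".join(header_value.split())
    match = _STATUS_RE.search(flat)
    if match is None:
        raise ValueError("no score found in X-Spam-Status header")
    return float(match.group(1)), float(match.group(2))


def score_draft(raw_message):
    """Run a draft through a local SpamAssassin install (assumed on PATH)."""
    result = subprocess.run(
        ["spamassassin", "-t"],  # -t is test mode: mark up the message, append a report
        input=raw_message,
        capture_output=True,
        text=True,
        check=True,
    )
    status = message_from_string(result.stdout)["X-Spam-Status"]
    if status is None:
        raise ValueError("SpamAssassin output had no X-Spam-Status header")
    return parse_spam_status(status)


def safe_to_queue(score, cutoff=0.5):
    """Hypothetical gate: only queue mail whose local score is near zero."""
    return score < cutoff
```

A sending script might call score_draft on each outgoing message and hold back anything for which safe_to_queue returns False, so that the offending passage can be rewritten before the mail goes into the queue.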

Given my topics, nearly every outgoing mail has something that must be rewritten or redacted. Sometimes this is easy; sometimes it is impossible, and a portion of the content is simply removed. This is always troubling and annoying. The anti-spam infrastructure is good, but it is suppressing legitimate and useful discussion. Some examples follow:

We Can't Talk About Money Without Being Very Indirect

The ebb and flow and specific examples of investment in biotechnology are very important in any discussion of medical progress. Since fraud online is all about money, it is very hard to work around all of the rules about money and words relating to money. You can't mention specific amounts without stripping out all mention of currency, or dollar signs, for example. SpamAssassin has been getting better on this front, but other large anti-spam systems have a hair-trigger on any mention of amounts of money.

Further, words such as invest, profit, partnership, and share, all of which are very hard to avoid when talking about the biotechnology business world in any meaningful way, are very likely to result in a large spam score either on their own or in conjunction with other common phrases.

And heaven forbid you might actually be raising money for a charitable cause, such as funding specific projects in early stage life science research via a 501(c)(3) foundation. The language of donation and fundraising is a whole other minefield of common single words and short phrases that will get your emails flagged and your future deliverability degraded with the large email providers.

We Can't Talk About the Science of Drugs

A surprising amount of scientific research involves delving through the long list of approved medications in search of new uses and yet to be cataloged effects. Since most drugs have a very broad set of effects on cellular biochemistry, and were approved for use long before the advent of technologies capable of better identifying all their modes of action, it is frequently the case that entirely new uses can be found for old drugs. Unfortunately, since so much of the spam industry is devoted to fraud regarding sales of prescription drugs, it is next to impossible to talk about legitimate research associated with any of the drugs involved; how can you talk about a drug without mentioning its name at all? This is particularly frustrating now that a number of muscle-related drugs that are common offenders in the spam blacklists are being shown to have interesting effects on mechanisms linked to regeneration and aging. Yet putting any drug name in your email is rolling the dice unless you've cleared it with the SpamAssassin rules first.

We Can't Talk About a Lot of Other Science, Either

Did you know that the word soma has several specific meanings in cellular biochemistry? Beyond referring to a specific portion of cellular structure, it also appears in many modern papers on stem cells and regenerative medicine as a shorthand for the proportion of tissues made up of somatic cell populations. Yet the drug with the brand name Soma is in the SpamAssassin rules, and thus these four letters are enough to block delivery.

header SUBJECT_DRUG_GAP_S	Subject =~ /\bs.{0,1}o.{0,1}m.{0,1}a\b/i
describe SUBJECT_DRUG_GAP_S	Subject contains a gappy version of 'soma'

#soma
body __DRUGS_MUSCLE1	/(?:\b|\s)[_\W]{0,3}s[_\W]{0,3}[o0\xF2-\xF6][_\W]{0,3}m[_\W]{0,3}[a4\xE0-\xE3\xE5\xE6@][_\W]{0,3}(?:\b|\s)/i
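To see how little it takes to trip these patterns, the two rules above can be transcribed into Python and run against ordinary scientific phrasing. This is only a sketch. It assumes Python's re module treats these particular patterns the same way SpamAssassin's Perl regular expressions do, it ignores whatever decoding and normalization SpamAssassin applies to a message before matching, and the sample strings are invented for illustration.

```python
import re

# Transcribed from the SUBJECT_DRUG_GAP_S and __DRUGS_MUSCLE1 rules quoted above.
SUBJECT_DRUG_GAP_S = re.compile(r"\bs.{0,1}o.{0,1}m.{0,1}a\b", re.IGNORECASE)
DRUGS_MUSCLE1 = re.compile(
    r"(?:\b|\s)[_\W]{0,3}s[_\W]{0,3}[o0\xF2-\xF6][_\W]{0,3}m"
    r"[_\W]{0,3}[a4\xE0-\xE3\xE5\xE6@][_\W]{0,3}(?:\b|\s)",
    re.IGNORECASE,
)


def subject_hits(subject):
    """True if the subject-line pattern matches anywhere in the subject."""
    return SUBJECT_DRUG_GAP_S.search(subject) is not None


def body_hits(text):
    """True if the body pattern matches anywhere in the text."""
    return DRUGS_MUSCLE1.search(text) is not None


# Invented sample text, typical of writing on cell biology.
print(subject_hits("Germline versus soma in the biology of aging"))     # True
print(body_hits("Dendrites carry signals toward the soma of the neuron."))  # True
print(subject_hits("Somatic mutation rates in aged tissue"))            # False
```

The word has to stand alone to match, so "somatic" slips through while a plain reference to the soma does not.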

Growth hormone is widely studied in the biology of aging, connected to insulin metabolism. Some of the longest-lived mice produced by scientists are those genetically modified to lack growth hormone receptor. But thanks to the spam surrounding the fraud-riddled human growth hormone marketplace, it certainly used to be very hard to have a conversation about this research. Thankfully this is one that has become better of late, an example of the shifting bounds of fraudulent conversation moving on to other topics, and the SpamAssassin and other anti-spam rules relaxing as a result.

Further, scientific discussion uses long words. SpamAssassin looks for sequences of long words, as they don't show up in normal conversation. Unless you happen to be having a technical conversation. So it is frequently the case that random sections of a quoted scientific summary will be flagged by SpamAssassin rules simply because the words are long.

body __LONGWORDS_A	/\b(?:[a-z]{8,}[\s\.]+){6}/
body __LONGWORDS_B	/\b(?:[a-z]{6,}[\s\.]+){9}/
body __LONGWORDS_C	/\b(?:[a-z]{5,}[\s\.]+){10}/
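A quick way to get a feel for these long-word patterns is to transcribe them into Python and compare a sentence of the sort found in a scientific summary with an everyday one. This is a sketch under stated assumptions: Python's re module stands in for SpamAssassin's Perl regular expressions, no message preprocessing is modeled, and both sample sentences are invented.

```python
import re

# Transcribed from the __LONGWORDS_* rules quoted above. There is no
# case-insensitive flag, so only runs of all-lowercase words count.
LONGWORD_RULES = {
    "__LONGWORDS_A": re.compile(r"\b(?:[a-z]{8,}[\s\.]+){6}"),
    "__LONGWORDS_B": re.compile(r"\b(?:[a-z]{6,}[\s\.]+){9}"),
    "__LONGWORDS_C": re.compile(r"\b(?:[a-z]{5,}[\s\.]+){10}"),
}


def longword_hits(text):
    """Return the names of the long-word patterns that match the text."""
    return [name for name, rx in LONGWORD_RULES.items() if rx.search(text)]


science = (
    "Recent work shows that senescent fibroblasts accumulate mitochondrial "
    "dysfunction throughout progressive degenerative neurological conditions."
)
casual = "I went to the store and bought some milk for the kids today."

print(longword_hits(science))  # ['__LONGWORDS_A', '__LONGWORDS_B', '__LONGWORDS_C']
print(longword_hits(casual))   # []
```

One run of ten long lowercase words in a row is enough to match all three patterns, and that is not an unusual thing to find in a quoted abstract.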

Many More Examples Exist

What if you are a part of the legitimate online poker industry, focused on sweepstake tournaments and free poker games? That's a painful place to be when trying to send confirmation emails to your users. Or how about writing about the stock market? Or managing a newsletter for clockmakers and mechanical watch enthusiasts? There are many fields of discussion that are greatly impacted by the ongoing fight between spammers and spam suppression organizations. Most of these are too small to have any recourse when email is not delivered, or when they are blocked for long periods of time by major email services.

Long on Complaints, Short on Solutions

This post is a complaint and provides no suggested solutions, because the problem is a hard one. The people who are impacted do not have a loud enough voice and do not represent a large enough source of revenue for the largest anti-spam initiatives to pay attention. Collateral damage to deliverability is accepted in the anti-spam community, and only those with the ability to be a particularly squeaky wheel will have their issues addressed.

The broader anti-spam ecosystem long ago moved past content rules as the only way to decide on what to block, but nonetheless rule-based systems like those of SpamAssassin still contribute heavily to that decision, applied in a blanket fashion to emails that arrive in bulk - such as from mailing lists. The big anti-spam operations are unfortunately very opaque in how they act on content, and it is hard to say what is going on under the hood save through experimentation. SpamAssassin is a useful guide as to general areas of concern, but it is still the case that what Gmail or Yahoo chooses to block based on content is only going to be broadly similar to the SpamAssassin outcome, not the same.

It seems to me that another evolution of anti-spam infrastructure is still very much needed, one in which it is possible to prevent much more of the collateral damage caused by content detection rules. This may require a completely different approach, such as the long-awaited delivery micropayment model, to pick one example - the recognition that it is the free nature of the system that enables spam, and that solutions require changing the incentives. The development community feels very strongly about the righteousness of defeating the frauds and sociopaths who abuse the email system, but these foes are not defeated so long as their actions effectively determine what is allowable speech online. There is much work left to accomplish here.