SpamAssassin on the Math Cluster

What is SpamAssassin?

SpamAssassin is a mail filter that scans e-mail messages for patterns that indicate a message is spam, then adds headers to the message recording its results. You can use these headers to filter mail with whatever mail-filtering tool you prefer; chances are good that your chosen mail reader supports filters in some form. If you're reading mail directly on the math machines, pine and mutt definitely do, as do most of the graphical readers (e.g., Mozilla Mail, Evolution). As always, I encourage you to download your mail to your personal machine: most mail programs for Windows and Mac OS include filtering tools, and if you're running a *n*x, you probably already know everything you need to know about your chosen mail reader.

How Does SpamAssassin Work?

SpamAssassin applies a series of tests to each message. Each test is worth a certain number of points (positive or negative). When the tests are complete, the points are added up and the message is given a total score. That score is compared with a threshold value set in the master configuration or in the user's personal configuration file. If the score is at or above the threshold value, the message is identified as spam, and headers stating that (along with the tests that were hits) are added to the message. If the score is below the threshold, the message is labelled similarly, but with slightly less information about the tests that were hits.
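To make the arithmetic concrete, here is a minimal Python sketch of the add-up-and-compare step. It is not SpamAssassin's own code: the function name, the rule names, and the point values are all made up for illustration, and treating a score exactly equal to the threshold as spam is an assumption of the sketch.

```python
# Illustrative only: rule names and point values below are hypothetical.

def classify(rule_points, required):
    """Add up the points of every rule that hit and compare with the threshold."""
    total = round(sum(rule_points.values()), 2)
    # Assumption: a total exactly equal to the threshold counts as spam.
    return total, total >= required


hits = {
    "HYPOTHETICAL_RULE_A": 2.6,      # a rule that suggests spam
    "HYPOTHETICAL_RULE_B": 1.4,      # another rule that suggests spam
    "HYPOTHETICAL_GOOD_RULE": -1.0,  # a rule that suggests legitimate mail
}

total, is_spam = classify(hits, required=7.5)
print(f"total={total} spam={is_spam}")
```

Negative points pull the total down, which is how a message that trips a few spammy-looking rules can still come out as ham.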

SpamAssassin can “learn” from new messages that are identified as spam by humans. At the moment, the tools for teaching SpamAssassin haven't been installed on all machines, so the easiest way to get SpamAssassin to learn about new spam that slips through its filters is to send it to reportspam@math.hmc.edu. If possible, use the “resend to” or “dist” option in your mail client so that the message will be forwarded with its headers intact.

Also, only send spam to that address—any other mail will be lost, as the address will be running a special filter that modifies the spam database.

What Does SpamAssassin Look Like?

If you examine the headers of mail you receive, you will find new headers, such as

   X-spam-status: No, hits=0.8 required=7.5
	   tests=AWL,SUBJ_ALL_CAPS
	   version=2.53
   X-spam-level: 
   X-spam-checker-version: SpamAssassin 2.53 (1.174.2.15-2003-03-30-exp)

for a nonspam (“ham”) message, and

   X-spam-status: Yes, hits=17.6 required=7.5
	   tests=AWL,HTML_20_30,HTML_FONT_COLOR_BLUE,HTML_FONT_COLOR_GRAY,
		 HTML_FONT_COLOR_UNSAFE,HTML_MESSAGE,HTML_TAG_EXISTS_TBODY,
		 MIME_HEADER_CTYPE_ONLY,MISSING_HEADERS,MISSING_MIMEOLE,
		 MSG_ID_ADDED_BY_MTA_2,NO_COST,NO_QS_ASKED,ONLY_COST,
		 PRIORITY_NO_NAME,RISK_FREE,SEARCH_ENGINE_PROMO,TARGETED,
		 TO_EMPTY,X_PRIORITY_HIGH
	   version=2.53
   X-spam-level: *****************
   X-spam-checker-version: SpamAssassin 2.53 (1.174.2.15-2003-03-30-exp)
   X-spam-report: ---- Start SpamAssassin results
     17.60 points, 7.5 required;
     *  2.0 -- Sent with 'X-Priority' set to high
     *  2.6 -- To: is empty
     *  0.2 -- BODY: Only $$$
     *  1.7 -- BODY: Discusses search engine listings
     *  0.9 -- BODY: No such thing as a free lunch (3)
     *  2.0 -- BODY: Doesn't ask any questions
     *  1.0 -- BODY: Risk free.  Suuurreeee....
     *  2.8 -- BODY: Targeted Traffic / Email Addresses
     *  0.1 -- BODY: HTML has "tbody" tag
     *  1.1 -- BODY: Message is 20% to 30% HTML
     *  0.1 -- BODY: HTML font color not within safe 6x6x6 palette
     *  0.1 -- BODY: HTML included in message
     *  0.1 -- BODY: HTML font color is blue
     *  0.1 -- BODY: HTML font color is gray
     *  0.3 -- 'Message-Id' was added by a relay (2)
     *  0.1 -- Missing To: header
     *  0.5 -- Message has X-MSMail-Priority, but no X-MimeOLE
     *  0.5 -- Message has priority setting, but no X-Mailer
     *  1.4 -- 'Content-Type' found without required MIME headers
     *  0.0 -- AWL: Auto-whitelist adjustment
     ---- End of SpamAssassin results
   X-spam-flag: YES

for a spam message. (The exact rules that are triggered will be different for each message, of course.)
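If you would rather pull these values out with a script than read them by eye, something like the following Python sketch would do it. It assumes the header layout shown in the samples above (a Yes/No verdict, hits=, required=, and a tests= list that may be folded across lines); the function name parse_spam_status is my own, and the sketch also accepts score= in place of hits=, which newer SpamAssassin releases write.

```python
import re
from email import message_from_string


def parse_spam_status(raw_message):
    """Return the verdict, score, threshold and test names from X-Spam-Status.

    Returns None if the header is missing or not in the expected layout.
    """
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")  # header lookup is case-insensitive
    flat = " ".join(status.split())        # undo line folding
    head = re.match(
        r"(Yes|No),\s*(?:hits|score)=(-?[\d.]+)\s+required=(-?[\d.]+)", flat
    )
    if head is None:
        return None
    tests = []
    found = re.search(r"tests=(.*?)(?:\s+version=|$)", flat)
    if found:
        # Folding can leave spaces after commas; drop them before splitting.
        names = re.sub(r"\s+", "", found.group(1))
        tests = [name for name in names.split(",") if name]
    return {
        "is_spam": head.group(1) == "Yes",
        "hits": float(head.group(2)),
        "required": float(head.group(3)),
        "tests": tests,
    }


# A shortened version of the spam sample above.
sample = (
    "X-spam-status: Yes, hits=17.6 required=7.5\n"
    "\ttests=AWL,HTML_20_30,\n"
    "\t\tTO_EMPTY,X_PRIORITY_HIGH\n"
    "\tversion=2.53\n"
    "Subject: example\n"
    "\n"
    "body\n"
)

print(parse_spam_status(sample))
```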

If you don't see these headers, your mail client may be set up to hide them by default. Check its man page or help menu for hints on how to see all of the headers (you probably won't want to leave that option on all the time, of course).

How Can I Filter Mail Based on these Headers?

If you want to treat mail identified as spam differently than nonspam, you can create filter rules that look at the X-spam-flag or X-spam-status headers, and sort the mail into a separate folder. You should check that folder every once in a while to make sure you don't have real mail that's been misidentified as spam, and then delete the spam messages.

Exactly how to filter mail depends on the mail client you use. The e-mail section includes some pointers for using Procmail, but many mail clients have built-in mail filtering that works just about as well, and is easier to set up. Mail-filtering tools are pretty much standard in most graphical mail clients, but terminal-based mail clients such as Pine and Mutt have them, too.

sample_procmailrc is a very basic procmailrc file that saves spam messages to different folders based on how highly they score. You can use this file by saving it as .procmailrc in the root of your home directory.
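If you don't want to use Procmail, the same idea (sort by the spam flag, then by how high the score is) can be sketched in a few lines of Python using the standard mailbox module. This is not a translation of sample_procmailrc: the folder names, the 15-star cutoff, and the function names are hypothetical choices of mine, and the sketch does no file locking, so try it on a copy of a mailbox rather than on a live mail spool.

```python
import mailbox
import os
from email.message import Message

HIGH_SCORE_STARS = 15  # hypothetical cutoff for "certainly spam"


def choose_folder(msg):
    """Pick a folder name from the X-Spam-Flag and X-Spam-Level headers."""
    flag = (msg.get("X-Spam-Flag") or "").strip().upper()
    if flag != "YES":
        return "inbox"
    # X-Spam-Level carries roughly one asterisk per point of score.
    stars = (msg.get("X-Spam-Level") or "").count("*")
    return "spam-certain" if stars >= HIGH_SCORE_STARS else "spam-probable"


def sort_mbox(source_path, dest_dir):
    """Copy each message in an mbox file into dest_dir/<folder> (also mbox)."""
    counts = {}
    source = mailbox.mbox(source_path, create=False)
    try:
        for msg in source:
            folder = choose_folder(msg)
            dest = mailbox.mbox(os.path.join(dest_dir, folder))
            try:
                dest.add(msg)
                dest.flush()
            finally:
                dest.close()
            counts[folder] = counts.get(folder, 0) + 1
    finally:
        source.close()
    return counts
```

Calling sort_mbox on a copy of your inbox and a scratch directory returns a count of how many messages landed in each folder, which makes it easy to eyeball whether the cutoff is sensible before you delete anything.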

Some of My “Ham” Mail Is Being Tagged as Spam!

If you find that legitimate messages are being tagged as spam, you can send them to me to look at, or you can tinker with your personal configuration by editing the file ~/.spamassassin/user_prefs. SpamAssassin creates that file automatically the first time it runs on a message sent to you. You can get all the details by running

   % man Mail::SpamAssassin::Conf

or by taking a look at a web page from the University of Waterloo that discusses some of the most interesting configuration options.

The simplest change is to add a particular address to a “whitelist”, which includes addresses that will always be assumed to be sending “clean” mail, even if some of it does look kind of spammy.

If there's enough demand, I can set up a mailbox similar to the reportspam@math.hmc.edu address that would assume that any mail sent to it was not spam.

Training the System

For maximum performance with your mail, I highly recommend that you train the system on known ham and spam mail from your own collection of received mail. By default, SpamAssassin applies general rules that may not apply to your mail. Training it on your actual mail will help the system to understand the kinds of mail that you receive, and what you consider to be ham or spam.

To train the system, perform the following steps:

  1. Sort your mail into (at least) two mailboxes, one containing known spam and one containing known ham messages.
  2. Run the following command on your ham mailbox file:
    sa-learn --ham --mbox --showdots ~/mail/hammail
    where hammail is the name of your ham mailbox. Repeat for other ham mailboxes if you have them.

    Depending on your mail client, your mail may not be in ~/mail, and might not even be in mbox format. For example, if you use Mozilla Mail, your e-mail is stored in mbox files located in a directory such as ~/.mozilla/default/xrx9hd1u.slt/Mail/, where default is your user profile name (which will differ if you chose another name when first starting Mozilla), and xrx9hd1u will be some random string. If you have problems finding your e-mail, please ask for help.

  3. Run the same command on your spam mailbox, but substitute --spam for --ham; that is
    sa-learn --spam --mbox --showdots ~/mail/spammail
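If you have several ham and spam mailboxes, you may prefer to script the steps above. The following Python sketch just builds the same sa-learn command lines shown in the steps and, unless you ask otherwise, prints them instead of running them. The function names and the dry_run switch are my own inventions; the flags are exactly the ones used above.

```python
import os
import subprocess


def sa_learn_command(kind, mbox_path):
    """Build the sa-learn command line for one mbox file."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", "--" + kind, "--mbox", "--showdots", mbox_path]


def train(mailboxes, dry_run=True):
    """Train on a list of (kind, path) pairs; by default only print the commands."""
    commands = [
        sa_learn_command(kind, os.path.expanduser(path))
        for kind, path in mailboxes
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # Needs sa-learn on your PATH; stops at the first failure.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    train([("ham", "~/mail/hammail"), ("spam", "~/mail/spammail")])
```

Run it once as is to check the paths it prints, then call train with dry_run=False when they look right.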

Tuning

After you've been running SpamAssassin for a while and you've primed your database with sets of hand-selected spam and ham messages, you might notice some spam that still gets through.

Unfortunately, no system is perfect, and when you combine that with the spammers doing their best to defeat the filters thrown up in front of them, the default weights for SpamAssassin's rules might not be enough to catch all the spam you get.

You can tune your SpamAssassin rules by adjusting the scores assigned to messages that trip each rule. Before you do so, you should probably do some analysis of the spam you get that doesn't get classified as spam by SpamAssassin. The X-Spam-Status header includes the names of the tests that were positive for the message. For a message that's classified as spam, SpamAssassin will also give you a rundown of the points that each rule contributed to the total. Unfortunately, it only does the tallying for messages it decides were spam; finding out what a message not identified as spam would have scored is up to you.

The tests are spelled out in .cf files in /usr/share/spamassassin. The scores are in the 50_scores.cf file in that directory. Just grepping for a rule name on everything in the directory (e.g., grep HTML_MESSAGE /usr/share/spamassassin/*) should give you everything you need.

The 50_scores.cf file lists four separate scores for most rules (see the man page for why); practically speaking, you'll only need one.
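Since SpamAssassin won't tally the points for a message it decided was ham, here is a rough Python sketch of doing it yourself: read the score lines, pick one of the columns, and add up the entries for the test names listed in the message's X-Spam-Status header. The function names are hypothetical, which column applies to your setup is something to check in the man page, and the numbers in the example are invented rather than copied from a real 50_scores.cf. Rules that have no line in the score file (AWL, for instance, whose adjustment varies per message) count as zero here, so treat the result as an estimate.

```python
def parse_scores(lines, column=0):
    """Map rule name -> score from 'score NAME v1 [v2 v3 v4]' lines."""
    scores = {}
    for line in lines:
        parts = line.split("#", 1)[0].split()  # drop comments, split on spaces
        if len(parts) < 3 or parts[0] != "score":
            continue
        try:
            values = [float(value) for value in parts[2:]]
        except ValueError:
            continue  # not a plain numeric score line; skip it
        # A rule with a single score uses that score whatever the column.
        scores[parts[1]] = values[column] if column < len(values) else values[0]
    return scores


def tally(test_names, scores):
    """Add up the scores of the named tests; unknown tests count as 0."""
    return round(sum(scores.get(name, 0.0) for name in test_names), 3)


# Invented numbers, for illustration only.
example_lines = [
    "# made-up excerpt in the style of a scores file",
    "score HTML_MESSAGE 0.1 0.2 0.3 0.4",
    "score SUBJ_ALL_CAPS 0.8",
]

scores = parse_scores(example_lines, column=2)
print(tally(["HTML_MESSAGE", "SUBJ_ALL_CAPS", "AWL"], scores))
```

To use it on the real thing, pass the lines of /usr/share/spamassassin/50_scores.cf (for example, open(path).readlines()) instead of example_lines.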

After an analysis of my mail, I decided to bump the scores up for some rules by adding the following lines to my preferences file, ~/.spamassassin/user_prefs:

score HTML_MESSAGE 3.0
score MIME_HTML_ONLY 3.50
score DRUGS_ERECTILE 3.0
score DRUGS_MANYKINDS 2.0
score DRUGS_PAIN 2.0
score DRUGS_DIET 2.0
score DRUGS_ANXIETY 2.0
score DRUGS_MUSCLE 2.0
score BAYES_50 2.0
score BAYES_60 2.5
score BAYES_95 4.0
score BAYES_99 5.0
score RAZOR2_CF_RANGE_51_100 2.0
score RAZOR2_CHECK 2.0
score HELO_DYNAMIC_DHCP 4.0
score HELO_DYNAMIC_SPLIT_IP 3.880

The BAYES_99 rule is triggered if the Bayes analysis comes back with a 99% chance of the message being spam. At this point I'm feeling pretty trusting, so I bumped the score up to 5.0, which is enough to make the message spam if it doesn't match any other tests. (If it has negative scores from matching good rules, it won't end up as spam.)
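The effect of an override like this is easy to check by hand, or with a few lines of Python such as the sketch below. Everything except the score BAYES_99 5.0 line is an assumption made for the example: a threshold of 5.0, a made-up default of 3.0 for BAYES_99, and a hypothetical "good" rule worth -2.0 points.

```python
def apply_overrides(defaults, user_prefs_lines):
    """Return a copy of defaults with 'score NAME VALUE' lines applied on top."""
    merged = dict(defaults)
    for line in user_prefs_lines:
        parts = line.split("#", 1)[0].split()
        if len(parts) >= 3 and parts[0] == "score":
            merged[parts[1]] = float(parts[2])
    return merged


def verdict(hit_names, scores, required):
    """Total the scores of the rules that hit and compare with the threshold."""
    total = round(sum(scores.get(name, 0.0) for name in hit_names), 3)
    return total, total >= required


REQUIRED = 5.0  # assumed threshold for this example

# Hypothetical defaults; real values live in the .cf files.
defaults = {"BAYES_99": 3.0, "HYPOTHETICAL_GOOD_RULE": -2.0}
tuned = apply_overrides(defaults, ["score BAYES_99 5.0"])

print(verdict(["BAYES_99"], defaults, REQUIRED))  # before the override
print(verdict(["BAYES_99"], tuned, REQUIRED))     # after the override
print(verdict(["BAYES_99", "HYPOTHETICAL_GOOD_RULE"], tuned, REQUIRED))
```

With these numbers, BAYES_99 alone reaches the threshold only after the override, and a single hit on the negative rule brings the message back under it.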

The HTML_MESSAGE rule is positive if the message has a bunch of HTML in it. Before I started using SpamAssassin, I had a Procmail rule that assumed that any HTML mail was spam. That rule was right about 99.9999% of the time, so it ought to get a higher score.

I gave an even higher score to messages matching the MIME_HTML_ONLY rule—messages containing only HTML, with no text part. Those are almost certainly spam (or from people who don't know how to use e-mail, which is almost as bad).

The rest of the rules whose scores I modified are things that I thought should have higher scores. I'm never going to be buying erectile dysfunction drugs, and I really don't need to see those messages. Ever. I'm sure there are other, similar, rules that I might consider tweaking in the future.

A Note on Message Integrity

The approach that we are currently taking aims at providing useful information about the probability of a particular message being spam. That information is communicated via headers that are added to the message. No existing headers are being modified, and messages identified as probably being spam are still delivered to their recipients for action.

We could take more intrusive actions, such as modifying the subject header to contain a string such as ***** SPAM ***** or piping messages identified as spam directly to /dev/null. Because these filters can't possibly sort spam from nonspam perfectly, I prefer to err on the side of safety and allow end users to determine what they want to do with their mail. Adding headers gives you some hints about the content of a particular message—what you choose to do with the message based on that information is up to you.