© 2006-2015 Andrew Cooke (site) / post authors (content).

Efficient Spam Filtering With Mutt and SpamAssassin

From: andrew cooke <andrew@...>

Date: Fri, 12 Mar 2010 11:11:45 -0300

I've finally got my spam rates down to GMail levels - effectively none.
Here's how to do it.  This is a bit long and detailed, but it presents most
details of a coherent system that works well for me.


First, get Spamassassin installed and working.  In OpenSuse this means
installing the relevant packages.  I run spamd as a service and then use spamc
to call that.  This avoids the overhead of starting Spamassassin each time an
email arrives.
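
A quick way to confirm that spamd and spamc are talking to each other is to
feed a saved message to spamc by hand.  The sketch below is not part of the
original setup: the function name and the example path are made up for
illustration.  It relies only on spamc's -c flag, which checks a message,
prints score/threshold and exits with status 1 when the message is spam.

```shell
#!/bin/bash

# Hypothetical helper: ask spamd (via spamc) whether a saved message
# looks like spam.  "spamc -c" only checks; it prints "score/threshold"
# and exits with status 1 when the message is judged to be spam.
check_message() {
    local file=$1 score
    if score=$(spamc -c < "$file"); then
        echo "ham $score"
    else
        echo "spam $score"
    fi
}

# Example (the path is illustrative):
#   check_message "$HOME/mail/cur/some-message"
```

spamc is documented to exit 0 on processing failure as well, so a "ham"
answer with a score of 0/0 is worth a second look at whether spamd is
actually running.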

One reason GMail can filter spam so efficiently is that it can detect when
many people get the same email.  On a local system you can also do this, in
three different ways.  The first way is to use Vipul's Razor.  This is a
centralized service that allows you to pool resources with many other users.
It works with Spamassassin, but needs to be separately installed and
configured.

Vipul's Razor is also an OpenSuse package.  Instructions on configuring it
with Spamassassin are here - http://wiki.apache.org/spamassassin/RazorSiteWide

The second way to exploit data from other emails is to use an external DNS
blacklist.  By default, Spamassassin is configured to not use external sources
of data (like Vipul's Razor and DNS blacklists).  To change this, edit the
flags in /etc/sysconfig/spamd (this is an OpenSuse-specific detail - other
distros will use a different mechanism).

I have: SPAMD_ARGS="-d -c --allow-tell"

(I'll explain --allow-tell later; the important thing here is that -L, which
restricts spamd to local tests only, has been removed).

Also, in /etc/mail/spamassassin/local.cf, I have:

# Enable the Bayes system
use_bayes               1

# Enable Bayes auto-learning
bayes_auto_learn              1

# Enable or disable network checks
skip_rbl_checks         0
use_razor2              1
razor_config /etc/mail/spamassassin/razor/razor-agent.conf


The Bayes system mentioned above allows Spamassassin to "learn" what email is
good and what is bad.  I will describe Mutt macros that help with this later.


Next, we need to configure procmail to call Spamassassin and then filter
spam.  To do this with Mutt I use the following mail folders (I am using
maildir; you can do something similar with mboxes):

.spam - this is where I put questionable emails.  These are borderline emails
and this folder needs to be checked regularly by hand (later I will describe
how Mutt macros can simplify this process).

.0-spam - this is where I put emails that were detected as spam, but which are
not "super obviously bad".  When starting, this folder also needs to be
checked regularly (see discussion of mailing lists below), but once everything
is working, it can be left pretty much unattended - it then works as an
emergency backup so that if you incorrectly filter something, you can still
retrieve it.

/dev/null - this is where I send "super obvious" spam.

.learn-spam - this is used for Bayes (see later)

.learn-ham - this is used for Bayes (see later)

Given those, my .procmailrc file looks like this:


MAILDIR=$HOME/mail
DEFAULT=$MAILDIR/ 
LOGFILE=$HOME/log/procmail.log
LOGABSTRACT=all               

# get spamassassin to check emails
:0fw: .spamassassin.lock
* < 256000              
| spamc                 

# strong spam are discarded
:0                         
* ^X-Spam-Level: \*\*\*\*\*\*
/dev/null                    

# weak spam are kept just in case - clear this out every now and then
:0                                                                   
* ^X-Spam-Level: \*\*\*\*\*                                          
.0-spam/                                                             

# if it wasn't detected as spam, but is to a fake address, then we
# know it is spam, so learn from that                             
:0                                                                
* !^(From|To|cc|bcc)[ :].*(compute|andrew|root|webmaster|admin|postmaster).*@acooke\.org
* !^(From|To|cc|bcc)[ :].*@isti\.com
# add mailing lists below
* !^From[ :].*(snowmail_daily@...|Section@...|rforno@...|alert@...).*
{
  # save in case of screw-ups, mailing lists, etc
  :0 c
  .0-spam/
  :0
  .learn-spam/
}             

# otherwise, marginal spam goes here for review
:0                                               
* ^X-Spam-Level: \*\*                            
.spam/                                           
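
To make the thresholds in those recipes easier to see, here is a small
illustrative function that maps the value of the X-Spam-Level header (a run of
stars) to the folder the recipes above would choose.  It is only a sketch: the
function name is invented, and the address-based "fake address" recipe is
deliberately left out.

```shell
#!/bin/bash

# Illustrative only: mirror the X-Spam-Level thresholds used in the
# .procmailrc.  The argument is the header value, e.g. "***".
# The "fake address" recipe is not modelled here.
route_for_level() {
    local stars=${#1}          # number of characters = number of stars
    if   [ "$stars" -ge 6 ]; then echo "/dev/null"
    elif [ "$stars" -ge 5 ]; then echo ".0-spam/"
    elif [ "$stars" -ge 2 ]; then echo ".spam/"
    else                          echo "inbox"
    fi
}
```

The order matters in the same way as in procmail: each pattern matches "at
least this many stars", so the strictest test has to come first.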


Earlier I said there were three ways to detect spam using emails to other
people.  The third way is the "fake address" trick above - I download all
email from my ISP that is addressed to acooke.org, even though I know that
only a few addresses are actually valid.  I then use email to invalid
addresses as an extra source of known spam.
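
Because anything caught by this rule is fed to Bayes as spam, it may be worth
testing the address patterns against a few sample headers before trusting
them.  The following sketch reuses the first two patterns from the recipe with
grep; the function name is hypothetical, the mailing list condition is omitted,
and unlike procmail it looks at whatever it is given on stdin rather than at
the headers only.

```shell
#!/bin/bash

# Hypothetical check: succeed (exit 0) when the headers on stdin mention
# none of the valid addresses, i.e. when the "fake address" recipe would
# fire.  -i is used because procmail matches case-insensitively.
looks_fake() {
    local headers
    headers=$(cat)
    if printf '%s\n' "$headers" | grep -Eiq \
        '^(From|To|cc|bcc)[ :].*(compute|andrew|root|webmaster|admin|postmaster).*@acooke\.org'
    then
        return 1
    fi
    if printf '%s\n' "$headers" | grep -Eiq '^(From|To|cc|bcc)[ :].*@isti\.com'
    then
        return 1
    fi
    return 0
}

# Example:
#   printf 'To: sales@acooke.org\n' | looks_fake && echo "would be learnt as spam"
```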


With the above configured you should see Spamassassin being called correctly
in the logs (and Vipul's Razor being used too).
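
With LOGABSTRACT=all, procmail writes a short summary of each delivery to the
log, including a "Folder:" line.  A rough sketch for counting deliveries per
folder is shown below.  The function name is invented, and the exact log
layout (folder path as the second field, maildir deliveries ending in
new/<filename>) is an assumption that should be checked against your own log.

```shell
#!/bin/bash

# Hypothetical summary of a procmail log: count deliveries per folder.
# Assumes log abstract lines of the form "  Folder: <path>  <size>".
folder_counts() {
    awk '/^ +Folder: / {
             f = $2
             sub(/\/new\/.*$/, "/", f)   # maildir: drop "new/<filename>"
             n[f]++
         }
         END { for (f in n) print n[f], f }' "$1" | sort -rn
}

# Example:
#   folder_counts "$HOME/log/procmail.log"
```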


Next, some Mutt macros that help simplify all this:

macro index S "<tag-prefix><save-message>=.learn-spam<enter>" "move to learn-spam"
macro pager S "<save-message>=.learn-spam<enter>" "move to learn-spam"
macro index H "<tag-prefix><copy-message>=.learn-ham<enter>" "copy to learn-ham"
macro pager H "<copy-message>=.learn-ham<enter>" "copy to learn-ham"


These are used together with these crontab entries:

*/3 * * * * /home/andrew/bin/spam
*/3 * * * * /home/andrew/bin/ham


And these scripts (this is why --allow-tell was needed for spamd - it lets
these scripts update the server with new information):

> cat spam
#!/bin/bash

# learn every message in .learn-spam as spam, then delete it
for f in /home/andrew/mail/.learn-spam/{cur,new}/*
do
    [ -f "$f" ] || continue    # nothing to do if the folder is empty
    spamc -L spam < "$f" > /dev/null
    rm "$f"
done

> cat ham
#!/bin/bash

# learn every message in .learn-ham as ham, then delete it
for f in /home/andrew/mail/.learn-ham/{cur,new}/*
do
    [ -f "$f" ] || continue    # nothing to do if the folder is empty
    spamc -L ham < "$f" > /dev/null
    rm "$f"
done


The idea here is that anything moved to .learn-spam (by pressing the S key) is
then learnt by the system as spam, while anything copied to .learn-ham (by
pressing the H key) is learnt as ham (non-spam).  Note that S moves the
message, so it also deletes it from the current folder; H leaves it in place.

In practice this means that you can use S to delete spam in your inbox, or in
.spam, and the system will learn from it.  Similarly, if you see something in
.spam or .0-spam that should not be there, you can use H to "unlearn" it (you
must then also copy it manually to wherever you want to keep it).


Finally, a note on mailing lists.  When you subscribe to a new mailing list it
will not be listed in the .procmailrc above, and so will be sent to .0-spam
(and learnt as spam).  You'll realise that the email is missing, add the list
to .procmailrc, and use H + copy to correct things.  That's a nuisance, but it
happens quite infrequently so I haven't tried to simplify it.

Oh, and also, flag a pile of known good emails as ham.  Without this it takes
the Bayes system a while to get started.
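
One way to do that bulk flagging is to run an existing folder of good mail
through spamc without deleting anything.  By default Spamassassin does not use
Bayes scores until it has learnt 200 ham and 200 spam messages
(bayes_min_ham_num and bayes_min_spam_num), so seeding it this way gets past
that threshold sooner.  The function below is a sketch along the lines of the
cron scripts; its name and the example folder are made up.

```shell
#!/bin/bash

# Hypothetical helper: teach spamd every message in a maildir folder as
# "ham" or "spam" without removing the messages.
# Usage: learn_dir ham|spam /path/to/maildir-folder
learn_dir() {
    local kind=$1 dir=$2 f
    for f in "$dir"/cur/* "$dir"/new/*
    do
        [ -f "$f" ] || continue      # skip unmatched globs
        # spamc -L reports the outcome through its exit status
        # (learnt / already learnt); that status is ignored here
        spamc -L "$kind" < "$f" > /dev/null || true
    done
}

# Example (the folder name is illustrative):
#   learn_dir ham "$HOME/mail/.archive"
```

As with the cron scripts, this needs spamd to be started with --allow-tell.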

Andrew
