Alex's spambayes filter scripts
-------------------------------

I've finally started using spambayes for my incoming mail filtering.
I've got a slightly unusual setup, so I had to write a couple scripts
to deal with the nightly retraining...

First off, let me describe how I've got things set up.  I am an
avid (and rather religious) MH user, so my mail folders are of
course stored in the MH format (directories full of single-message
files, where the filenames are numbers indicating ordering in the
folder).  I've got four mail folders of interest for this discussion:
everything, spam, newspam, and inbox.

When mail arrives, it is classified, then immediately copied in the
everything folder.  If it was classified as spam or ham, it is
trained as such, reinforcing the classification.  Then, if it was
labeled as spam, it goes into the newspam folder; otherwise it
goes into my inbox.

When I read my mail (from inbox or newspam), I move any confirmed
spam into my spam folder; ham may be deleted.  (Of course, I still
have a copy of my ham in the everything folder.)

Every night, I run a complete retraining (from cron at 2:10am);
it trains on all mail in the everything folder that is less than
4 months old.  If a given message has an identical copy in the spam
or newspam folder, then it is trained as spam; otherwise it is
trained as ham.  This does mean that unread unsures will be
treated as ham for up to a day; there's few enough of them that
I don't care.  The four-month age limit will have the effect of
expiring old mail out of the training set, which will keep the
database size fairly manageable (it's currently just under 10 meg,
with 6 days to go until I have 4 months of data).

The retraining generates a little report for me each night,
showing a graph of my ham and spam levels over time.  Here's
a sample:

| Scanning spamdir (/home/cashew/popiel/Mail/spam):
| Scanning spamdir (/home/cashew/popiel/Mail/newspam):
| Scanning everything
| sshsshsshsshsshsshsshshsshshshshsshshshshshshsshsshshsshssshsshshsshshsshshs
| sshshshshsshshsshshshshshssshshshsshsshsshshshshshshsshshhshshsshshshshssshs
| sshshsssshs
|   154
|   152|                                                             
|   144|                                                             
|   136|                                                             
|   128|                                                   h         
|   120|                                                   h      s  
|   112|                             s       ss     ss s   h   s  ss 
|   104|                             ss      ss     ss sHs h   s  ss 
|    96|                           s ss   s  sH  s  ss sHs h  Sss ss 
|    88|                    h  ss  s sss ss  sH sss ssssHHhS sSsssss 
|    80|                 s sSH ss ssssss sssssH HssssHsHHHSS sSsssss 
|    72|                 ssHSH ssssssssssssHHsHSHssHsHsHHHSSssSsssss 
|    64|      s  s  s s sHsHSHsssssssHsHsssHHsHSHssHsHsHHHSSssSsssss 
|    56|   s sss ss sssssHHHSHsHsssHsHHHHssHHsHSHHsHHHsHHHSSsHSsssss 
|    48|   ssssssssssssssHHHSHHHHssHsHHHHHsHHsHSHHsHHHsHHHSSsHSssHsss
|    40|   ssssssssssHsHHHHHSHHHHHsHsHHHHHHHHHHSHHsHHHHHHHSSsHSHsHHss
|    32|   ssHHssHsssHHHHHHHSHHHHHHHsHHHHHHHHHHSHHsHHHHHHHSSHHSHHHHHs
|    24|   ssHHHHHHHsHHHHHHHSHHHHHHHsHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
|    16|   HsHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
|     8|   HHHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHH
|     0|SSSUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
|      +------------------------------------------------------------
| 
| Total: 6441 ham, 9987 spam (60.79% spam)
| 
| real    7m45.049s
| user    5m38.980s
| sys     0m39.170s

At the top of the output it mentions what it's scanning, and has a
long line of s and h indicating progress (so it doesn't look hung
if you run it by hand).

Below is a set of overlaid bar graphs; s is for spam, h is for ham,
u is unsure.  The shorter bars are in front and capitalized.  In
the example, I have very few days where I have more ham than spam.

Finally, there's the amount of time it took to run the retraining.

My scripts are:
  bulkgraph.py
    read and train on messages, and generate the graph

  bulktrain.sh
    wrapper for bulkgraph.py, times the process and moves databases around

  procmailrc
    a slightly edited version of my .procmailrc file

When I actually use this, I put bulkgraph.py and bulktrain.py in
the root of my spambayes tree.  Minor tweaks would probably make
this unnecessary, but as a python newbie I don't know what they
are off the top of my head, and I can't be bothered to find out. ;-)
