- Download, unpack.
(Some docs give a 2009 release as the latest, but this 2010 one
is available, so I went with it.)
wget -nv http://crm114.sourceforge.net/tarballs/crm114-20100106-BlameMichelson.src.tar.gz
tar xf crm114-*
cd crm114-*/   # trailing slash: match only the unpacked directory, not the tarball
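Once the tarball has been unpacked next to itself, the glob crm114-* matches both the tarball and the new directory, so a script may prefer to ask tar for the directory name instead. A minimal sketch, assuming the archive keeps everything under a single top-level directory; tar_topdir is a hypothetical helper name, not part of crm114:

```shell
# tar_topdir ARCHIVE
# Print the first path component of the first archive member.
# Assumes everything in the archive lives under one top-level directory.
tar_topdir() {
    tar tf "$1" | head -n 1 | cut -d/ -f1
}

# Example use (not run here):
#   cd "$(tar_topdir crm114-20100106-BlameMichelson.src.tar.gz)"
```

The same helper would work for the tre tarball in the next step.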
- Download required library tre, unpack.
It's also available separately and for some distros, but this
version is linked from the crm page, so went with it.
wget -nv http://crm114.sourceforge.net/coolthings/tre-0.7.5.tar.bz2
tar xf tre-*
cd tre-*/   # trailing slash: match only the unpacked directory, not the tarball
- Build and install tre. For simplicity: static only, and just install
in our working directory.
./configure --prefix=`cd .. && pwd`/tre --enable-static --disable-shared &&
make &&
make install &&
cd ..
- Edit the crm Makefile and build. This crm-makefile.patch does three
things. It sets the destination to /usr/local/crm (I didn't bother with
114, for consistency with the binary name, or with a version number,
since it seems no new release is forthcoming). It compiles and links with
the local tre library we just installed here. And it omits completely
static linking, since I do not have static versions of all needed
libraries and it seems unnecessary. As far as I've seen, the current
sources work ok on 64-bit systems; the docs are inconsistent about this.
patch <crm-makefile.patch
make
- The install target does not create the target directories, so:
sysdest=/usr/local/crm # whatever you put in the Makefile
mkdir -p $sysdest/bin
make install
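A quick way to confirm the install landed where expected is to look for the executables. A minimal sketch; missing_bins is a hypothetical helper, and the names in the usage line are only the utilities used in these notes (crm, cssutil, cssdiff, cssmerge), not necessarily everything the Makefile installs:

```shell
# missing_bins DIR NAME...
# Print each NAME that is not an executable file in DIR.
# Returns 0 only if nothing is missing.
missing_bins() {
    mb_dir=$1
    shift
    mb_rc=0
    for mb_name in "$@"; do
        if [ ! -x "$mb_dir/$mb_name" ]; then
            echo "$mb_name"
            mb_rc=1
        fi
    done
    return $mb_rc
}

# Example use (not run here):
#   missing_bins "$sysdest/bin" crm cssutil cssdiff cssmerge
```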
- So much for the binaries. Copy the needed runtime files into a per-user
directory for whoever is going to try it. Since these config
files could be different for different users, the system is not set up
to share them from a common directory, as far as I can tell.
userdest=~/.crm # wherever
mkdir -p $userdest
cp mailfilter.cf $userdest/
cp mailreaver.crm mailtrainer.crm shuffle.crm maillib.crm $userdest/
touch $userdest/rewrites.mfp $userdest/priolist.mfp # empty files ok
- Small edits needed in the per-user mailfilter.cf:
$EDITOR $userdest/mailfilter.cf
- Change :spw: to any string. Don't remove the slashes; they
are the string delimiters (and colons surround variable names).
- Change :trainer_randomizer_command: for $sysdest,
escaping / with \, ending up with something like:
:trainer_randomizer_command: /\/usr\/local\/crm\/bin\/crm shuffle.crm/
- Change :trainer_invoke_command: in the same way.
- If you don't want crm to mess with incoming Subject: headers, change
:spam_flag_subject_string:,
:unsure_flag_subject_string:,
:confirm_flag_subject_string:,
to the empty value //.
- Change :decision_length: to something larger (I used
32000), since email headers nowadays can reach
16k, especially if Microsoft Exchange is involved.
- By default, all incoming mail to crm is copied to the file
allmail.txt in $userdest. That may feel safer for
a while, but it is probably good to turn it off eventually. The config
value for this is :log_to_allmail.txt:.
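The slash escaping in these edits is easy to get wrong by hand, so it can be scripted. A minimal sketch, assuming each setting sits on one line of the form ":name: /value/" as in the :trainer_randomizer_command: example; cf_escape and cf_set are hypothetical helper names, and anything after the closing slash on a rewritten line is discarded:

```shell
# cf_escape STRING
# Escape every / as \/ so STRING can sit between /.../ delimiters.
cf_escape() {
    printf '%s\n' "$1" | sed 's,/,\\/,g'
}

# cf_set FILE NAME VALUE
# Rewrite the line whose first field is :NAME: to ":NAME: /VALUE/".
# VALUE must already be escaped; it is passed through the environment
# because awk -v would interpret its backslashes.
cf_set() {
    CF_NAME=":$2:" CF_VALUE=$3 awk '
        $1 == ENVIRON["CF_NAME"] {
            print ENVIRON["CF_NAME"] " /" ENVIRON["CF_VALUE"] "/"
            next
        }
        { print }
    ' "$1" > "$1.new" && mv "$1.new" "$1"
}

# Example use (not run here):
#   cf_set "$userdest/mailfilter.cf" trainer_randomizer_command \
#       "$(cf_escape "$sysdest/bin/crm shuffle.crm")"
#   cf_set "$userdest/mailfilter.cf" decision_length 32000
```

It is worth diffing the result against the original mailfilter.cf before relying on it, since the real file may contain layouts this sketch does not anticipate.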
- By default, crm114 adds a unique “comment” to the
Message-Id: header. To turn this off, comment out one line in
maillib.crm (around line 91):
$EDITOR $userdest/maillib.crm
# call /:mungmail_add_comment:/ [Message-Id: sfid-:*:cid:]
Because crm adds its own CRM114-*: headers, changing
the Message-Id: is not necessary, and can easily break parsing or use of
Message-Id:.
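The same one-line change can be made non-interactively. A minimal sketch, assuming the call starts its line (after optional whitespace) with exactly the text shown above; comment_out_msgid is a hypothetical helper name:

```shell
# comment_out_msgid FILE
# Prefix the mungmail_add_comment call for Message-Id with "# ".
# Lines that are already commented out do not match and are left alone.
comment_out_msgid() {
    sed '/^[[:space:]]*call \/:mungmail_add_comment:\/ \[Message-Id:/ s/call/# call/' \
        "$1" > "$1.new" && mv "$1.new" "$1"
}

# Example use (not run here):
#   comment_out_msgid "$userdest/maillib.crm"
```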
- Next is the initial training to create the spam.css and
nonspam.css binary files used for filtering. The trainer wants
one message per file.
- If you already have maildirs of good mail and spam (one message per
file), put their messages in the directories train/good and train/spam
in $userdest.
- Or, if you have mbox files, my crm-util.mak
auxiliary Makefile has targets good and spam to split
them using formail
and create the dirs. As in:
cd $userdest
make -f crm-util.mak good spam
Although the docs talk about a *.tar.gz file with
pregenerated *.css files, none such is available for the current
release, as far as I could see.
- Once train/good and train/spam are populated, we
can do the training. crm-util.mak has a
target trn to do this (following recommendations in the
HOWTO), or take a look to see the command and tweak it to suit.
cd $userdest
make -f crm-util.mak trn
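Since the trainer works from whatever is in train/good and train/spam, counting the files in each is a cheap check before running or rerunning it. A minimal sketch; count_msgs is a hypothetical helper name:

```shell
# count_msgs DIR
# Print the number of regular files under DIR (one message per file).
count_msgs() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example use (not run here):
#   echo "good: $(count_msgs train/good)  spam: $(count_msgs train/spam)"
```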
- A sanity test of the {non,}spam.css files just created:
the crm utility cssdiff will compare two .css files
and report.
$sysdest/bin/cssdiff spam.css nonspam.css # should be vastly different
- Given {non,}spam.css, we can run a test from the command
line to check that everything is operating ok:
# still in $userdest
make -f crm-util.mak check # which simply does:
echo 'Hello, this is a test.' | $sysdest/bin/crm mailreaver.crm
The output should be more or less comprehensible, with a bunch of hex
ids and an X-CRM114-Status header that will probably be UNSURE,
since there is so little input in this test.
- For additional basic checks, we can try known-good and known-spam
messages from the training set to be sure they are classified correctly:
make -f crm-util.mak checkgood
...
X-CRM114-Status: GOOD ( 24.62 )
make -f crm-util.mak checkspam
...
X-CRM114-Status: SPAM ( -31.12 )
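When checking more than a couple of messages, it is handier to pull out just the classification word. A minimal sketch, assuming the header has the shape shown in the sample output above; crm_status is a hypothetical helper name:

```shell
# crm_status
# Read a message on stdin and print the word after the first
# X-CRM114-Status: line (GOOD, SPAM or UNSURE in these notes).
# Prints nothing if there is no such line.
crm_status() {
    sed -n 's/^X-CRM114-Status: *\([A-Z]*\).*/\1/p' | head -n 1
}

# Example use (not run here; msgfile is a placeholder):
#   $sysdest/bin/crm -u $userdest mailreaver.crm <msgfile | crm_status
```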
- When ready, insert crm in the mail delivery, however it suits
you. For procmail,
I did the following to deliver good mail and save spam and unsure mail in
separate mboxes for later review. The -u argument should be the
$userdest as above.
# 'f'ilter through crm, 'w'ait for it to finish (with failure messages).
# Exit status should be good.
:0 fw: mylock-crm
* < 8200000
| nice -19 /usr/local/crm/bin/crm -u /u/karl/.crm mailreaver.crm
#
# File mail per classification. If unsure, file and also go on.
:0
* ^X-CRM114-Status: GOOD
$DEFAULT
#
:0
* ^X-CRM114-Status: SPAM
mail/spamcrm
#
# `c' carbon copy, not final delivery.
:0 c
* ^X-CRM114-Status: UNSURE
mail/unsurecrm
By default, crm will only operate on messages less than 8388608
(2^23) bytes (DEFAULT_DATA_WINDOW in crm114_config.h),
hence the size limit on running it above using <, with a
margin. This can be increased at runtime with the -w option.
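The numbers behind that size condition can be checked with shell arithmetic. A minimal sketch; CRM_WINDOW, PROCMAIL_LIMIT and fits_window are hypothetical names used only for illustration:

```shell
# Default data window from the text: 2^23 bytes.
CRM_WINDOW=$((1 << 23))
# The "* < 8200000" condition from the procmail recipe.
PROCMAIL_LIMIT=8200000
echo "margin: $((CRM_WINDOW - PROCMAIL_LIMIT)) bytes"   # margin: 188608 bytes

# fits_window FILE
# Succeed if FILE is smaller than the procmail limit.
fits_window() {
    [ $(( $(wc -c < "$1") )) -lt "$PROCMAIL_LIMIT" ]
}
```

If the window is raised with -w, the procmail condition would need to be raised to match.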
- Training. Again, one message per file. The basic commands are below;
alternative methods using remailing, scripts, shortcuts for mutt, etc.,
are in the HOWTO as usual. This small kcrm-learn script will explode an
mbox argument before learning, as with the initial training above.
$sysdest/bin/crm mailreaver.crm --spam <msgfile
$sysdest/bin/crm mailreaver.crm --good <msgfile
- Pruning of reaver_cache. The cache directory
$userdest/reaver_cache grows indefinitely. Prune it from cron, e.g.:
1 1 * * * find .crm/reaver_cache -type f -mtime +1 | xargs rm -f
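A minimal sketch of the same pruning as a function that only touches regular files, so find never hands a directory to rm; prune_cache is a hypothetical helper name:

```shell
# prune_cache DIR
# Delete regular files under DIR whose age, counted in whole 24-hour
# periods, is more than one (so roughly 48 hours or older).
# Directories are left in place.
prune_cache() {
    find "$1" -type f -mtime +1 -exec rm -f {} +
}

# Example use (not run here):
#   prune_cache "$userdest/reaver_cache"
```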
- Adding slots. After running with crm for a few weeks, there were
still lots of “unsure” messages, and often of the same type
day after day. To increase the number of buckets (aka slots, aka bins),
per this message:
mv spam.css savespam.css &&
$sysdest/bin/cssutil -b -r -S 2000001 spam.css &&
$sysdest/bin/cssmerge spam.css savespam.css
and repeat for nonspam.css. The output from
cssmerge indicates the number of bins.
Running:
echo q | $sysdest/bin/cssutil foo.css | head -14
also reports the total buckets and other statistics. The -14 is there to
include the “overflow chain” information, i.e., hash
collisions; there shouldn't be many. (Without the | head you get
more output, and cssutil can do even more interactively.)
See the “Enlarging a .css file” section in the HOWTO for
more. The HOWTO says that cssmerge run with the -S argument is
equivalent to the above, but my experience is that that invocation won't
create a css with more than 1048577 buckets (i.e., 2^20+1). The
-S option (as opposed to -s) rounds up to the nearest
power of two plus one, which is recommended.
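To see what size a given -S request should produce, the rounding can be reproduced in shell arithmetic. A minimal sketch, assuming "rounds up" means the smallest value of the form 2^k+1 that is at least the requested number (my reading of the description above, not taken from the cssutil source); css_slots is a hypothetical helper name:

```shell
# css_slots N
# Print the smallest 2^k + 1 that is >= N.
css_slots() {
    cs_n=$1
    cs_p=1
    while [ $((cs_p + 1)) -lt "$cs_n" ]; do
        cs_p=$((cs_p * 2))
    done
    echo $((cs_p + 1))
}

css_slots 2000001    # 2097153, i.e. 2^21+1
css_slots 1048577    # 1048577, i.e. 2^20+1 is already of that form
```

Under that assumption, the -S 2000001 request used above would give 2097153 buckets, which can be compared with what cssutil reports.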