Initial CRM114 installation

My bare installation steps for the crm mail filter (Wikipedia, SourceForge, mailing lists). This is all essentially extracted from the HOWTO included in the distribution; there are lots more details there, and everywhere else. Of course everything should be adjusted to local and personal needs; there are alternatives at every step.

  1. Download, unpack. (Although some docs give a 2009 release as the latest, this one is available, so I went with it.)
    wget -nv
    tar xf crm114-*
    cd crm114-*

  2. Download the required tre library, unpack. It's also available separately and packaged for some distros, but this version is linked from the crm page, so I went with it.
    wget -nv
    tar xf tre-*
    cd tre-*

  3. Build and install tre. For simplicity: static only, and just install in our working directory.
    ./configure --prefix=`cd .. && pwd`/tre --enable-static --disable-shared &&
    make &&
    make install &&
    cd ..

  4. Edit the crm Makefile and build. This crm-makefile.patch does three things: it sets the destination to /usr/local/crm (I didn't bother with 114, for consistency with the binary name, or with a version number, since it seems no new release is forthcoming); it compiles and links with the local tre library we just installed here; and it omits completely static linking, since I do not have static versions of all the needed libraries and it seems unnecessary. As far as I've seen, the current sources work ok on 64-bit systems; the docs are inconsistent about this.
    patch <crm-makefile.patch

  5. The install target does not create the target directories, so:
    sysdest=/usr/local/crm # whatever you put in the Makefile
    mkdir -p $sysdest/bin
    make install

  6. So much for the binaries. Copy the needed runtime files into a per-user directory for whoever is going to try it. Since these config files can differ between users, the system is not set up to share them from a common directory, as far as I can tell.
    userdest=~/.crm # wherever
    mkdir $userdest
    cp $userdest/
    cp mailreaver.crm mailtrainer.crm shuffle.crm maillib.crm $userdest/
    touch $userdest/rewrites.mfp $userdest/priolist.mfp # empty files ok

  7. Small edits needed in the per-user
    $EDITOR $userdest/
    1. Change :spw: to any string; don't remove the slashes, they are the string delimiters (and colons surround variable names).
    2. Change :trainer_randomizer_command: for $sysdest, escaping / with \, ending up with something like:
      :trainer_randomizer_command: /\/usr\/local\/crm\/bin\/crm shuffle.crm/
    3. Change :trainer_invoke_command: in the same way.
    4. If you don't want crm to mess with incoming Subject: headers, change :spam_flag_subject_string:, :unsure_flag_subject_string:, :confirm_flag_subject_string:, to the empty value //.
    5. Change :decision_length: to something larger (I used 32000), since email headers nowadays can reach 16k, especially if Microsoft Exchange is involved.
    6. By default, all incoming mail to crm is copied to the file allmail.txt in $userdest. That may feel safer for a while, but it is probably good to turn it off eventually. The config value for this is :log_to_allmail.txt:.
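
The escaping in items 2 and 3 can be generated instead of typed by hand. A minimal sketch, assuming the same $sysdest as in step 5; the variable name escaped is only for illustration, and the result still has to be pasted into the config file by hand:

```shell
sysdest=/usr/local/crm   # same value as in step 5

# Put a backslash before every / so the path can sit between /.../ delimiters.
escaped=$(printf '%s\n' "$sysdest/bin/crm" | sed 's:/:\\/:g')

# Print a line in the form expected for :trainer_randomizer_command:.
printf ':trainer_randomizer_command: /%s shuffle.crm/\n' "$escaped"
```

Using : as the sed delimiter avoids a second layer of escaping inside the sed expression itself.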

  8. By default, crm114 adds a unique “comment” to the Message-Id: header. To turn this off, comment out one line in maillib.crm (around line 91):
    $EDITOR $userdest/maillib.crm
    # call /:mungmail_add_comment:/ [Message-Id: sfid-:*:cid:]
    Because crm adds its own CRM114-*: headers, changing the Message-Id: is not necessary, and can easily break parsing or use of Message-Id:.

  9. Next is the initial training to create the spam.css and nonspam.css binary files used for filtering. The trainer wants one message per file. Although the docs talk about a *.tar.gz file with pregenerated *.css files, none such is available for the current release, as far as I could see.
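
If the training mail is sitting in mbox files, it has to be split into one message per file first. A rough sketch, not taken from the HOWTO: split_mbox is a hypothetical helper name, and it naively treats every line starting with "From " as a message separator, so an unescaped "From " at the start of a body line would split a message in two. formail -s (from the procmail package) is the more careful tool if that matters.

```shell
# split_mbox MBOX DIR: write each message of MBOX to DIR/msg00001, DIR/msg00002, ...
# The "From " separator line is kept as the first line of each output file.
split_mbox() {
    mkdir -p "$2" &&
    awk -v dir="$2" '
        /^From / {                      # start of a new message
            if (out) close(out)         # avoid running out of open files
            n++
            out = sprintf("%s/msg%05d", dir, n)
        }
        out { print > out }             # skip anything before the first separator
    ' "$1"
}
```

For example, split_mbox some-spam.mbox train/spam, where some-spam.mbox stands for whatever mbox holds known spam.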

  10. Once train/good and train/spam are populated, we can do the training. crm-util.mak has a target trn to do this (following recommendations in the HOWTO), or take a look to see the command and tweak it to suit.
    cd $userdest
    make -f crm-util.mak trn

  11. A sanity test of the {non,}spam.css files just created: the crm utility cssdiff will compare two .css files and report.
    $sysdest/bin/cssdiff spam.css nonspam.css # should be vastly different

  12. Given {non,}spam.css, we can run a test from the command line to check that everything is operating ok:
    # still in $userdest
    make -f crm-util.mak check # which simply does:
    echo 'Hello, this is a test.' | $sysdest/bin/crm mailreaver.crm

    The output should be more or less comprehensible, with a bunch of hex ids and an X-CRM114-Status header that will probably be UNSURE, since there is so little input in this test.

  13. For additional basic checks, we can try known-good and known-spam messages from the training set to be sure they are classified correctly:
    make -f crm-util.mak checkgood
    X-CRM114-Status: GOOD ( 24.62 )

    make -f crm-util.mak checkspam
    X-CRM114-Status: SPAM ( -31.12 )

  14. When ready, insert crm in the mail delivery, however it suits you. For procmail, I did the following to deliver good mail and save spam and unsure mail in separate mboxes for later review. The -u argument should be the $userdest as above.
    # 'f'ilter through crm, 'w'ait for it to finish (with failure messages).
    # Exit status should be good.
    :0 fw: mylock-crm
    * < 8200000
    | nice -19 /usr/local/crm/bin/crm -u /u/karl/.crm mailreaver.crm
    # File mail per classification. If unsure, file and also go on.
    * ^X-CRM114-Status: GOOD
    * ^X-CRM114-Status: SPAM
    # `c' carbon copy, not final delivery.
    :0 c
    * ^X-CRM114-Status: UNSURE

    By default, crm will only operate on messages less than 8388608 (2^23) bytes (DEFAULT_DATA_WINDOW in crm114_config.h), hence the size limit on running it above using <, with a margin. This can be increased at runtime with the -w option.
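
To see how much margin the procmail size condition leaves, the arithmetic can be checked in the shell. This is only a worked calculation using the two numbers given above; the variable names are for illustration:

```shell
limit=$((1 << 23))     # DEFAULT_DATA_WINDOW, 2^23 bytes
cutoff=8200000         # the "* < 8200000" condition in the recipe
echo "limit=$limit margin=$((limit - cutoff))"
```

This prints limit=8388608 margin=188608, so the cutoff leaves about 184k of room below the limit.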

  15. Training. Again, one msg per file. Alternative methods using remailing, scripts, shortcuts for mutt, etc., are in the HOWTO as usual. This small kcrm-learn script will explode an mbox argument before learning, as above with the training.
    $sysdest/bin/crm mailreaver.crm --spam <msgfile
    $sysdest/bin/crm mailreaver.crm --good <msgfile
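
With several messages to learn, the two commands above can be wrapped in a loop. This is a hedged sketch, not the kcrm-learn script: learn_dir and the CRM variable are hypothetical names, the default path assumes the $sysdest from step 5, and it is meant to be run from $userdest like the commands above.

```shell
# learn_dir CLASS DIR: feed every regular file in DIR to mailreaver
# as CLASS, which should be "spam" or "good". Stops at the first failure.
learn_dir() {
    class=$1
    dir=$2
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue       # skip subdirectories and an empty glob
        "${CRM:-/usr/local/crm/bin/crm}" mailreaver.crm "--$class" < "$msg" || return 1
    done
}
```

For example, learn_dir spam missed-spam/ where missed-spam/ is a directory holding one misclassified message per file.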

  16. Pruning of reaver_cache. The cache directory $userdest/reaver_cache grows indefinitely. Prune it from cron, e.g.:
    1 1 * * * find .crm/reaver_cache -mtime +1 | xargs rm -f
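
A slightly more defensive variant of the same pruning, as a sketch: prune_cache is a hypothetical function name. Restricting find to regular files leaves the cache directory itself and any subdirectories in place, and -exec ... + avoids passing file names through xargs.

```shell
# prune_cache DIR DAYS: remove regular files under DIR last modified
# more than DAYS days ago; directories are left alone.
prune_cache() {
    find "$1" -type f -mtime +"$2" -exec rm -f {} +
}

# Equivalent crontab entry, assuming the per-user directory is ~/.crm:
# 1 1 * * * find $HOME/.crm/reaver_cache -type f -mtime +1 -exec rm -f {} +
```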

  17. Adding slots. After running with crm for a few weeks, there were still lots of “unsure” messages, and often of the same type day after day. To increase the number of buckets (aka slots, aka bins), per this message:
    mv spam.css savespam.css &&
    $sysdest/bin/cssutil -b -r -S 2000001 spam.css &&
    $sysdest/bin/cssmerge spam.css savespam.css

    and repeat for nonspam.css. The output from cssmerge indicates the number of bins.

    echo q | $sysdest/bin/cssutil foo.css | head -14
    also reports the total buckets and other statistics; the -14 is to include the “overflow chain” information, i.e., hash collisions; there shouldn't be many. (Without the |head you get more output, and cssutil can do even more interactively.)

    See the “Enlarging a .css file” file in the HOWTO for more. The HOWTO says that cssmerge run with the -S argument is equivalent to the above, but my experience is that that invocation won't create a css with more than 1048577 buckets (i.e., 2^20+1). The -S option (as opposed to -s) rounds up to the nearest power of two plus one, which is recommended.
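
To get a feel for the sizes involved, here is the "power of two plus one" arithmetic as a small shell function. This is only an illustration: next_pow2_plus1 is a hypothetical name, and it assumes the rounding goes up to the smallest 2^n+1 that is at least the requested size, which may not match cssutil's behavior exactly.

```shell
# next_pow2_plus1 N: print the smallest value of the form 2^k + 1 that is >= N.
next_pow2_plus1() {
    want=$1
    p=1
    while [ $((p + 1)) -lt "$want" ]; do
        p=$((p * 2))
    done
    echo $((p + 1))
}

next_pow2_plus1 2000001
```

Under that assumption the request for 2000001 slots above comes out as 2097153 (2^21+1), twice the 1048577 (2^20+1) ceiling mentioned for cssmerge.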

Good luck. —karl.

$Date: 2020/02/09 22:38:42 $