
Migration to Exim 4, greylisting and bogofilter. - Alex Belits
My mail server in Denver had been running Exim 3 for at least five years. Spam filtering was a simple setup: Bogofilter was called by a delivery agent wrapper that I wrote to avoid running large Perl monstrosities on a resource-starved server. The wrapper also performed another function -- it fixed non-ASCII headers before passing mail on for delivery, so that Cyrus wouldn't complain about them. Filtering was done by running bogofilter on the message, checking its result and passing "-m spam" to the Cyrus delivery agent if the message was supposed to go into the spam mailbox. If the user had a "spam" mail folder under "INBOX", spam ended up there; otherwise Cyrus put it into INBOX, but the message still carried identifying headers that a mail reader could use to mark it as spam.
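
The decision step of such a wrapper can be sketched roughly as follows. This is not the actual fwrapper source, which is not shown here; the function names are made up for illustration, and passing "-a cyrus" through to the delivery agent is an assumption. The only external facts relied on are bogofilter's documented exit statuses (0 for spam, 1 for non-spam, 2 for unsure, 3 for error).

```python
# Hypothetical sketch of the wrapper's decision step; the real fwrapper is not shown here.
import subprocess

# Exit status bogofilter uses to report "this message is spam".
BOGOFILTER_SPAM = 0


def classify(message, bogofilter="bogofilter"):
    """Feed the raw message (bytes) to bogofilter on stdin and return its exit status."""
    return subprocess.run([bogofilter], input=message).returncode


def deliver_args(bogofilter_status, user):
    """Build the argument list for the Cyrus delivery agent from the verdict."""
    args = ["-a", "cyrus"]  # assumption: the authorization id is passed through as-is
    if bogofilter_status == BOGOFILTER_SPAM:
        # Ask Cyrus to file the message into the user's "spam" folder.
        args += ["-m", "spam"]
    # Non-spam, unsure and error verdicts all fall through to normal delivery.
    args.append(user)
    return args
```

Treating "unsure" and "error" the same as non-spam is a design choice in this sketch: a message that cannot be classified is delivered normally rather than risked in the spam folder.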

Bogofilter is a Bayesian filter, and it requires training on sample spam and non-spam messages to maintain the wordlists used for spam recognition. For most of the server's lifetime I prepared sample sets of messages and used them to update the wordlists, which was sufficient to keep most of the spam away. When I left Denver in May I stopped doing those updates, and a week ago, when I finally got back to it, I was faced with an enormous set -- 400,000 spam messages (some recognized by bogofilter, some manually filtered) and tens of thousands of non-spam messages. The update was long overdue -- though I couldn't find any false positives (legitimate mail erroneously marked as spam), mailboxes were flooded with spam that had passed bogofilter unscathed. However, even the update didn't bring recognition to a usable level -- apparently spammers' attempts to dilute spam with large chunks of unrelated text weren't enough to make filters useless or to create false positives, but worked sufficiently well to irritate me and the other users on that server.

I decided to try adding greylisting -- a method that relies on the fact that most spammers skimp on proper handling of errors, while legitimate servers don't. The idea is simple -- the server keeps a list of verified combinations of origination address, destination address and IP address of the server that sent email between them. Initially the list is empty. When a mail server tries to send mail for a combination that has not been verified, the mail is temporarily rejected until some time t has passed since the first delivery attempt. If the mail server retries delivery after this initial period, the mail is accepted and the new combination is added to the list. Proper handling of temporary errors is an important part of the mail protocols' reliability, so legitimate email can be expected to pass through this process with a delay of between t and a few hours (more likely shorter than longer), while most spammers' mail servers will either treat the first error as fatal or re-send a message generated on the fly, with a non-matching source address. The greylisting whitepaper explains it in more detail, but the basic idea is this simple.
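
The rule described above fits in a few lines of Python. This is only a sketch of the idea, not the code of greylistd or of any other real implementation; the class name, the method name, the 600-second default for t and the injectable clock are all assumptions made for illustration.

```python
import time


class Greylist:
    """Toy greylist keyed on (sender, recipient, client IP) triplets."""

    def __init__(self, delay=600.0, clock=time.time):
        self.delay = delay        # the time t that must pass before a retry is accepted
        self.clock = clock        # injectable so the logic can be tried without waiting
        self.first_seen = {}      # unverified triplet -> time of first delivery attempt
        self.verified = set()     # triplets that have already passed the check

    def check(self, sender, recipient, client_ip):
        """Return True to accept the message, False to reject it temporarily."""
        key = (sender, recipient, client_ip)
        if key in self.verified:
            return True
        now = self.clock()
        # Remember the first attempt; later attempts keep the original timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay:
            # The sender retried after the initial period: verify the triplet.
            self.verified.add(key)
            del self.first_seen[key]
            return True
        return False
```

A real implementation would also expire old entries and persist the lists between restarts; both are left out here to keep the rule itself visible.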

To implement this I had to upgrade Exim to version 4. The problem is that the configuration format changed significantly between Exim 3 and Exim 4, so I had to reproduce all features of the existing configuration in an Exim 4 compatible way before switching. This ended up being less straightforward than I expected. A quick look at the differences between the default Exim 3 configuration file supplied by the Debian package and my custom configuration confirmed that there were two main differences, both sadly not covered by the example configuration files for either Exim 3 or 4:
  1. Mail delivery was done by Cyrus, in this case through my wrapper, though the presence of the wrapper only affects the name of the file to run. By default the Exim 4 configuration procedure offered either mailbox format (old-style giant text files in /var/mail) or maildir (a file per message, just like Cyrus, but not indexed and tied to local users). Exim 4 comes with a configuration file for maildrop, which is close enough -- I only had to change the command line and add the line "group = mail" to reproduce the original configuration for the transport "fwrapper_pipe":
      debug_print = "T: fwrapper_pipe for $local_part@$domain"
      driver = pipe
      path = "/bin:/usr/bin:/usr/local/bin"
      command = "/usr/sbin/fwrapper -a cyrus ${local_part}"
      group = mail

    However, that was not the end of my problems. Exim 4, as opposed to Exim 3, passes the first "envelope" line ("From", space, envelope address) to the mail delivery agent, the same way that line is added to a message in a mailbox file. When the wrapper passes that to the Cyrus mail delivery agent, the envelope line remains in the chunk of text it sends over LMTP to the Cyrus server, which is not valid LMTP. I changed the fwrapper source to remove the envelope line from the header, and everything worked.

  2. Some domains served by this server were configured with a "smart user" (an equivalent of "smart host", not in any way related to the intellectual capabilities of those or other users) -- all mail sent to such a domain, regardless of user name, goes to one user who is supposed to sort it out (manually, automatically or not at all), just as a "smart host" in a typical mail configuration sorts out all outgoing mail so that other servers can simply send everything to it. The router "smartuser" ended up looking like this:
      debug_print = "R: smartuser for $local_part@$domain"
      driver = redirect
      domains = lsearch;/etc/exim/virtual
      data = ${lookup{$domain}lsearch{/etc/exim/virtual}{$value}fail}
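
What the lookup in the "smartuser" router does can be imitated in Python. This is a simplified sketch, not Exim's actual lsearch code: it ignores continuation lines and quoted keys, and the function names and sample file contents are invented for illustration. Each entry is a key (here a domain) ended by a colon or whitespace, followed by the data (here the address that receives all mail for that domain).

```python
import re

# A key ended by a colon and/or whitespace, followed by the data for that key.
_ENTRY = re.compile(r"([^\s:]+):?\s*(.*)")


def lsearch(lines, key):
    """Return the data of the first entry matching key, or None (the "fail" case)."""
    for line in lines:
        # Skip blank lines, comments and (unsupported here) continuation lines.
        if not line.strip() or line[0] == "#" or line[0].isspace():
            continue
        match = _ENTRY.match(line)
        if match and match.group(1).lower() == key.lower():
            return match.group(2).strip()
    return None


def smartuser(address, virtual_lines):
    """Redirect any address in a listed domain to that domain's single user."""
    _, _, domain = address.rpartition("@")
    return lsearch(virtual_lines, domain)
```

With a file containing the hypothetical line "example.org: alex", smartuser("anyone@example.org", lines) gives "alex" whatever the local part is, and an unlisted domain gives None, which corresponds to the router declining the address.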

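Going back to the envelope line from the first item: removing it before the message is handed to Cyrus can be sketched as below. This is an illustration only, not the change actually made to fwrapper, and the function name is hypothetical.

```python
def strip_envelope_line(message):
    """Drop a leading mbox-style "From " line from a raw message (bytes), if present."""
    # The envelope line is "From" followed by a space; the "From:" header has a
    # colon there instead, so it is never mistaken for the envelope line.
    if message.startswith(b"From "):
        _, _, rest = message.partition(b"\n")
        return rest
    return message
```

An alternative that avoids touching the wrapper may be to set the pipe transport's message_prefix option to an empty string, which should suppress the line at the source.
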
Of course, there was also the matter of actually configuring Exim 4. The server runs Debian Linux, so it is supposed to be a good idea to let Debian packages manage the upgrade and configuration. When I installed the exim4 package, the installer asked whether to move the current queue to the new location (it didn't really matter to me because the queue consisted entirely of error responses to spam), and whether to create the configuration as multiple files or as a single monolithic file. I chose the monolithic option, though I immediately found out that this only meant that Exim is presented with a monolithic configuration file, assembled on startup from multiple pieces that contain the real persistent configuration.

Even though the installer was aware of the Exim 3 mail queue, it made no attempt to copy any configuration options, and asked me to re-enter local and relayed domains, system name, etc. For delivery methods it only offered mailbox files and maildir (see above). I chose mailbox, so that I could later replace it with my wrapper. The elaborate system of configuration files was not in any way reflected in the initial configuration process either -- it only substituted a few mandatory parameters, generated the template file used to produce the ephemeral monolithic configuration file, and left it at that.

I manually edited the template file and placed the above-mentioned definitions into new files to make them compatible with this Debian-ish configuration system, re-ran the configuration generation procedure and ended up with a close equivalent of the original Exim 3 configuration. While routing for the domains was added by virtue of a file being present in the /etc/exim4/conf.d/router directory (the order of the rules is determined by the numeric prefix; I chose 550, so the file became /etc/exim4/conf.d/router/550_exim4-config_smartuser), the delivery method has to be placed into the template configuration file /etc/exim4/update-exim4.conf.conf (sic). The entry looks like this:

After editing all this, I ran update-exim4.conf.template -r and started Exim with /etc/init.d/exim4 start

The last step was to install greylistd, which is available as a Debian package made specifically for Exim 4, so its installation procedure properly attached it to the Exim configuration. After seeing it work I had to add mailing list servers to a whitelist, because they happen to use a unique envelope origination address for every email instead of setting it to the mailing list address. I didn't want to see all mailing list messages delayed, so I whitelisted their IP addresses.

I have been running this configuration for almost a week now. The amount of spam, counted before bogofilter, went down at least tenfold, and spam now mostly consists of short messages containing a random phrase and a URL. Apparently long messages are all sent by botnets, viruses and spam-specific software, while short ones are usually passed through regular mail servers -- I have found what look like signatures of Microsoft Exchange, Sendmail, Qmail and Exim in them. Bogofilter still filters most of them out, but their short size and legitimate headers make them the most difficult to filter, and I still get tens of them per day in my regular mailbox. I will see whether more spammers switch to this mode; that may prompt me to add another kind of filtering aimed specifically at that style of spam.

