How sendmail slowed down mozilla.org's email

mozilla.org processes its email on a powerful server that also functions as the web server for http://www.mozilla.org/. Neither the email function nor the web server function is typically very taxing on the machine, although something (probably the web server) occasionally pushes the load average up into the 70s, which doesn't seem to slow down the machine significantly.

The mail processing on the machine consists of receiving email from the server that is the MX host for the mozilla.org domain (which does some spam filtering), handling mailing list expansion (which involves passing list messages to mailman, which then puts them back into sendmail's queue by connecting to localhost on port 25), and alias expansion. Some of the lists have rather large numbers of subscribers, and with viruses and other problems these days, subscribers' mail hosts occasionally become unreachable for extended periods of time.

The way sendmail typically operates its main mail queue (as opposed to the client mail queue, which is almost unused on this machine) is as follows. sendmail listens for connections on port 25 as long as the load average is less than RefuseLA. When it receives an incoming message, it puts that message in the main mail queue, and, if the load average is less than QueueLA (which in typical configurations is smaller than RefuseLA), it immediately spawns a sendmail process to attempt to send that message and only that message. If that attempt to send fails, the message stays in the queue.
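The decision described above can be sketched as a small function. This is only an illustrative model of the behavior as I understand it, not sendmail's actual code: the names handle_incoming and Action are made up for the example, and the thresholds 8 and 12 in the usage lines are arbitrary illustrative values.

```python
from enum import Enum


class Action(Enum):
    REFUSE = "refuse the connection"
    QUEUE_ONLY = "queue the message, no immediate delivery attempt"
    QUEUE_AND_DELIVER = "queue the message and spawn a delivery attempt"


def handle_incoming(load_average: float, queue_la: float, refuse_la: float) -> Action:
    """Hypothetical model of what happens to a message arriving on port 25."""
    # Connections are only accepted while the load is below RefuseLA.
    if load_average >= refuse_la:
        return Action.REFUSE
    # An immediate delivery attempt is only spawned while the load is below QueueLA.
    if load_average >= queue_la:
        return Action.QUEUE_ONLY
    return Action.QUEUE_AND_DELIVER


# Illustrative thresholds only (QueueLA=8, RefuseLA=12).
for load in (3.0, 10.0, 15.0):
    print(load, handle_incoming(load, queue_la=8.0, refuse_la=12.0).value)
```

Note that in this model, setting queue_la above refuse_la means the queue-only branch is never taken: any message that is accepted at all also gets an immediate delivery attempt.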

Then, at intervals based on a parameter given on the command line, sendmail spawns a queue runner process that makes a list of all the messages in the queue at the time it starts (I'm not sure in what order) and iterates over each one, attempting to send it if sending that message hasn't been attempted in the interval specified by the MinQueueAge option. This queue runner process terminates if the load average goes above QueueLA.
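A single queue run, as described above, might be modeled roughly like this. It is a sketch under assumptions, not sendmail's implementation: QueuedMessage, run_queue, get_load, and try_send are hypothetical names, the snapshot is processed in plain list order (the real order is unknown to me), and the load is checked once before each message.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QueuedMessage:
    msg_id: str
    last_attempt: Optional[float] = None  # time of last attempt in seconds; None = never tried


def run_queue(
    queue: List[QueuedMessage],
    now: float,
    min_queue_age: float,
    queue_la: float,
    get_load: Callable[[], float],
    try_send: Callable[[QueuedMessage], bool],
) -> List[str]:
    """Hypothetical model of one queue run; returns the ids that were attempted."""
    snapshot = list(queue)  # only messages present when the run starts are considered
    attempted: List[str] = []
    for msg in snapshot:
        # The runner terminates (it does not pause) once the load exceeds QueueLA.
        if get_load() > queue_la:
            break
        # Skip messages that were already attempted within MinQueueAge.
        if msg.last_attempt is not None and now - msg.last_attempt < min_queue_age:
            continue
        msg.last_attempt = now
        attempted.append(msg.msg_id)
        if try_send(msg):
            queue.remove(msg)  # delivered; failed messages stay in the queue
    return attempted
```

In this model, everything after the point where the load crosses QueueLA is left untouched until some later run reaches it.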

What delayed our email was that, when a virus was spreading and some of our list recipients' mail servers could not be reached, the queue accumulated large numbers of messages. The queue runners then got through only a small part of the queue before a bump in the load average terminated them, so some messages in the queue went many hours without any attempt to send them. A message ended up in the queue with no process specifically trying to send it in two cases: when it couldn't be sent immediately upon receipt, and when it was received while the load average was greater than QueueLA but smaller than RefuseLA. Many such messages were therefore held up much longer than necessary. The problem was exacerbated when messages were accumulating in the queues before ours (at our MX host or in mailman). sendmail appears to be much more efficient at taking in messages from either source when it is in queue-only mode (when the load average is between QueueLA and RefuseLA), so any backup in those upstream queues was flushed to us primarily while the load average was greater than QueueLA, even if that was only a small percentage of the time.
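A toy simulation makes the starvation effect concrete. It rests on several simplifying assumptions that are mine, not measured facts: every queued message is undeliverable, each run walks the queue in the same stable order, MinQueueAge is shorter than the interval between runs, and a load bump kills each runner after a fixed number of attempts. The function name simulate_runs is hypothetical.

```python
from typing import List


def simulate_runs(queue_len: int, attempts_per_run: int, runs: int) -> List[int]:
    """Count delivery attempts per queue position over repeated, truncated queue runs."""
    attempts = [0] * queue_len
    for _ in range(runs):
        # Each runner starts again from the head of the queue and is
        # terminated by a load bump after attempts_per_run messages.
        for pos in range(min(attempts_per_run, queue_len)):
            attempts[pos] += 1
    return attempts


counts = simulate_runs(queue_len=10, attempts_per_run=3, runs=5)
starved = sum(1 for c in counts if c == 0)
print(counts, "never attempted:", starved)
```

Under these assumptions the first few messages are retried on every run, while everything behind them is never attempted, however many runs are started.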

Possible solutions using current sendmail

Once I understood the above, I saw two possible solutions to our problem given the current sendmail:

1. Set QueueLA significantly higher than RefuseLA, at a level the system rarely reaches (rather than the usual setting of slightly lower than RefuseLA). This allows queue runners to get through the whole queue, while RefuseLA still offers protection against denial of service by mailbombing.
2. Instead of using sendmail's -q option to spawn queue runners at intervals, spawn them from a crontab with -OQueueLA=100.

We're currently using the former option. We're also using a RefuseLA much higher than the default since the machine is so powerful, and we're starting new queue runners every 5 minutes, much more often than the default.

Design flaws in sendmail

I think this shows two design flaws in sendmail's queueing model:

1. Most importantly, when the load average gets too high, queue runners should pause instead of exiting so that queue runners reach each message in the queue.
2. It would probably also be good if QueueLA were split into two separate options: one for how to handle incoming mail and one for when an existing queue runner should stop or pause.
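The difference between exiting and pausing can be illustrated with a small model. This is only a sketch of the proposed behavior, not anything sendmail implements: messages_reached is a hypothetical function, the load is sampled once per time step, and one message is attempted per step while the load is acceptable.

```python
from typing import Sequence


def messages_reached(
    load_trace: Sequence[float], queue_len: int, queue_la: float, pause: bool
) -> int:
    """Return how many queue entries one runner gets to, given a per-step load trace."""
    reached = 0
    for load in load_trace:
        if reached == queue_len:
            break  # the whole queue has been processed
        if load > queue_la:
            if pause:
                continue  # proposed behavior: wait out the spike, keep our position
            break  # behavior described in this article: the runner exits
        reached += 1
    return reached


trace = [1.0, 1.0, 9.0, 1.0, 1.0, 1.0]  # one brief load spike at the third step
print("exit on high load: ", messages_reached(trace, 4, 8.0, pause=False))
print("pause on high load:", messages_reached(trace, 4, 8.0, pause=True))
```

With this trace, the exiting runner reaches 2 of the 4 messages, while the pausing runner reaches all 4 after waiting out the spike.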


LDB, dbaron@dbaron.org, 2004-07-30