mozilla.org processes its email on a powerful server that also functions as the web server for http://www.mozilla.org/. Neither the email function nor the web server function is typically very taxing on the machine, although something (probably the web server) occasionally pushes the load average up into the 70s, which doesn't seem to slow down the machine significantly.
The mail processing on the machine consists of receiving email from the server that is the MX host for the mozilla.org domain (which does some spam filtering), handling mailing list expansion (which involves passing list messages to mailman, which then puts messages back into sendmail's queue by connecting to localhost on port 25), and alias expansion. Some of the lists have rather large numbers of subscribers, and with viruses and other problems these days, the subscribers' mail hosts occasionally become unreachable for extended periods of time.
The way sendmail typically handles its main mail queue (as opposed to the client mail queue, which is almost unused on this machine) is as follows. sendmail listens for connections on port 25 as long as the load average is less than RefuseLA. When it receives an incoming message, it puts that message in the main mail queue, and, if the load average is less than QueueLA (which in typical configurations is smaller than RefuseLA), it immediately spawns a sendmail process to attempt to send that message and only that message. If that attempt to send fails, the message stays in the queue.
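The intake behavior just described can be written down as a small decision function. This is only a model of the description above, not sendmail's actual code; the names `Intake` and `intake_mode` and the threshold values in the demo loop are made up for illustration.

```python
from enum import Enum


class Intake(Enum):
    REFUSE = "refuse connection"
    QUEUE_ONLY = "queue only"
    QUEUE_AND_DELIVER = "queue and attempt delivery now"


def intake_mode(load_avg: float, queue_la: float, refuse_la: float) -> Intake:
    """Model of what happens to an incoming message at a given load average."""
    # Connections are accepted only while the load is below RefuseLA.
    if load_avg >= refuse_la:
        return Intake.REFUSE
    # The message is always queued; a delivery process for that one
    # message is spawned only while the load is below QueueLA.
    if load_avg >= queue_la:
        return Intake.QUEUE_ONLY
    return Intake.QUEUE_AND_DELIVER


if __name__ == "__main__":
    # Hypothetical thresholds, chosen only to show all three outcomes.
    for load in (2.0, 10.0, 15.0):
        print(load, intake_mode(load, queue_la=8, refuse_la=12).value)
```

With QueueLA below RefuseLA there is a band of load averages in which mail is accepted but nothing tries to deliver it until a queue runner reaches it.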
Then, at intervals based on a parameter given on the command line, sendmail spawns a queue runner process that makes a list of all the messages in the queue at the time it starts (I'm not sure in what order) and iterates over each one, attempting to send it if sending that message hasn't been attempted in the interval specified by the MinQueueAge option. This queue runner process terminates if the load average goes above QueueLA.
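A single queue run, as described, might be sketched like this. It is a simplified model rather than sendmail's implementation: `QueuedMessage`, `run_queue`, and the callbacks `load_average` and `try_send` are hypothetical names, and the snapshot is walked in list order because the real ordering is not known here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QueuedMessage:
    msg_id: str
    last_attempt: Optional[float] = None  # seconds; None means never attempted


def run_queue(
    queue: List[QueuedMessage],
    now: float,
    min_queue_age: float,
    queue_la: float,
    load_average: Callable[[], float],
    try_send: Callable[[QueuedMessage], bool],
) -> List[str]:
    """Perform one queue run; return the ids whose delivery was attempted."""
    attempted = []
    snapshot = list(queue)  # only messages present when the run starts
    for msg in snapshot:
        # The runner gives up entirely once the load exceeds QueueLA.
        if load_average() > queue_la:
            break
        # Skip messages that were tried more recently than MinQueueAge.
        if msg.last_attempt is not None and now - msg.last_attempt < min_queue_age:
            continue
        msg.last_attempt = now
        attempted.append(msg.msg_id)
        if try_send(msg):
            queue.remove(msg)  # delivered; failures stay in the queue
    return attempted
```

The `break` is the important part: everything after the point where the load crosses QueueLA is left untouched until some later run gets that far.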
What caused our email to be delayed was that, when a virus was spreading and some mail servers of our list recipients could not be reached, the queue accumulated large numbers of messages. The queue runners then got through only a small part of the queue before they were terminated because of a bump in the load average. This meant that no attempt was made to send some of the messages in the queue for many hours in succession. Any message that couldn't be sent immediately upon receipt, or that was received while the load average was greater than QueueLA but smaller than RefuseLA, ended up in the queue without a process that was specifically trying to send that message, so many messages were held up much longer than necessary. This problem was exacerbated when messages were accumulating in queues before ours, because sendmail appears to be much more efficient at taking in messages (either from our MX host or from mailman) when it is in queue-only mode (when the load average is between QueueLA and RefuseLA). Any backup in the mailman queue or in the queue at mozilla.org's MX host would therefore be flushed primarily while the load average was greater than QueueLA, even if that was only a small percentage of the time.
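A rough back-of-the-envelope model shows how the tail of the queue can go unattempted. It rests on several assumptions that are not established above: every runner walks the queue in the same order from the head, no message is skipped because of MinQueueAge, nothing is delivered (all hosts unreachable), and a runner dies at the first load spike after it starts. The function name and all the numbers are hypothetical.

```python
from typing import Iterable


def furthest_position_reached(
    queue_len: int,
    attempts_per_minute: int,
    run_starts: Iterable[int],
    spike_times: Iterable[int],
) -> int:
    """Return how far into the queue any runner got before being terminated.

    Times are in minutes. A runner started at time s lives until the first
    load spike after s; with no later spike it finishes the whole queue.
    """
    spikes = sorted(spike_times)
    furthest = 0
    for start in run_starts:
        later = [t for t in spikes if t > start]
        if not later:
            reached = queue_len
        else:
            lifetime = later[0] - start
            reached = min(queue_len, lifetime * attempts_per_minute)
        furthest = max(furthest, reached)
    return furthest


if __name__ == "__main__":
    # Hypothetical figures: 5000 queued messages, 20 delivery attempts per
    # minute, a new runner every 30 minutes for 4 hours, and a load spike
    # every 20 minutes.
    reached = furthest_position_reached(
        queue_len=5000,
        attempts_per_minute=20,
        run_starts=range(0, 240, 30),
        spike_times=range(5, 250, 20),
    )
    print(f"positions ever attempted: {reached} of 5000")
```

Under these assumptions no runner survives more than 15 minutes, so the messages beyond the first few hundred positions are never tried, however many runners are started.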
Once I understood the above, I saw two possible solutions to our problem given the current sendmail:
Set QueueLA significantly higher than RefuseLA, at a level the system rarely reaches (rather than the norm of slightly lower). This allows queue runners to complete the queue, while RefuseLA still offers protection against denial of service by mailbombing.
-OQueueLA=100.
We're currently using the former option. We're also using a RefuseLA much higher than the default since the machine is so powerful, and we're starting new queue runners every 5 minutes, much more often than the default.
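The effect of raising QueueLA above RefuseLA can be checked with a small model. The threshold pairs below are hypothetical (the actual values used on the machine are not given here); the peak load of 79 stands in for "up into the 70s".

```python
from typing import Iterable, Set


def reachable_modes(queue_la: float, refuse_la: float, loads: Iterable[float]) -> Set[str]:
    """Return the intake modes that occur across the given load averages."""
    modes = set()
    for load in loads:
        if load >= refuse_la:
            modes.add("refuse")
        elif load >= queue_la:
            modes.add("queue_only")
        else:
            modes.add("deliver_now")
    return modes


def runner_survives(peak_load: float, queue_la: float) -> bool:
    """A queue runner is terminated only when the load goes above QueueLA."""
    return peak_load <= queue_la


loads = range(0, 151)
# Hypothetical conventional setup: QueueLA slightly below RefuseLA.
conventional = reachable_modes(queue_la=8, refuse_la=12, loads=loads)
# Hypothetical raised setup: QueueLA well above RefuseLA.
raised = reachable_modes(queue_la=100, refuse_la=80, loads=loads)

print(sorted(conventional))
print(sorted(raised))
print(runner_survives(79, queue_la=8), runner_survives(79, queue_la=100))
```

In this model the queue-only band disappears once QueueLA exceeds RefuseLA: mail is either accepted with an immediate delivery attempt or refused, and a runner is not cut short by a load in the 70s.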
I think this shows two design flaws in sendmail's queueing model:
QueueLA were split into two separate options: one for how to handle incoming mail and one for when an existing queue runner should stop or pause.
LDB, dbaron@dbaron.org, 2004-07-30