
router memory



What strategy is there to handle runaway memory demands in the router?

We are currently migrating from zmailer 2.2 to 2.99.48p6. We handle about
8,000 messages a day. A couple of times a year we get a message with badly
formed RFC 822 syntax which causes the router to gobble up all available
memory (e.g. in both versions of zmailer, an address with unbalanced quotes
and brackets will do this). We have little problem with the router not
handling this kind of message: we can handle such mail manually, and
we recognize that stamping out this sort of rare problem is difficult.
But zmailer must be robust enough to handle the consequences.

The Zmailer router obtains memory using malloc from its emalloc
function.  When the default emalloc gets an error from malloc it goes
into an infinite loop of sleeping for a short period and doing the
malloc again.  This aggressive behavior, coupled with the router's
runaway hunger for memory, will bring down the whole UNIX system. (As
other UNIX processes crash due to virtual memory exhaustion, the
emalloc sleep/malloc loop grabs the memory which is released.)

Setting a virtual memory ulimit on the router processes will prevent
the whole UNIX system from dying. But now emalloc will go into an
infinite loop of failing malloc (because it is at the ulimit) and
sleeping.  This ties up a router indefinitely. In one case where the
same message was sent repeatedly, it tied up all our router processes
and stopped all mail.
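
For readers who want to set such a limit from inside a process rather than
from the shell, here is a minimal sketch using the POSIX setrlimit call.
It is not zmailer code: the function name limit_address_space is made up
for illustration, and RLIMIT_AS is assumed to be the right resource (some
older systems call it RLIMIT_VMEM or only offer RLIMIT_DATA).

```c
#include <sys/resource.h>

/*
 * Hypothetical helper, not part of zmailer: lower the soft limit on this
 * process's address space to max_bytes. The hard limit is left alone, and
 * the request is clamped so it never exceeds the hard limit.
 * Returns 0 on success, -1 on error (errno is set by the system call).
 */
int limit_address_space(rlim_t max_bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;

    /* An unprivileged process cannot raise the soft limit past the hard one. */
    if (rl.rlim_max != RLIM_INFINITY && max_bytes > rl.rlim_max)
        max_bytes = rl.rlim_max;

    rl.rlim_cur = max_bytes;
    return setrlimit(RLIMIT_AS, &rl);
}
```

Once the limit is in place, a malloc that would push the process past it
fails instead of succeeding, which is exactly the condition that sends the
default emalloc into its sleep/malloc loop.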

There is an alternate emalloc supplied with zmailer which doesn't do
the sleep/malloc loop. (That version also uses a supplied malloc rather
than the system-supplied one.) But it simply does an exit in the
emalloc when the malloc fails (due to a ulimit or due to virtual memory
exhaustion). Hence the router dies suddenly. With zmailer 2.2 the
message which caused the problem stays in the queue and zmailer won't
touch it again.  Unfortunately, unlike zmailer 2.2, when the zmailer
2.99.* router is restarted, it aggressively puts messages which were
obviously mishandled before back in the router queue (i.e. it renames the
"filename-pid" files in postoffice/router to "filename"). Hence when we
restart the router to recover the router processes that have died, the
new router processes pick up the same messages and die again.

To make the zmailer router more robust, I suggest:

- the emalloc sleep/malloc loop should be removed, or at the very least
  made an install option. If it must remain, then the loop should be
  traversed no more than 2 times.

- when a malloc fails the router should handle the situation by moving
  the message involved aside (postoffice/deferred?) for manual handling.

- when the router is restarted it should not process messages which
  were previously assigned to other routers; i.e. it should not rename
  "filename-pid" files to "filename". Doing so leads to infinite loops
  of router crashes.
  Alternatively, it should limit reprocessing such files to one attempt
  (which could be done by having the router restart rename change
  "filename-pid" to "filenameX", but never rename "filenameX-pid",
  and have the router process "filenameX" files as it
  does "filename" files.)

- it might be nice to provide a router parameter to ulimit virtual
  memory and another to set a time limit per message.
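
To illustrate the first two suggestions, here is a rough sketch of a
bounded allocator. It is not the zmailer emalloc: the name emalloc_bounded,
the EMALLOC_MAX_TRIES constant and the delay parameter are all made up for
this example. The idea is only that the loop gives up after two attempts
and returns NULL, so the caller gets a chance to set the message aside
instead of hanging or exiting.

```c
#include <stdlib.h>
#include <unistd.h>

/* Assumed cap on attempts, matching the "no more than 2 times" suggestion. */
#define EMALLOC_MAX_TRIES 2

/*
 * Hypothetical replacement for the sleep/malloc loop: try malloc at most
 * EMALLOC_MAX_TRIES times, sleeping delay_secs between attempts, and
 * return NULL if every attempt fails. The caller must check the result.
 */
void *emalloc_bounded(size_t len, unsigned int delay_secs)
{
    int attempt;

    for (attempt = 1; attempt <= EMALLOC_MAX_TRIES; attempt++) {
        void *p = malloc(len);

        if (p != NULL)
            return p;

        /* Only sleep if another attempt will follow. */
        if (attempt < EMALLOC_MAX_TRIES)
            sleep(delay_secs);
    }
    return NULL;
}
```

Existing callers of emalloc presumably assume it never returns NULL, so a
change like this would also need each caller (or a single recovery point
in the router) to handle the failure, for example by moving the current
message aside as suggested above.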

Alex Nishri
University of Toronto
email: alex.nishri@utoronto.ca