
Re: Scheduling algorithm?

On Fri, Apr 02, 2004 at 02:37:30PM +0200, Marek Kowal wrote:
> Dear all,
> Recently we observed some problems with the delivery of the messages
> directed to the SMTP channel. The problem appears on the machines that have
> about 100 000+ messages in the queue (on the scheduler side in smtp/*), but
> actually appears to affect also systems with smaller numbers of messages in
> queues. It turns out, that - probably - the problem is also linked to the
> large number of threads with destinations that time out, not only to the
> large number of messages waiting to be delivered. 
> Anyway, as a result, some customers complain that their messages have been
> kept on the server for a very long time before being pushed out by the
> system to the outside world. And THEIR destinations are responding well, it
> is just scheduler that schedules the thread to be processed not often
> enough. The systems in question process about 20-40 mails per second,
> depending on the daytime.
> Now, we managed to speed the delivery to specific destinations by creation
> of separate thread groups in scheduler.conf. This helped things a little -
> for those channels, which are explicitly mentioned in configuration. But the
> others still suffer. And I cannot afford maintaining a few hundred entries in
> scheduler.conf - just the maintenance would take ages. At the moment, I
> observe on our system about 7-10 k threads. The number of TAs is set to 800.
> Increasing the number of threads probably will help, but the machine limits
> will be hit soon, if I'll continue in this direction.

You should look at how many threads sit in each wait state:
   mailq -Q | egrep "^[a-z]|^\t| pend="

If the QA= values look old, that particular group needs more
transport agents: raise the MaxRing= parameter in that selector
clause. Threads in the W=nnn state (wait for nnn seconds) have been
tried and pushed back into timed retry.

I run vger.kernel.org with only slightly enlarged values for several
selector clauses.  There are about 3 000 different recipient domains,
and they get 250-350 emails per day each. They are handled with a pool
of only 200 transport agents.

> So, in order to really solve the problem, I need to know the exact
> scheduling algorithm and understand the real reason of the delays.
> But this is not very well described in the docs and leaves a lot of
> room for questions. The only sentence that gives some hint about the
> real internals of the scheduler is:
> "Also in normal cases, when messages in given thread are all delivered, or
> otherwise determined that nothing can be done, the transport agent can be
> switched to another thread within the same thread group."
> So, let us consider our setup. For simplicity, let us assume that all
> threads are in one thread group smtp/*, and that the MaxTHR is 1. The
> scheduler probably assigns the 800 TAs to the first 800 threads. Now,
> in accordance with the sentence from the docs the TAs will consume
> the messages until all are sent out, and then switch to the next 800
> threads. And so on.  Is that correct? 

No, not quite.  The system observes that there are, say, 1000 threads
to process.  The first 800 get processing programs; the rest end up in
one or another kind of resource wait (MaxTA, MaxChannel, MaxRing, MaxThr).

When one thread becomes completed, the transport agent will be switched
to another thread within its selector group ("ring") when such is 
available, and not in "wait-until" state.  If no new thread can be found, 
the transport agent will be placed into an 'idle' state.  
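
The switching rule above can be sketched roughly as follows. This is
only an illustration, not the scheduler's actual code: the Thread
fields and the function name are made up for the example, and the
ring is modelled as a plain list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    name: str
    wait_until: float = 0.0  # hypothetical: earliest time a retry is allowed
    running: bool = False    # True while a transport agent works on it
    pending: int = 0         # messages waiting in this thread


def next_thread_for_agent(ring: list, start: int, now: float) -> Optional[Thread]:
    """Walk the ring once from `start` and return the first thread that has
    work, is not being processed, and is not in a "wait-until" state."""
    n = len(ring)
    for step in range(n):
        t = ring[(start + step) % n]
        if t.pending > 0 and not t.running and t.wait_until <= now:
            return t
    return None  # nothing runnable: the agent goes into the idle state
```

A return value of None corresponds to the transport agent being
placed into the 'idle' state.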

A thread starts when:
 - A new message arrives into the thread (unless it is
   resource starved)
 - Its "wait until" timer expires and resources are available
   (the queue is kept in time order by the "wakeup" timer)
 - A transport agent becomes free and looks for a new job
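
The time-ordered "wait until" queue from the second item could look
something like this. It is a minimal sketch assuming a binary heap
keyed on the wakeup time; the class and method names are invented
for the example and say nothing about how the scheduler stores it.

```python
import heapq


class WakeupQueue:
    """Threads in "wait until" state, ordered by wakeup time."""

    def __init__(self):
        self._heap = []  # entries are (wakeup_time, sequence, thread_name)
        self._seq = 0    # tie-breaker: equal times pop in insertion order

    def push(self, wakeup_time, thread_name):
        heapq.heappush(self._heap, (wakeup_time, self._seq, thread_name))
        self._seq += 1

    def pop_due(self, now, free_agents):
        """Return up to `free_agents` thread names whose timer has expired."""
        due = []
        while self._heap and len(due) < free_agents and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due
```

The free_agents limit stands in for "resources available": an expired
thread stays queued until some agent can take it.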

>  (this is first problem: how are threads within thread group ordered,
> when deciding whom to process next)?

More like RR (Round Robin): the threads within a selector group
are in a circular linked list.  New entries are added at the
"tail" of the ring ( = previous of current ), but the location
of "current" also keeps changing.
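
To make the "tail = previous of current" rule concrete, here is a
small sketch of such a ring as a circular doubly linked list.  The
Node and Ring classes are illustrative only and are not taken from
the scheduler sources.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.prev = self  # a single node points at itself
        self.next = self


class Ring:
    def __init__(self):
        self.current = None

    def add(self, name):
        """Insert a new thread just before `current`, i.e. at the tail."""
        node = Node(name)
        if self.current is None:
            self.current = node
            return
        tail = self.current.prev
        tail.next = node
        node.prev = tail
        node.next = self.current
        self.current.prev = node

    def advance(self):
        """Move `current` one step around the ring."""
        if self.current is not None:
            self.current = self.current.next

    def order(self):
        """Thread names in visiting order, starting at `current`."""
        out = []
        node = self.current
        while node is not None:
            out.append(node.name)
            node = node.next
            if node is self.current:
                break
        return out
```

Because "current" moves, a thread added later lands behind whatever
is current at that moment, so it is visited last in that round.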

> What if one of the threads contains 10k messages? Will it attempt
> deliver all 10k at one go?

Yes, unless the transport-agent decides to abandon the attempt.

> How does scheduler treat the messages, that appeared in actually 
> processed thread while it was already being processed? Are they
> sent out in the same iteration, or are they left for the next
> iteration?

They will be sent out in the same transport agent run.

> Also, will scheduler return to the first thread only after all other 7k-799
> threads have been scanned (sort of FIFO?). That would explain, why I am
> seeing such delays - it probably takes several minutes to scan the whole set
> of threads and return to the first one.

Essentially yes (and it is round-robin).
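
As a back-of-the-envelope check of that delay, assume every thread
has work and every visit takes the same time.  The 30 seconds per
visit below is a made-up figure for illustration, not a measurement.

```python
def revisit_estimate(threads, agents, seconds_per_visit):
    """Rough time (seconds) for a round-robin ring to return to one thread,
    assuming all threads have work and each visit takes the same time."""
    return threads * seconds_per_visit / agents


# 8000 threads and 800 transport agents, hypothetical 30 s per visit
print(revisit_estimate(8000, 800, 30.0))
```

With those assumed numbers the ring needs about five minutes per
round, and threads whose destinations time out stretch the average
visit time further.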

> Also, under such heavy loads, what is (if there is any) the meaning of the
> interval configuration parameter? Also, what happens if the destination of
> the thread is not responding, and thread has many messages waiting, and new
> appear at constant rate? Will scheduler give up the whole thread after few
> first tries on first few messages and switch to the next thread, or will it
> keep on trying each of the messages in the queue in accordance with
> retries(....) specification? If so, when will it give up and for how long?

Job feeding uses a "slow start".  At first only one job is fed, and
if it succeeds (= does not "retry"), the flood gates open.
This way lots of messages to a non-responding destination won't
take much effort before going back into the "wait until" queue.
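
In outline, the slow start behaves like the sketch below.  It is a
simplification: deliver() is a hypothetical callback returning "ok"
or "retry", and jobs are tried one by one after the probe, whereas
the real scheduler feeds job specifiers to a transport agent.

```python
def feed_thread(jobs, deliver):
    """Feed jobs "slow start" style; return (delivered, deferred)."""
    if not jobs:
        return [], []
    first, rest = jobs[0], jobs[1:]
    if deliver(first) == "retry":
        # The probe failed: defer the whole thread without trying the rest.
        return [], list(jobs)
    delivered, deferred = [first], []
    for job in rest:  # the probe succeeded, so the flood gates open
        if deliver(job) == "ok":
            delivered.append(job)
        else:
            deferred.append(job)
    return delivered, deferred
```

A dead destination thus costs one attempt per round instead of one
attempt per queued message.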

When a transport agent later decides to retry, it repeats
the retry response for as many messages as there are overfed
job specifiers.  That results in a somewhat dubious, extreme
delay pushback...

> This is partly covered in the docs
> http://zmailer.org/zman/zadm-scheduler.shtml, but it is not clear to me. If
> 4000 of the destinations are not responding (this is what basically happens,
> they usually time out, and as a result more than 100k messages cannot be
> delivered and after few days they will be returned to the senders), what
> will the scheduler do? Retry each message?

Retry each queue.  If a message that has had delivery attempts
times out, it is returned at 'expiry=' age.  A message without any
delivery attempts is returned anyway at 'expiry2=' age.
If no 'expiry2=' is set, that happens 24 hours after the
normal expiry.
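
Expressed as a small function, the return rule reads roughly as
follows.  The function and argument names are made up; only the
'expiry=' / 'expiry2=' semantics and the 24 hour default come from
the description above.  All values are in seconds.

```python
DAY = 24 * 3600


def should_return(age, attempts, expiry, expiry2=None):
    """Decide whether a message of the given age goes back to its sender."""
    if expiry2 is None:
        expiry2 = expiry + DAY  # default: 24 hours after normal expiry
    if attempts > 0:
        return age >= expiry    # it was tried, so normal expiry applies
    return age >= expiry2       # never tried: the later limit applies
```

So with a three day expiry and no 'expiry2=' set, a message that was
never attempted is returned once it is four days old.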

> And now two questions related to such setups:
> 1. how do you handle such heavy loads? I mean, some of you must have
> experienced the same problems. 

By selectively choosing where to place resources.
Filesystems need to be fast, yet secure; the underlying disks
need to supply lots of IOPS, etc.  RAID 1+0 (mirror of
underlying striped disksets) usually gives the best speed.

> 2. the startup phase. it takes about one hour to scan the control
> files after the restart of the scheduler - it opens all those control
> files and there are lots of them. any ideas how to speed things up?
> (apart from buying better hardware, that is ;-)))

By default the scheduler startup is synchronous (the -S option on
its command line).  If you make it non-synchronous, the queue
assimilation will take forever, but activity does start right away.

A bunch of small but fast disks gives the best performance.

While I use journaled filesystems in everything I do, they
are not very fast when the journal lives on the same disk(s)
as the main data.  Also, some systems have additional
operating modes in their journaling.

Not using a journal risks losing filesystem
coherence in case of system failure ... but e.g. Linux ext2
in full async mode is blazing fast...

I have even considered purchasing "Rocket Drive" RAM cards to
use as journal devices (with battery backup, of course).
They cost around USD 1000 per gigabyte.

The postoffice spool does not need to have atime update, that
is, it can use 'noatime' mount option, if such is available.
Every little optimization...

> Cheers,
> Marek

/Matti Aarnio	<mea@nic.funet.fi>