[Raw Msg Headers][Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

duplicate delivery with Solaris



Synopsis: Two (concurrent) router processes work on same message with Solaris
Zmailer: 2.99.48p6 but probably other versions as well.
OS:	Solaris 2.5 but probably Solaris versions as noted below.

Symptoms:

Message in the postoffice/deferred queue raises alarm bells.

$ ls -li /var/spool/postoffice/deferred/*
  81067 -rw-r----- 2 daemon root 3927 Nov 30 15:33 /var/spool/postoffice/deferred/81067.ctrlfile

It's also linked into the scheduler's queue directory

$ ls -li /var/spool/postoffice/queue/81067*
  81067 -rw-r----- 2 daemon root 3927 Nov 30 15:33 /var/spool/postoffice/queue/81067-17317
  
Smtpserver log says message come in via smtp connection started 15:33:12,

18461w  250 2.6.0 S.oUQq781067 message accepted
18461#  S.oUQq781067: 3927 bytes
  
Router log shows,

<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>: address: validfrom@h-net.msu.edu
<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>: file: 81067-17327 <validfrom@h-net.msu.edu>...
<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>: file: 81067-17327 <validfrom@h-net.msu.edu> =>  <validto@UTORONTO.CA>
<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>: address: validto@UTORONTO.CA

then later

<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>: address: validfrom@h-net.msu.edu
<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>: file: 81067-17317 <validfrom@h-net.msu.edu>...
<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>: file: 81067-17317 <validfrom@h-net.msu.edu> =>  <validto@UTORONTO.CA>
<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>: address: validto@UTORONTO.CA
router: rename(81067-17317,../queue/81067-17317): Device busy

Looking at the mail log we have,

Nov 30 15:33:13 thismachine router[17327]: S.oUQq781067: from=<validfrom@h-net.msu.edu>, rrelay=NO-IDENT-SERVICE@h-net.hst.msu.edu (port 1138) ([35.8.2.57]) size=3927, nrcpts=1, msgid=<Pine.LNX.3.96.971130153353.22192A-100000@h-net2.msu.edu>
Nov 30 15:33:14 thismachine smtp[1453]: S.oUQq781067: to=<validto@othermachine.utoronto.ca>, delay=00:00:01, xdelay=00:00:00, mailer=smtp, relay=othermachine.utoronto.ca ([128.100.xx.xx]), stat=ok2 250 Roger
Nov 30 15:33:14 thismachine router[17317]: 81067-17317; routing deferred


>From these logs we conclude the same message was picked up by two routers
processes, first by router[17327], then by router[17317]. Both routers have
the same parent (from our regular monitoring "ps" logs) and both were up for
periods before and after this incident (ie this is not a race condition
during router startup).

Router[17327] finished with this message at 15:33:13 and then an smtp
transport successfully sent the message finishing at 15:33:14.  We found no
control file for message 81067 and suspect it's because the message was
delivered.

Router[17317] was the second router to try to process the same message. It
said, "router: rename(81067-17317,../queue/81067-17317): Device busy" and
left a link to the message in deferred/81067.ctrlfile.

Why "Device busy" from rename(81067-17317,../queue/81067-17317)?  According
to Sun's tech database, unlink() returns EBUSY when 2 threads concurrently
unlink hardlinks to same inode. The description says this has been reproduced
with Solaris 2.3, 2.4, & 2.5, but not SunOS 4.1.3_U1 (probably because the
kernel for the later is single threaded).  We don't know about Solaris 2.6.
Our tests indicate that rename() has the same problem as unlink().
Reference:
Bug 1212953:
http://sunsolve.sun.com/private-cgi/us/doc2html?bug/kernel/ufs/1212953+88137433122716

Initial Hypothesis:

There are two problems here,

1) How did two routers end up processing the same message?
2) The first router obviously succeeded in relinking the message into ../queue.
   Why did the second router fail in doing this?
   
Starting with problem #2 first, hypothesis is that
rename(81067-17317,../queue/81067-17317) in router[17317] failed with
EBUSY because scheduler was concurrently doing
unlink(queue/81067-17327) where both links are actually to the same
file. (Remember the first router was finished with the message and the
smtp transport was just finishing up as the second router picked up the
message.)

Now for our hypothesis for Problem #1. Router[17327] and router[17317] both
look in postoffice/router directory for work.  Both routers select the same
message in their rd_stability or rd_instability routine and call rd_doit.
There both routers call eqrename to rename the file; router[17327] wants to
rename the file 81067->81067-17327 while router[17317] wants to rename the
file 81067->81067-17317. The result is as follows:

1. Router[17327] sees 81067 and wants to take it for processing
2. Router[17317] sees 81067 and wants to take it for processing
3. Router[17327] links(81067,81067-17327)
4. Router[17317] links(81067,81067-17317)
5. Router[17327] unlinks(81067) and starts to process 81067-17327
6. Router[17317] fails to unlink(81067) so it correctly realizes that
                 another router has grabbed the message. It then does an
                 unlink(81067-17317) but that can fail with an EBUSY
                 (as per Sun's tech database item above)--when it does
                 file 81067-17317 will be left around. In either case
                 router[17317] knows that another router has this message
                 and will not be processing it.

So the result is that file 81067-17317 will be left around.

Now other routers don't touch file 81067-17317 because they see that
Router[17317] is still alive (in their rd_doit routine). But later
Router[17317], when looking for new work, finds 81067-17317, recognizes it
already owns it, knows it doesn't need to rename as above, and starts to
process it--this is the incorrect second processing of the same message.
(Should Router[17317] die before processing it, another process will
notice this, repossess the message, and start processing it--this will
then be the incorrect second processing of the same message.)

Does this make sense?

This message had problems #1 & #2, each of which depends on getting an EBUSY.
The probability of getting EBUSY must be low. So what if a message only saw
EBUSY as in problem #1, but not in #2? Then it would be processed twice by
separate routers, and get delivered twice. We would expect the probability of
delivering messages twice should be higher than getting the messages in the
postoffice/deferred directory as above.

Looking at the the logs for the same day, of approximately 24,000 messages
handled 24 spoolids (tsapid) appear twice in router records. I examined a
handful. In each case a message arrives on the smtpserver, gets processed by
two routers, and gets delivered twice.

In the same time period one message ended up in postoffice/deferred
presumably because of the lower probability of getting the EBUSY twice, once
in #1 and once in #2.

Peter Ip/Michael Simms/Alex Nishri
Network Services
University of Toronto