
Multiple Routers Processing A Single Message



We're running version 2.99.38 on IRIX 5.3 systems (using EFS filesystems)
with NROUTERS set to 10.  The piece of code that I think is relevant, in
router/functions.c, hasn't changed as of 2.99.46, though.

I've witnessed instances of multiple routers each apparently managing to
rename the same queue file and then processing it (redundantly), resulting
in the mail being delivered as multiple discrete messages.

As an example, the following entries (split and wrapped for readability)
from the router log show three different routers grabbing the same message
and processing it independently (right?):

<97Feb6.051600-0500_est.37331-2693+78@summit.request.net>: 
	file: 37331-2693 statistics@request.com => webmaster@bakersound.com
<97Feb6.051600-0500_est.37331-2693+78@summit.request.net>: 
	address: webmaster@bakersound.com
<97Feb6.051600-0500_est.37331-2693+78@summit.request.net>: 
	recipient: cdsmail bakersound.com webmaster@bakersound.com
<97Feb6.051601-0500_est.37331-2695+60@summit.request.net>: 
	file: 37331-2695 statistics@request.com => webmaster@bakersound.com
<97Feb6.051601-0500_est.37331-2695+60@summit.request.net>: 
	address: webmaster@bakersound.com
<97Feb6.051601-0500_est.37331-2688+65@summit.request.net>: 
	file: 37331-2688 statistics@request.com => webmaster@bakersound.com
<97Feb6.051601-0500_est.37331-2688+65@summit.request.net>: 
	address: webmaster@bakersound.com
<97Feb6.051601-0500_est.37331-2688+65@summit.request.net>: 
	recipient: cdsmail bakersound.com webmaster@bakersound.com
<97Feb6.051601-0500_est.37331-2695+60@summit.request.net>: 
	recipient: cdsmail bakersound.com webmaster@bakersound.com

I checked, and HAVE_RENAME is set to "1" in config.h.  I even traced a
router process (using the IRIX utility "par"), and it is indeed rename()ing
the files:

    0mS    router( 5519): END-sginap() = 0
    0mS    router( 5519): open(., O_RDONLY, 0) = 5
    1mS    router( 5519): fcntl(5, F_SETFD, 1) OK
    1mS    router( 5519): fstat(5, 0x7fff2ae0) OK
    1mS    router( 5519): getdents(5, 0x1004758c, 4096) = 4092
    2mS    router( 5519): stat(39208, 0x7fff2ae0) OK
    2mS    router( 5519): stat(39208, 0x7fff2608) OK
    2mS    router( 5519): rename(39208, 39208-5519) OK
   35mS    router( 5519): open(39208-5519, O_RDONLY, 0666) = 6

The IRIX man page for rename(2) doesn't actually say that it's an atomic
operation.  I'm thinking that maybe it's not. :-(
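
One way to probe this on a given filesystem is to race several processes for
the same file and count how many of the rename() calls report success.  The
following is only a rough sketch under my own assumptions:
count_rename_winners() and its "gate" pipe are made-up names for
illustration and are not part of ZMailer.  Each child waits on the pipe so
that all of them call rename() at roughly the same moment.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical test helper (not ZMailer code): create `path`, fork `nproc`
 * children that each try to rename() it to "<path>-<pid>", and return how
 * many of them succeeded.  Returns -1 if the test could not be set up.
 */
int count_rename_winners(const char *path, int nproc)
{
    int gate[2], fd, i, status, started = 0, winners = 0;
    char c;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    close(fd);
    if (pipe(gate) < 0) {
        unlink(path);
        return -1;
    }

    for (i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            char target[1024];

            /* Block until the parent closes the write end (EOF). */
            close(gate[1]);
            if (read(gate[0], &c, 1) < 0)
                _exit(2);
            snprintf(target, sizeof target, "%s-%ld", path, (long)getpid());
            if (rename(path, target) == 0) {
                unlink(target);     /* we won: clean up our private copy */
                _exit(0);
            }
            _exit(1);               /* somebody else got there first */
        }
        started++;
    }

    close(gate[0]);
    close(gate[1]);                 /* releases all children at once */

    while (started > 0) {
        if (wait(&status) < 0)
            break;
        started--;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            winners++;
    }
    unlink(path);                   /* in case nobody claimed it */
    return winners;
}
```

If rename() is atomic, count_rename_winners(path, 10) should return 1 for a
scratch file in the queue directory.  A larger number would support the
suspicion above, although a single clean run proves nothing either way.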

Would it perhaps be appropriate to put some sort of locking around the call
to eqrename() to force single-threading in that section?  It seems like the
proper paranoid thing to do anyhow... especially for systems that don't have
rename() and must therefore make do with a link() followed by an unlink().
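
To make the link()/unlink() concern concrete: two routers using different
target names can both succeed at link(), so the only step that can pick a
single winner is the unlink() of the shared name.  Here is a minimal sketch
of a claim routine built that way.  claim_by_link() is a hypothetical helper
of my own and not what eqrename() actually does; it assumes unlink() itself
is atomic and that queue file names are never reused while routers are
looking at them.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper, not ZMailer's eqrename(): claim `from` by giving it
 * the private name `to`, using only link() and unlink().
 * Returns 0 if this process now owns the file as `to`, -1 otherwise.
 */
int claim_by_link(const char *from, const char *to)
{
    if (link(from, to) < 0)
        return -1;          /* source already gone, or `to` already exists */

    if (unlink(from) < 0) {
        /*
         * Another process removed the shared name first, so it owns the
         * message.  Drop our extra link and back off, keeping the errno
         * from the failed unlink() for the caller.
         */
        int saved = errno;

        unlink(to);
        errno = saved;
        return -1;
    }
    return 0;               /* we removed the shared name: the file is ours */
}
```

The weak spot is a process that dies between the link() and the unlink(),
which leaves the message under two names until something cleans it up.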

I'm trying this approach (using lockf() on the original file, which I know
isn't portable) just to see if it works.  I've attached a context diff (just
cut-n-pasted, so any TABs may be destroyed) to router/functions.c in 2.99.46
below just so you can see what I'm talking about.

-Andy

==============================================================================

*** functions.c.orig    Thu Feb  6 12:18:00 1997
--- functions.c Fri Feb  7 02:17:41 1997
***************
*** 596,601 ****
--- 596,602 ----
        char *sh_memlevel = getlevel(MEM_SHCMD);
        int thatpid;
        struct stat stbuf;
+       int lockfd;

        *pathbuf = 0;
        if (*dirs) {    /* If it is in alternate dir, move to primary one,
***************
*** 656,664 ****
--- 657,680 ----

          sprintf(buf, "%ld-%d", (long)stbuf.st_ino, router_id);

+           /*
+          * we lock it before the rename()... late-comers will be blocked on
+          * the lock until it's too late and the file is long gone
+          */
+           if ((lockfd = open(pathbuf, O_RDWR, 0)) < 0)
+               return 0;       /* no descriptor to close */
+           if (lockf(lockfd, F_LOCK, 0)) {
+               close(lockfd);
+               return 0;
+         }
+ 
          if (eqrename(pathbuf, buf, 1) < 0)
+         {
+           close(lockfd);
            return 0;           /* something is wrong, erename() complains.
                                   (some other process picked it ?) */
+         }
+         close(lockfd);
          filename = buf;
          /* message file is now "file-#" and belongs to this process */
        }