
Re: Multiple Routers Processing A Single Message



On Fri, 7 Feb 1997, Matti Aarnio wrote:
> > We're running version 2.99.38 on IRIX 5.3 systems (using EFS filesystems)
> > with NROUTERS set to 10.  The (I think) relevant piece of code in
> > router/functions.c hasn't changed as of 2.99.46 though.
> 
> 	It hasn't.  I am more and more convinced that it should
> 	revert back to the old  link()/unlink() method, which
> 	behaviour apparently can be made completely deterministic,
> 	and most definitely it will not behave in this SysVr4
> 	manner that we so much see...

Just to get a sense of history, I retrieved and looked at the version 2.2.1
router code and it's essentially identical, except that it renames files to
filename(router_number).  


> > The IRIX man page for rename(2) doesn't actually say that it's an atomic
> > operation.  I'm thinking that maybe it's not. :-(
> 
> 	Yes, if it yields an error, the target name may already exist,
> 	but the origin has not yet been removed...
> 	Umm.. EBUSY error it was ? (At least with Solaris)
[...]
> > Would it perhaps be appropriate to put some sort of locking around the call
> > to eqrename() to force single-threading in that section?  It seems like the
> > proper paranoid thing to do anyhow... especially for systems that don't have
> > rename() and must therefore make do with a link() followed by an unlink().
> 
> 	Perhaps not -- those are safe from this rename() fallacy  :-(

Just in case you didn't catch it (in retrospect I wasn't very explicit), the
problem is that while one router process is trying to rename() (or
link/unlink) a file, one or more other router processes can also rename() or
link() it, so that more than one router process thinks it has an exclusive
copy.

I looked at the alternative link()/unlink() code and it is even more
vulnerable to the same race condition.  Between the link() and unlink(),
another router process could also perform a link() (to a different filename).
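
To make the link()/unlink() window concrete, here is a minimal sketch (not
the router code; the helper names and the ".<router_number>" suffix are made
up for illustration).  Two claims on the same message file both succeed,
which is what two routers would each see before either one reached its
unlink().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: the first half of a link()/unlink() "rename".
 * Builds "<src>.<router>" in dst and links the message to that name.
 * Returns 0 if the new name was created, -1 on error. */
int claim_by_link(const char *src, int router, char *dst, size_t dstlen)
{
    assert(src != NULL && dst != NULL);
    int n = snprintf(dst, dstlen, "%s.%d", src, router);
    if (n < 0 || (size_t)n >= dstlen)
        return -1;              /* name did not fit in the buffer */
    return link(src, dst);      /* succeeds as long as dst does not exist */
}

/* Number of names currently pointing at the file, or -1 on error. */
long link_count(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}
```

If claim_by_link() is called for routers 1 and 2 on the same file before
either unlinks the original, both calls return 0 and link_count() on the
original reports 3.  Nothing in link() itself tells the second caller that
somebody else got there first.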

I just cannot see any way around using some locking method to prevent
rename()/link() races between competing router processes.  As I see
it, you have a choice of forcibly single-threading the entire directory
scanning process (sub-optimal) or forcibly single-threading just the little
section of code that renames the file.

If you don't like the idea of locking the msg file itself, then perhaps some
"central" synchronization method could be used.  E.g. have the router
processes lock a single (always present) file in that (or some other)
directory.  I'm trying to consider portability here... a semaphore would
probably be the best solution, but who wants to deal with portably using
semaphores...
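
A rough sketch of the "central" lock file idea, assuming lockf() is
available (so it shares the portability problem).  The function name and the
lock file path are hypothetical, and the sketch creates the lock file if it
is missing rather than requiring it to exist already.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Rename src to dst while holding an exclusive lock on one shared lock
 * file, so only one router at a time is inside the rename section.
 * Returns 0 if this process got the message, -1 otherwise with errno set
 * (ENOENT from rename() means another router already took the file). */
int claim_with_central_lock(const char *lockpath,
                            const char *src, const char *dst)
{
    assert(lockpath != NULL && src != NULL && dst != NULL);

    int fd = open(lockpath, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Block until no other process holds the lock (length 0 = whole file). */
    if (lockf(fd, F_LOCK, 0) != 0) {
        int e = errno;
        close(fd);
        errno = e;
        return -1;
    }

    int rc = rename(src, dst);
    int saved = errno;          /* keep rename()'s errno across cleanup */

    lockf(fd, F_ULOCK, 0);
    close(fd);
    errno = saved;
    return rc;
}
```

Only the rename is serialized here; the directory scan itself still runs in
parallel, which is the less drastic of the two choices above.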


> > I'm trying this approach (using lockf() on the original file, which I know
> > isn't portable) just to see if it works.

FYI, I have received no further complaints of redundant messages since
adding the locking code, but I won't feel safe for a couple more days.
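
For reference, locking the message file itself could look roughly like the
sketch below.  This is not the actual change (which isn't shown here); the
function name is made up, and the inode re-check after taking the lock is my
own assumption about what is needed to cover the gap between open() and
lockf().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Lock the message file, then rename it.  Returns 0 if this process
 * claimed the message, -1 otherwise with errno set. */
int claim_with_file_lock(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    int fd = open(src, O_RDWR);     /* lockf() needs a writable descriptor */
    if (fd < 0)
        return -1;                  /* ENOENT: the file is already gone */

    /* Non-blocking: if another router holds the lock, skip this file. */
    if (lockf(fd, F_TLOCK, 0) != 0) {
        int e = errno;
        close(fd);
        errno = e;
        return -1;
    }

    /* The file may have been renamed between open() and lockf(), so
     * check that src still names the file we actually locked. */
    struct stat by_fd, by_name;
    int rc = -1;
    int saved = ENOENT;
    if (fstat(fd, &by_fd) == 0 && stat(src, &by_name) == 0 &&
        by_fd.st_dev == by_name.st_dev && by_fd.st_ino == by_name.st_ino) {
        rc = rename(src, dst);
        saved = errno;
    }

    close(fd);                      /* closing the descriptor drops the lock */
    errno = saved;
    return rc;
}
```

A router that loses the race sees either ENOENT from open() or a failed
F_TLOCK, and simply moves on to the next file in the directory.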

-Andy