
Re: mea's dump version 2.98 is available..




David,

Yes, I have seen this happening.  The problem is in data timeouts
and child deaths.

The server needs to collect child status WITHOUT hanging in wait().
As the code is written, with USE_UNIONWAIT not defined for Solaris,
a plain wait() is compiled in, with no WNOHANG option.
This is the cause of the problem and of another related one.

A rough description of what I think happens is the following:
A child is forked to serve a connection.  This first connection
will later time out.  In the meantime a second connection is
made, is completely served, and its child exits normally.  Now the
problem starts.  When the second child exits, the server catches
SIGCHLD normally and the reaper() function is called.  The
reaper() calls wait() to get the child status, but it calls wait()
in blocking (HANG) mode rather than, say, calling waitpid() with
WNOHANG (my fix).
It gets the status of the finished process, but because it is in
a while loop AND there is another child still running, it waits!
Also, upon entry to reaper() after catching SIGCHLD, the handler
is not re-installed to catch (or ignore) SIGCHLD again until the
END of the reaper function - this is still a potential problem.
Continuing... when the first child times out and exits, something
seems to happen in Solaris 2.3 (only with some patches???) that causes
the parent to ?get another signal? after the 2nd wait() has completed
for the first (timed out) process.

Calling wait() without WNOHANG causes a second, related problem:
while the server is blocked waiting for a child it will
NOT accept any new connections because, of course,
it is stuck in the wait() call....  oops :^).
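
The difference between the two modes can be shown in isolation.  The
following is only a minimal sketch, not code from smtpserver.c: the
names reap_finished_children() and spawn_child() are made up for
illustration.  It assumes a POSIX system where <sys/wait.h> provides
waitpid() and WNOHANG.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: collect every child that has ALREADY exited.
 * With WNOHANG, waitpid() returns 0 when children exist but none has
 * exited yet, and -1 when there are no children at all, so the loop
 * ends instead of blocking on a child that is still running. */
static int reap_finished_children(void)
{
    int status;
    int reaped = 0;

    while (waitpid((pid_t)-1, &status, WNOHANG) > 0)
        reaped++;
    return reaped;
}

/* Hypothetical helper: fork a child that "serves" for the given
 * number of seconds and then exits normally. */
static pid_t spawn_child(unsigned int seconds)
{
    pid_t pid = fork();

    if (pid == 0) {
        sleep(seconds);
        _exit(0);
    }
    return pid;   /* child's pid in the parent, -1 on fork failure */
}
```

With one slow child still running and one fast child already finished,
reap_finished_children() collects the fast one and returns at once.
A loop of plain wait() calls in the same situation would collect the
fast child and then sit in the second wait() until the slow child
exits, which is the hang described above.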

The major part of the fix involved changes to smtpserver.c.
I removed the #ifdef's from around the #include <sys/wait.h> at
the top of smtpserver.c, which originally read:
#ifdef  USE_UNIONWAIT
#include <sys/wait.h>
#endif  /* USE_UNIONWAIT */

And I changed the reaper function by adding the lines marked
below with ">> ":
 
static SIGNAL_TYPE
reaper( int sig )
{
#ifdef  USE_UNIONWAIT
        union wait status;
#else   /* !USE_UNIONWAIT */
        int status;
#endif  /* USE_UNIONWAIT */
 
>> #if !defined(USE_UNIONWAIT) && defined(WNOHANG)
>>         while ( waitpid( (pid_t)-1, &status, WNOHANG ) > 0 )
>> #else
#ifdef  WNOHANG
        while (wait3(&status, WNOHANG, (struct rusage *)NULL) > 0)
#else   /* !WNOHANG */
        while (wait(&status) > 0)
#endif  /* WNOHANG */
>> #endif  /* !defined(USE_UNIONWAIT) && defined(WNOHANG) */
                continue;
        signal(SIGCHLD, reaper);
}
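
The remaining worry mentioned earlier, that the handler is only
re-installed at the END of reaper(), can be avoided on systems that
have sigaction().  This is a different technique from the patch above,
not something taken from smtpserver.c, and it is only a sketch: the
names install_reaper() and children_reaped are made up, and it assumes
POSIX sigaction() with SA_RESTART and SA_NOCLDSTOP is available.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical counter so the main loop can see how many children
 * the handler has collected so far. */
static volatile sig_atomic_t children_reaped = 0;

static void reaper(int sig)
{
    int saved_errno = errno;   /* waitpid() may overwrite errno */
    int status;

    (void)sig;
    /* Non-blocking: stop as soon as no exited child is left. */
    while (waitpid((pid_t)-1, &status, WNOHANG) > 0)
        children_reaped++;
    errno = saved_errno;
    /* No signal(SIGCHLD, reaper) here: a handler installed with
     * sigaction() stays installed unless SA_RESETHAND is given. */
}

/* Hypothetical helper: install reaper() once at startup.
 * Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_reaper(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = reaper;
    sigemptyset(&sa.sa_mask);
    /* SA_RESTART: restart interrupted system calls where possible.
     * SA_NOCLDSTOP: no SIGCHLD when a child is merely stopped. */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

Calling install_reaper() once before the accept loop would take the
place of both the initial signal(SIGCHLD, reaper) call and the one at
the end of the handler.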
 
I have other bug fixes for 2.97 (and presumably 2.98 as well)
but I just don't have the time to post them all or send them to Matti
at present.  Plus our code has some extensions, etc., and I have hacked
it a bit, so it might not be as portable/compatible (BSD) as it used to be.

Cheers,

Chris Healey
Sr. Engineer
Global Village Communications
(408) 523-2072

> 
> 
> > This 2.98 dump is tested (to a certain extent) with
> > 	Solaris 2.3
> > 	SunOS 4.1.3
> > 	Linux (1.1.72)
> > and apparently it works...
> 
> I am noticing on our Solaris machine that the
> smtpserver just disappears (no core, nor syslog error messages) from time to
> time. It is seemingly a random occurrence but happens often enough (about 4x
> a week) that it has me wondering if anybody else has seen this happening and/or
> if this is a known bug that is fixed in 2.98.
> 
> We are currently running 2.97
> 
> David M. Anthony, Sr. Network Specialist          danthony@onet.on.ca
> ONet Networking Inc.
> 4 Bancroft Ave., Rm 101                           Thought provoking saying
> Toronto, ON, M5S 1A1                              still being developed...
> 
