Re: SMTP transport hangs



> Matti,
> 
> I have been experiencing this problem for a *long* time, but as it
> happens rarely, it did not bother me much until recently.
> 
> Sometimes an smtp transport process hangs and stays in the process list
> for a long time (days).  This effectively stops the queue for the domain
> it is contacting.  The process uses zero CPU and leaves *no* traces in
> the smtp log.  Netstat does not show any open connection on the receiving
> end (and I cannot check if there is a connection associated with this
> process on the originating end).
> 
> Today I caught such a process and killed it with signal 11 (with -QUIT,
> it leaves no core file?).  Zmailer 2.99.47 + all patches, SPARC Solaris 2.5.1.
> This is what gdb shows:

	If you started the scheduler with (a fairly usual default)
		ulimit -c 0
	then the scheduler's children (smtp among them) will not dump core.
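
One way to get a core file from a single transport agent without changing how the scheduler is started is to raise the soft core limit from inside the process. The following is only a minimal sketch, assuming a POSIX system; `enable_core_dumps` is a hypothetical helper and not part of ZMailer.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 if getrlimit() or setrlimit() fails. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* An unprivileged process may raise the soft limit only up to the hard limit. */
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

If the hard limit was lowered to 0 as well (many shells do this for a plain `ulimit -c 0`), the call succeeds but still leaves core dumps disabled; in that case the limit has to be changed where the scheduler is started.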

	However, the trace you present below looks familiar, and
	I probably did fix it:  see "patch1c" at my workstation
	("patch1", version "c", on top of the 2.99.48 package there).


	There were several bugs there:  memory leaks, and writing past
	the end of an array (the last of which can cause mysterious errors).


	Your debugger does not list symbols from within the dynamic
	libraries.  Have you tried gdb 4.16?

> Core was generated by `/usr/zmailer/bin/ta/smtp -s8H -l /var/log/zmailer/smtp'.
> Program terminated with signal 11, Segmentation fault.
> procfs (find_procinfo):  Couldn't locate pid 0
> #0  0xef636d58 in _end ()
> (gdb) bt
> #0  0xef636d58 in _end ()
> #1  0xef6732cc in _end ()
> #2  0xef6475f8 in _end ()
> #3  0xef6f2158 in _end ()
> #4  0xef6d77c8 in _end ()
> #5  0xef6f1bb8 in _end ()
> #6  0x1fec0 in stachmyaddress (host=0x44fc7 "koi.smtp.online.ru")
>     at selfaddrs.c:296
> #7  0x2011c in stachmyaddresses (host=0x44fd9 "") at selfaddrs.c:420
> #8  0x16604 in smtpconn (SS=0xeffff998, host=0x454b0 "office.sob.tulane.edu", 
>     noMX=0) at smtp.c:1758
> #9  0x163b0 in smtpopen (SS=0xeffff998, host=0x454b0 "office.sob.tulane.edu", 
>     noMX=0) at smtp.c:1689
> #10 0x13e28 in main (argc=0, argv=0x454b0) at smtp.c:669
> (gdb) 
> 
> "koi.smtp.online.ru" is one of the IP aliases of the local machine.
> selfaddrs.c:296 is the gethostbyname() call.  I *can* believe that this
> is an error in Solaris...  Though this happened in 2.4 as well.
> Also, I *can* write a wrapper around gethostbyname doing alarm(),
> but it sounds ugly.
> 
> Any better ideas?
> Maybe the scheduler could kill lethargic children?  Something else?

	Umm..  It does have the requisite trace information available
	for detecting inactive children...  Yes, it is possible.

	Now, what is the longest time any process may legitimately spend
	waiting for a remote system (or a subprocess, in the case of a
	local pipe)?
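
The scheduler-side check discussed above could be sketched as follows. This is an illustration only: `struct child`, `child_is_stale` and `kill_stale_children` are hypothetical names rather than ZMailer's actual scheduler structures, and the idle limit is left as a parameter because the right value is exactly the open question.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical per-child record; not ZMailer's actual scheduler structure. */
struct child {
    pid_t  pid;            /* 0 marks an unused slot */
    time_t last_activity;  /* last time the child reported any progress */
};

/* A child is stale when it has been silent for more than max_idle seconds.
 * Unused slots (pid <= 0) are never stale, so kill() is never called with
 * a pid of 0 or -1, which would signal whole process groups. */
int child_is_stale(const struct child *c, time_t now, time_t max_idle)
{
    return c->pid > 0 && now - c->last_activity > max_idle;
}

/* Send SIGTERM to every stale child; return how many were signalled. */
size_t kill_stale_children(const struct child *kids, size_t n,
                           time_t now, time_t max_idle)
{
    size_t i, killed = 0;

    for (i = 0; i < n; i++) {
        if (child_is_stale(&kids[i], now, max_idle) &&
            kill(kids[i].pid, SIGTERM) == 0)
            killed++;
    }
    return killed;
}
```

A scheduler loop would call something like `kill_stale_children(kids, n, time(NULL), max_idle)` periodically and then reap the terminated children with `waitpid()` as usual. A `max_idle` that is too short would kill transports that are merely waiting on a slow remote system.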
 
> Eugene

	/Matti Aarnio