
Re: [ZMailer] Zmailer crashes



Matti, Ralf, Neal,

I'm also getting segfaults from router, a few of them every week.
It's a Debian Linux box:
Linux mail 2.6.31.5 #1 SMP Mon Oct 26 23:42:58 ART 2009 x86_64 GNU/Linux
Zmailer sources from Eugene's CVS.
I've turned on core dumps (ulimit -Sc 51200 in the shell that
starts/restarts the router) but have never gotten a core file from a
real crash. If I signal the router process with kill -SIGSEGV pid I do
get a core file, but that one is useless.
So I decided to run gdb and attach to each router process (router -dkn
2) and wait...
This is what I got when router finally segfaulted.
Can anybody help with this gdb trace?
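
For anyone trying to reproduce the missing core files: a minimal sketch for
checking which core limit a running router process really has, and where the
kernel would write the core. It assumes a Linux kernel that provides
/proc/<pid>/limits; core_limit_of is a hypothetical helper name, not part of
ZMailer.

```shell
#!/bin/sh
# Print the core-file size limit that a given process is running with.
# Assumption: Linux with /proc/<pid>/limits available.
# core_limit_of is a hypothetical helper name used only for this sketch.
core_limit_of() {
    # $1 = pid; prints the "Max core file size" row (soft limit, hard limit, units)
    grep '^Max core file size' "/proc/$1/limits"
}

# Example: inspect the current shell itself.
core_limit_of "$$"

# Where the kernel writes core files (a file name pattern, or a pipe to a helper).
cat /proc/sys/kernel/core_pattern
```

Running core_limit_of with the pid of a router child shows whether the
ulimit set in the starting shell actually reached that process.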

Thanks
Rodolfo

--------------------------------
(gdb) attach 27737
Attaching to process 27737
Reading symbols from /usr/local/zmailer/bin/router...done.
Reading symbols from /lib/libcrypt.so.1...Reading symbols from
/usr/lib/debug/lib/libcrypt-2.7.so...done.
done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /usr/lib/libdb-4.6.so...done.
Loaded symbols for /usr/lib/libdb-4.6.so
Reading symbols from /usr/lib/libgdbm.so.3...done.
Loaded symbols for /usr/lib/libgdbm.so.3
Reading symbols from /lib/libresolv.so.2...Reading symbols from
/usr/lib/debug/lib/libresolv-2.7.so...done.
done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libc.so.6...Reading symbols from
/usr/lib/debug/lib/libc-2.7.so...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/libpthread.so.0...Reading symbols from
/usr/lib/debug/lib/libpthread-2.7.so...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x7f9ecb9136e0 (LWP 27737)]
done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/ld-linux-x86-64.so.2...Reading symbols from
/usr/lib/debug/lib/ld-2.7.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007f9ecade2960 in __read_nocancel () from /lib/libc.so.6
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f9ecb9136e0 (LWP 27737)]
0x0000000000412ae8 in router (a=0x10c8570, uid=370, type=0x46fae8 "sender",
    senderstr=0x0) at shliaise.c:678
678                     if (p->p_type == anAddress) {
(gdb) where
#0  0x0000000000412ae8 in router (a=0x10c8570, uid=370,
    type=0x46fae8 "sender", senderstr=0x0) at shliaise.c:678
#1  0x00000000004168d8 in thesender (e=0x10c6c88, a=0x10c8570)
    at rfc822.c:1175
#2  0x0000000000419c4f in sequencer (e=0x10c6c88, file=0xddf860 "29680333")
    at rfc822.c:1767
#3  0x0000000000413584 in run_rfc822 (argc=2, argv=0x7fff67e50280)
    at rfc822.c:165
#4  0x0000000000449b5a in execute (c=0x7fff67e51a30, caller=0x7fff67e53400,
    oretcode=0, name=0x46dccd "rfc822") at execute.c:397
#5  0x000000000043471a in runcommand (c=0x7fff67e51a30, pc=0x7fff67e53400,
    retcodep=0x7fff67e534dc, cmdname=0x46dccd "rfc822") at interpret.c:762
#6  0x00000000004379b4 in interpret (Vcode=0xdb0130, Veocode=0xdb03cb,
    Ventry=0xdb013f, caller=0x7fff67e53400, retcodep=0x7fff67e534dc,
    cdp=0xdafef0) at interpret.c:1805
#7  0x000000000043bc08 in lapply (fname=0x4713ab "process", l=0x7f9eca818e58)
    at interpret.c:2881
#8  0x000000000043bcc0 in apply (argc=2, argv=0x7fff67e53910)
    at interpret.c:2905
#9  0x000000000040ff4b in s_apply (argc=2, argv=0x7fff67e53910)
    at shliaise.c:71
#10 0x0000000000429ca1 in rd_doit (filename=0x7fff67e53620 "29680333",
    dirs=0x7fff67e539a8 "") at daemonsub.c:1454
#11 0x0000000000428bf0 in child_server (tofd=0, frmfd=1) at daemonsub.c:799
#12 0x0000000000427c03 in start_child (i=0) at daemonsub.c:276
#13 0x000000000042a3e8 in run_daemon (argc=1, argv=0x7fff67e55d20)
    at daemonsub.c:1652
#14 0x0000000000404db8 in main (argc=3, argv=0x7fff67e55f18) at router.c:419
(gdb)
------------------------
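
A possible way to capture more state the next time this happens, without
sitting at an interactive prompt: a sketch that writes a gdb command file for
one router process. The gdb commands themselves are standard; the helper name
write_gdb_cmds and the /tmp file names are hypothetical, and "print p" assumes
the local variable p from shliaise.c:678 is visible in the faulting frame.

```shell
#!/bin/sh
# Write a gdb command file that attaches to one router process, waits for the
# crash, and then dumps the state around the faulting line.
# write_gdb_cmds is a hypothetical helper name, not part of ZMailer or gdb.
write_gdb_cmds() {
    # $1 = pid of the router process, $2 = output file
    cat > "$2" <<EOF
set pagination off
attach $1
continue
bt full
info locals
print p
info registers
detach
EOF
}

# Usage (the pid is the one from the session above, shown only as an example):
#   write_gdb_cmds 27737 /tmp/router-27737.gdb
#   gdb -batch -x /tmp/router-27737.gdb /usr/local/zmailer/bin/router \
#       > /tmp/router-27737.log 2>&1
```

The "bt full" and "print p" output would show whether p itself is a bad
pointer at line 678, which the plain "where" output above does not.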
Ralf Baechle wrote:
> On Wed, Feb 04, 2009 at 03:09:55PM +0000, Ralf Baechle wrote:
> 
>>> On Fri, Jan 30, 2009 at 03:32:31PM -0800, Neal Morgan wrote:
>>>>> On October 31, 2008 9:03 AM Ralf Baechle wrote
>>>>> Since quite a while I'm observing these kernel messages on a Linux x86_64
>>>>> system:
>>>>>
>>>>> sm[3270]: segfault at 3ba7f9f0 ip 79fbc9 sp 7fffe7c48e30 error 6 in libc-2.7.so[72d000+14d000]
>>>>> sm[3493] trap stack segment ip:7f0e2a121bc9 sp:7fff3240e4a0 error:0
>>>>> sm[3773]: segfault at 3ba7f9f0 ip 79fbc9 sp 7fff55499680 error 6 in libc-2.7.so[72d000+14d000]
>>>>
>>>> Matti: I've been seeing these across 4 servers:
>>>>
>>>> kernel: smtpserver[31693]: segfault at 00000000 eip b7c16371 esp
>>>> bf94b018 error 4
>>>>
>>>> kernel: router[9934]: segfault at 00000008 eip 0807fa95 esp bfdf5570
>>>> error 4
>>>>
>>>> The interesting thing is it only happens when booted into a 2.6.24
>>>> kernel.  If I reboot the same box into a 2.6.18 kernel everything runs
>>>> fine (and there are no segfaults).
>> Older kernels don't emit this segfault message.  It was added in
>> commit abd4f7505bafdd6c5319fe3cb5caf9af6104e17a that is for 2.6.23.  Could
>> that be why you didn't notice it earlier?
>>
>>> I do see them too with 2.6.26 kernel at zmailer.org server.
>>> A few hits per week according to kernel dmesg logs.
>>>
>>> I suspect more about glibc doing something stupid, than program really
>>> going over the edge, but these are so rare that debugging them is next
>>> to impossible.   Previously I have seen them happen after the program
>>> has called exit(0).
>>>
>>> Anyway I have turned on core dumps to be able to see what happens.
>> I've seen Zmailer stopping mail delivery or stopping accepting connections
>> on port 25.  The issue is hitting relatively infrequently but I decided to
>> follow your example and just turned on core dumps; it is affecting sm,
>> smtpserver and router.  Lately the frequency of this issue striking
>> seems to have increased significantly - I wonder if that's due to me
>> looking more frequently after it or due to my extremely inflated mail
>> queue with over 1,700,000 stored messages.
>>
>> Ironically I seem to have gotten another router segfault just seconds
>> before I enabled core dumps ...
> 
> To close this old case - the issue went away for me after upgrading the
> system from Fedora 8 to Fedora 10.  So I assume there was indeed, as
> Matti suspected, something toxic in glibc.
> 
>   Ralf
> --
> To unsubscribe from this list: send the line "unsubscribe zmailer" in
> the body of a message to majordomo@nic.funet.fi