Re: mailbox files with "hole" on NFS, linux 2.6
On Wed, Jun 22, 2005 at 04:44:24PM +0400, Eugene Crosser wrote:
> I observe a problem that may not be Zmailer related, but maybe Matti or
> somebody else has some ideas.
> I have kept mailboxes on NFS for many years. The NFS server was previously
> Solaris 7, more recently Linux 2.6.x. Clients are Linux boxes, shares are
> mounted with nolock,nfsvers=3. The clients were running 2.4.x kernels for
> years, with no problems. In the past, I tried some early 2.6.x kernel
> on the clients, and these days I installed 2.6.12.
> On my past try to use a 2.6 kernel I noticed a few cases of very
> unpleasant data corruption, and apparently the same happened today.
> Occasionally a mailbox is noticed with a big size, with a "hole" in the
> middle. It starts as a normal mailbox, then there are lots of zeroes, and
> before the end there is more normal mailbox data. Apparently the zeroes are
> not real:
NFSv3 failures, hmm... Some recent comments on the linux-kernel
development list have reported various odd problems with it.
Could you determine the exact byte offsets where the hole begins
and where it ends? The hole size and its edge offsets are
indicative of the possible fault paths.
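One way to get those offsets is to scan the file for long runs of NUL bytes. The following is only a minimal sketch: the function name find_zero_runs and the 4096-byte threshold are my own assumptions, and it reports zero bytes as read back, without telling a real sparse hole from zeroes actually written to disk.

```python
def find_zero_runs(path, min_run=4096, chunk_size=65536):
    """Return (start, end) byte offsets of NUL runs of at least min_run bytes.

    'end' is exclusive. The name and thresholds are illustrative only.
    """
    runs = []
    run_start = None  # offset where the current zero run began, if any
    offset = 0        # file offset of the start of the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pos = 0
            while pos < len(chunk):
                if chunk[pos] == 0:
                    if run_start is None:
                        run_start = offset + pos
                    # Jump to the first non-zero byte at or after pos.
                    rest = chunk[pos:].lstrip(b"\0")
                    pos = len(chunk) - len(rest)
                else:
                    if run_start is not None:
                        if offset + pos - run_start >= min_run:
                            runs.append((run_start, offset + pos))
                        run_start = None
                    # Jump to the next zero byte, or to the end of the chunk.
                    nxt = chunk.find(b"\0", pos)
                    pos = len(chunk) if nxt == -1 else nxt
            offset += len(chunk)
    # A run that extends to end of file is closed here.
    if run_start is not None and offset - run_start >= min_run:
        runs.append((run_start, offset))
    return runs
```

Whether the begin and end offsets fall on page-size or rsize/wsize multiples is the interesting part of the output.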
Some reports even said that mounting over TCP gave
error-free operation, while the default UDP transport
failed every now and then...
The listed maintainer for the NFS client is:
P: Trond Myklebust
> $ ls -l /var/virtual/online.ru/mail/F/U/uspvpr.broken
> -rw------- 1 root root 13783233 Jun 21 21:32
> $ du /var/virtual/online.ru/mail/F/U/uspvpr.broken
> 16 /var/virtual/online.ru/mail/F/U/uspvpr.broken
> there is 16K of real data, the rest is a "hole".
> I have TA_USE_MMAP=0 everywhere.
> The problem happens rarely. Still it looks rather bad.
> The hole is observable on the nfs server machine as well as on clients.
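The ls/du comparison quoted above can be automated to sweep a whole mail spool. This is a rough sketch only: the function name size_report is made up here, and it assumes a Unix-like system where st_blocks counts 512-byte units, as it does on Linux.

```python
import os

def size_report(path):
    """Return (apparent_bytes, allocated_bytes, missing_bytes) for a file.

    apparent_bytes is what 'ls -l' shows; allocated_bytes is roughly what
    'du' shows. Assumes st_blocks is in 512-byte units (true on Linux).
    """
    st = os.stat(path)
    apparent = st.st_size
    allocated = st.st_blocks * 512
    # A large positive difference suggests a sparse file with a hole.
    missing = max(apparent - allocated, 0)
    return apparent, allocated, missing
```

For the file quoted above this would give an apparent size of 13783233 bytes against about 16K allocated.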
The local-delivery process isn't the only one modifying
your mailbox files. Your POP/IMAP server modifies them as well.
Running without fcntl locking... are you using dot-locking?
The NFS protocol is stateless, so every read and write
carries the offset at which the operation is
to happen, and therefore the most likely place for
a "wrong offset" bug is on the client side.
The next most likely is a bug in the NFS server code.
With the present in-kernel NFS server some of the earlier
bugs are no longer present, but who knows what else happens...
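To illustrate why a wrong offset produces exactly this kind of file, here is a small local-file analogy. It is not NFS code: the helper names write_at and simulate_bad_append are hypothetical, and os.pwrite merely stands in for a write request that carries its own offset.

```python
import os

def write_at(path, offset, data):
    """Write data at an explicit offset, independent of any file position."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, data, offset)
    finally:
        os.close(fd)

def simulate_bad_append(path, first, second, wrong_offset):
    """Write 'first' at offset 0, then 'second' at a bogus offset.

    If wrong_offset is beyond the current end of file, the gap is never
    written and reads back as zeroes, i.e. a hole.
    """
    write_at(path, 0, first)
    write_at(path, wrong_offset, second)
    with open(path, "rb") as f:
        return f.read()
```

Writing 100 bytes and then "appending" at offset 5000 instead of 100 leaves bytes 100 to 4999 as a hole, with normal data on both sides of it.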
A single-bit corruption in the network protocol is also possible.
In the early days of SunOS and NFS it was customary NOT
to do UDP checksums, for performance reasons, and NFS got a very
bad reputation... Single-bit corruption is also possible
in computer memory, but I trust you are paranoid enough
to run with ECC memory and all checks and alerts active?
Bit corruptions at the network level (Ethernet, for example)
are rare, and even rarer when they happen in bundles such that
neither the Ethernet CRC nor the UDP checksum detects them.
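As a side note on how weak the UDP checksum is on its own, here is a sketch of the 16-bit one's-complement Internet checksum (RFC 1071) that UDP uses. It is simplified: the function name inet_checksum is mine, and the UDP pseudo-header and header fields are left out, so it only demonstrates the arithmetic.

```python
def inet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\0"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    # Fold the carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

A single flipped bit always changes the result, but because the sum is order-independent, two swapped 16-bit words go unnoticed, so some multi-bit damage is caught only by the Ethernet CRC.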
> Any ideas (other than rollback to 2.4, which I'll probably
> have to do if no better solution is found)?
/Matti Aarnio <firstname.lastname@example.org>