crash on NetBSD

Mon Jan 15 14:51:53 CET 2007

"Tracy Di Marco White" <gendalia at gmail.com> writes:
> I salvaged several volumes, and the process left hundreds of
> __ORPHANFILE_.#*.#* in the root of the volumes involved.  Since that day,
> two of our webservers have started having arla fail. The two machines
> involved are running NetBSD 2.0.2/i386, and both are now running arla
> 0.43.
>
Tricky one.  Some random thoughts:

Try checking out arla-0-44-branch from CVS; it has a few changes to
lib/bufdir that may very well be relevant.

I take it that this is not happening every time you access those volumes?
If it happens within some reasonable amount of time (before your disk is
full), you could run arlad with --tracefile=arla.trace (the file ends up in
the chroot dir) and look for any interesting operations just before the
crash (use nnpfs/readtrace.py to read the trace).

Are there any modifying operations, like rename or unlink, going on?
Cross-cell renames would be especially interesting.

Some more details from gdb could be interesting too, such as whether the
page* addresses are totally broken, whether the fbuf and the directory have
the same idea about size, ... Oh, and look at workers[] (0-15 or so?) to see
what they are doing.

I assume the salvage operation produced coherent and sane volumes,
especially since none of my code was involved ;) But one could run
arla-obj/arlad/afsdir_check (not installed) on directories in the cache,
like 02/2C in your second trace, just to see if they look good.
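
If there are more than a handful of cache entries to look at, something
like the following could loop over them. This is only a rough sketch: the
function name check_cache_dirs is made up here, and it assumes the checker
takes the cache file as its only argument and exits non-zero when it finds
a problem, which may not match what afsdir_check really does.

```python
import subprocess
from pathlib import Path


def check_cache_dirs(cache_root, checker, entries):
    """Run `checker` on each cache entry and return the ones that fail.

    cache_root: path to the cache directory (assumption: entries live in
                two-level paths such as 02/2C).
    checker:    command as a list, e.g. ["arla-obj/arlad/afsdir_check"].
    entries:    relative paths to check, e.g. ["02/2C"].
    """
    failed = []
    for rel in entries:
        path = Path(cache_root) / rel
        if not path.is_file():
            # Nothing to hand to the checker; report it and move on.
            failed.append((rel, "missing"))
            continue
        # Assumption: the checker accepts one file argument and signals
        # trouble through a non-zero exit status.
        result = subprocess.run(
            checker + [str(path)], capture_output=True, text=True
        )
        if result.returncode != 0:
            failed.append((rel, result.stdout + result.stderr))
    return failed
```

Calling check_cache_dirs(cache_dir, ["arla-obj/arlad/afsdir_check"],
["02/2C"]) would then return an empty list if the checker is happy with
that entry, or a list of (entry, output) pairs to look at more closely.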

/t