Error 'Software caused connection abort' in arla 0.26 and 0.25

Sat Jul 24 18:50:49 CEST 1999

On 24 Jul 1999, Assar Westerlund wrote:

> Dr A V Le Blanc <LeBlanc at mcc.ac.uk> writes:
> > As you see, it takes about 10 seconds to work properly, and
> > the first four times it fails with this funny error message.
> > The message is not from anywhere in arla directly.  The error
> > number ECONNABORTED is defined in errno.h.  This is on Linux
> > 2.2.10 with arla 0.26, but also occurred with 2.2.9 and 0.25.
> 
> As you saw yourself, there's nowhere in arla or xfs where ECONNABORTED
> should be returned.  And looking around in the Linux source it only
> seems to happen in inet_accept and ip_fw_ctl, neither of which should
> apply here.  It could however, be a VNOVOL error that doesn't get
> translated properly.  Can you turn on arla debugging (with `fs
> arladebug all') and then try accessing some file that gives you these
> errors?  That way we might be able to figure out where the error is
> occurring.

Any time an operation involving AFS returns what appears to be a system
error in the low 100's, my first thought is always that it might actually
be a volume package error (which presently range from 101 to 112).  These
errors are all listed in <afs/errors.h>.
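
To make that check concrete, here is a rough sketch of a lookup for the
101-112 range.  The helper name vol_error_name() is made up for this
example, and apart from VNOVOL (103) the names and values are written from
memory of <afs/errors.h>, so please check them against your own copy of
the header before relying on them.

```c
#include <stddef.h>

/* Volume package error codes, as recalled from <afs/errors.h>.
 * Assumption: verify these values against your own header. */
struct vol_error {
    int code;
    const char *name;
};

static const struct vol_error vol_errors[] = {
    { 101, "VSALVAGE" },   { 102, "VNOVNODE" },  { 103, "VNOVOL" },
    { 104, "VVOLEXISTS" }, { 105, "VNOSERVICE" }, { 106, "VOFFLINE" },
    { 107, "VONLINE" },    { 108, "VDISKFULL" },  { 109, "VOVERQUOTA" },
    { 110, "VBUSY" },      { 111, "VMOVED" },     { 112, "VIO" },
};

/* Hypothetical helper: return the volume error name for `code`,
 * or NULL if `code` is outside the volume package range. */
const char *vol_error_name(int code)
{
    size_t i;

    for (i = 0; i < sizeof(vol_errors) / sizeof(vol_errors[0]); i++) {
        if (vol_errors[i].code == code)
            return vol_errors[i].name;
    }
    return NULL;
}
```

On Linux, where ECONNABORTED is 103, passing that errno value to
vol_error_name() yields "VNOVOL", which is exactly the collision suspected
here.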

I agree, this could easily be VNOVOL, which makes quite a bit more sense
than ECONNABORTED.  Offhand, my guess would be that this indicates a case
where the VLDB says a volume is on a particular fileserver, but either
that server doesn't have a copy of that volume, or its copy is offline for
some reason.  The fact that you can access the file at all further suggests
that this is a replicated volume, and only some RO sites are affected.
I would suggest examining the volume in question to determine whether
there is some problem.

Of course, Arla should also be fixed - when a fileserver housing one copy
of a replicated volume returns VNOVOL, the client should attempt to find
another copy of the volume.  I'm pretty sure Arla does this already, but
it's an uncommon enough situation that it could be broken and no one would
notice for some time.

Indeed, in arlad/fcache.c:try_next_fs(), the handling of VMOVED and VNOVOL
changed between 0.22 and 0.25 (versions I happen to have on hand).
Previously, if a fileserver returned VMOVED or VNOVOL, arla would try the
next fileserver, if any.  Now, it gives up on the call immediately, but
then updates its volume cache and tries again.  The new behaviour is
correct for VMOVED, but IMNSHO try_next_fs() should still return TRUE for
VNOVOL, since we could be talking about an RO site which doesn't have an
online copy of the volume, and I believe the current code will retry such
a site forever.
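
The policy I am arguing for can be sketched as follows.  This is not the
code in arlad/fcache.c; should_try_next_fs() is a made-up stand-in that
models only the two error codes discussed here, using the raw volume
package values (103, and 111 for VMOVED as I recall it) purely for
illustration.

```c
/* Raw volume package values; VMOVED's value is from memory. */
#define VNOVOL 103
#define VMOVED 111

/* Hypothetical stand-in for the decision discussed above.
 * Returns nonzero (TRUE) if the caller should move on to the next
 * fileserver in the list, zero if it should give up on this call and
 * refresh its volume information before retrying. */
int should_try_next_fs(int error)
{
    switch (error) {
    case VNOVOL:
        /* This site may simply lack an online copy; another RO
         * site may still be able to serve the volume. */
        return 1;
    case VMOVED:
        /* The volume has moved, so the cached location is stale;
         * refresh the volume cache instead of walking the old list. */
        return 0;
    default:
        /* Other errors are not modelled in this sketch. */
        return 0;
    }
}
```

With this policy a single bad RO site costs one failed call rather than an
endless retry against the same server.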

In any case, I don't think that's your problem -- if that code were broken
_and_ leaked an error code, it would likely leak ARLA_VNOVOL (4103), not
VNOVOL (103).  I believe the real problem in this case is that the error
code translation is not happening, and so the special handling for VNOVOL
is not happening.  I'll forward more details when I'm more sure of what's
going on.
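
For illustration, the kind of translation I mean looks roughly like this.
The function name and the fixed offset of 4000 are assumptions, inferred
only from ARLA_VNOVOL being 4103 and VNOVOL being 103; the actual
translation in arla may well be done differently.

```c
/* Range of volume package errors mentioned above. */
#define VOL_ERROR_MIN 101
#define VOL_ERROR_MAX 112

/* Assumption: inferred from ARLA_VNOVOL (4103) - VNOVOL (103). */
#define ARLA_VOL_ERROR_OFFSET 4000

/* Hypothetical sketch: move volume package errors out of the errno
 * range so they cannot be mistaken for system errors; pass every
 * other code through unchanged. */
int translate_vol_error(int code)
{
    if (code >= VOL_ERROR_MIN && code <= VOL_ERROR_MAX)
        return code + ARLA_VOL_ERROR_OFFSET;
    return code;
}
```

If this step is skipped, a raw 103 reaches the application, and the C
library reports it as ECONNABORTED.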

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+ at cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA