another arlad crash on netbsd

Love lha at stacken.kth.se
Sun Jan 3 23:51:46 CET 1999


Ken Raeburn <raeburn at raeburn.org> writes:

> I was running a "du" across a modem line (ppp) that probably had a
> bunch of other traffic as well (mail & news downloads, X11), and when
> I went to look at the output, after some numbers for the first many
> directories, I saw a lot of "network is down" messages for individual
> files, then:
> 
>     du: ./.mh/save/1610: Network is down
>     du: ./.mh/save/1616: Network is down
>     du: ./.mh/save/.mh_sequences: Network is down
>     751     ./.mh/save
>     du: ./.mh/Zephyr: Operation not supported by device
>     du: ./.mh/ANSI_C: Not a directory
>     du: ./.mh/tcl: Not a directory
> 
> The "not a directory" stuff seems to come up when arlad isn't running,
> so I'm guessing that that's the point when it crashed, and the
> "network is down" came from having a heavy load on the ppp link, but
> I'm just guessing.

The "Operation not supported by device" and "Not a directory" errors come
from the dead vnode that xfs created when arlad died.

What probably happened is that your fileserver's responses got through to
arlad too late, so arlad considered the fileserver down. Later, when
arlad cleaned the cache, the cleaner thread never checked the return value
of conn_get(), which is NULL in that case.
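
As a minimal sketch of that failure mode and the guard that avoids it
(this is not the arlad code; struct conn_entry, sketch_conn_get and
sketch_give_up_callback are hypothetical stand-ins for illustration):

```c
#include <stddef.h>

/* Hypothetical stand-in for a connection entry. */
struct conn_entry {
    int alive;			/* non-zero if the server is considered up */
};

/*
 * Hypothetical stand-in for conn_get(): returns NULL when the
 * server is considered down.
 */
struct conn_entry *
sketch_conn_get (struct conn_entry *server)
{
    if (server == NULL || !server->alive)
	return NULL;
    return server;
}

/*
 * Returns 1 if the callback RPC would be issued, 0 if it is skipped
 * because no connection could be obtained.
 */
int
sketch_give_up_callback (struct conn_entry *server)
{
    struct conn_entry *conn = sketch_conn_get (server);

    if (conn == NULL)
	return 0;		/* server down: do not dereference conn */

    /* The real code would make the RPC through conn here. */
    return 1;
}
```

Without the NULL test, the cleaner would dereference conn for a server
that is marked down.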

I guess there needs to be some way to tell arla to retry forever when
you sit behind a slow link and really want your files. That should
probably not be too hard to fix.

The patch below should fix the problem with arlad dying (but not the
retry stuff).
 
> And entry->host does correspond to the host holding the volume I was
> examining.  (However, using Transarc "fs whereis" on the volume after
> restarting arlad, I get a backwards IP address printed out,
> "30.0.185.18" when it should presumably be "18.185.0.30" or
> "cronos.mit.edu".  Perhaps AFS and Arla are using different byte
> orders for that datum.)

We always assume that data is in network byte order when passed between
arla and fs. The documentation does not say anything about the byte order
of the host address. I guess you could use arla's fs instead.
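
A minimal sketch of how such a reversed address can show up, assuming
one side stores the 32-bit address in one byte order and the other side
reads it byte-swapped (format_addr and swap32 are hypothetical helpers,
not arla or Transarc code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 32-bit address as a dotted quad, taking the most
 * significant byte as the first octet.
 */
void
format_addr (uint32_t addr, char *buf, size_t len)
{
    assert (buf != NULL);
    snprintf (buf, len, "%u.%u.%u.%u",
	      (unsigned)((addr >> 24) & 0xff),
	      (unsigned)((addr >> 16) & 0xff),
	      (unsigned)((addr >> 8) & 0xff),
	      (unsigned)(addr & 0xff));
}

/*
 * Reverse the four bytes of x. This is the effect of reading a value
 * in the opposite byte order from the one it was written in.
 */
uint32_t
swap32 (uint32_t x)
{
    return ((x & 0x000000ffU) << 24)
	 | ((x & 0x0000ff00U) << 8)
	 | ((x & 0x00ff0000U) >> 8)
	 | ((x & 0xff000000U) >> 24);
}
```

With addr built as (18 << 24) | (185 << 16) | (0 << 8) | 30,
format_addr gives "18.185.0.30", and format_addr on swap32 (addr) gives
"30.0.185.18", which matches the reversed output quoted above.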

Love


Index: fcache.c
===================================================================
RCS file: /usr/local/cvsroot/arla/arlad/fcache.c,v
retrieving revision 1.176
diff -u -w -u -w -r1.176 fcache.c
--- fcache.c	1999/01/03 05:25:39	1.176
+++ fcache.c	1999/01/03 22:26:11
@@ -449,6 +449,7 @@
 			 FS_SERVICE_ID, fs_probe, ce);
 	cred_free (ce);
 
+	if (conn != NULL) {
 	fids.len = cbs.len = 1;
 	fids.val = &entry->fid.fid;
 	cbs.val  = &entry->callback;
@@ -457,6 +458,7 @@
 	if (ret)
 	    arla_warn (ADEBFCACHE, ret, "RXAFS_GiveUpCallBacks");
     }
+    }
     volcache_free (entry->volume);
     entry->volume = NULL;
 /*    entry->inode  = 0;*/
@@ -1417,6 +1419,8 @@
 	    conn = conn_get (entry->fid.Cell, entry->host, afsport,
 			     FS_SERVICE_ID, fs_probe, ce);
 	    cred_free (ce);
+
+	    if (conn != NULL) {
 	    fids.len = cbs.len = 1;
 	    fids.val = &entry->fid.fid;
 	    cbs.val  = &entry->callback;
@@ -1427,6 +1431,7 @@
 		arla_warn (ADEBFCACHE, ret, "RXAFS_GiveUpCallBacks");
 	}
     }
+    }
     return 0;			/* XXX */
 }
 
@@ -1457,7 +1462,9 @@
 
 	    conn = conn_get (entry->fid.Cell, entry->host, afsport,
 			     FS_SERVICE_ID, fs_probe, ce);
+	    cred_free (ce);
 
+	    if (conn != NULL) {
 	    ret = RXAFS_FetchStatus (conn->connection,
 				     &entry->fid.fid,
 				     &status,
@@ -1470,7 +1477,7 @@
 			      rx_HostOf(rx_PeerOf (conn->connection)),
 			      ce->cred);
 	    conn_free (conn);
-	    cred_free (ce);
+	    }
 	}
     }
     return 0;			/* XXX */




