arlad crashes/loops (memory corruption?)

Love lha at stacken.kth.se
Thu Dec 21 16:50:42 CET 2000


Nickolai Zeldovich <kolya at mit.edu> writes:

> For some reason I've been having arlad going into infinite loops
> lately (a few times a day, on a moderately loaded machine), and
> on one occasion crash with SIGSEGV.. (This is using the current
> CVS checkout as of Dec 19th, running on FreeBSD 3.1-RELEASE,
> although I see the same looping problem at least with 0.34.6.)

I've seen weird memory corruption problems on Linux/i386 (also in rx) and
on MOX Public Beta (in the util/List.c code). For the latter I'm almost
tempted to blame the compiler (we haven't had time to update and test yet).
 
We have stopped supporting FreeBSD 3.x, so xfs will probably break for you
some day soon. Any special reason you haven't upgraded to 4.x?

> The infinite loop appears to be in rxevent_RaiseEvent() -- it
> calls rxi_Start from the event queue but rxi_Start adds itself
> back to the queue to be ran immediately (time = now). Looking
> at call->conn->peer in gdb, the timeout is indeed zero, but
> the rest of it looks munged as well:
>
[...]
>
> (gdb) p *call->conn->peer
> $1 = {next = 0x0, host = 135369728, port = 63242, packetSize = 14912, 
>   idleWhen = 3910018184, refCount = 0, burstSize = 0 '\000', burst = 0 '\000', 
>   burstWait = {sec = 0, usec = 136199424}, congestionQueue = {prev = 0x0, 
>     next = 0x0}, rtt = 0, rtt_dev = 8, timeout = {sec = 0, usec = 0}, 
>   nSent = 1, reSends = 9, inPacketSkew = 8, outPacketSkew = 8, 
>   rateFlag = 1472, maxWindow = 0, spare = 0}
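
To make the loop described above concrete, here is a toy model of it in
Python. It is only a sketch and not the rx code: raise_events, post and
make_start are made-up names standing in for rxevent_RaiseEvent, the event
posting and rxi_Start, and the iteration cap exists only so that the demo
terminates.

```python
import heapq
import itertools

_seq = itertools.count()  # tie-breaker so handlers are never compared


def post(queue, when, handler):
    """Queue `handler` to run at time `when` (hypothetical stand-in)."""
    heapq.heappush(queue, (when, next(_seq), handler))


def raise_events(queue, now, max_iterations=1000):
    """Run every event due at or before `now`; return how many ran.

    The cap is not part of the model; it only stops the demo.
    """
    ran = 0
    while queue and queue[0][0] <= now and ran < max_iterations:
        _, _, handler = heapq.heappop(queue)
        handler(queue, now)
        ran += 1
    return ran


def make_start(timeout):
    """Build a handler that re-posts itself `timeout` seconds from now."""
    def start(queue, now):
        post(queue, now + timeout, start)
    return start


def demo(timeout):
    queue = []
    post(queue, 0.0, make_start(timeout))
    return raise_events(queue, now=0.0)


print("events run with timeout 0.5:", demo(0.5))
print("events run with timeout 0.0:", demo(0.0))
```

With a non-zero timeout the handler runs once and the queue is left with a
future event. With a zero timeout every run puts a new event at "now", so
the loop only stops when it hits the cap.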

It might be interesting that some fields contain invalid values that look
like pointers (like call->conn->peer.{burstWait.usec,host}).
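
A quick way to see this is to print the two values from the gdb output in
hex. The snippet below is just an illustration; as_hex and the dictionary
are my own names, and the only data in it are the numbers from the dump.

```python
def as_hex(value):
    """Format a 32-bit value as zero-padded hex, e.g. 0x0000002a."""
    return f"{value:#010x}"


# Values copied from the `p *call->conn->peer` output quoted above.
suspect_fields = {
    "host": 135369728,
    "burstWait.usec": 136199424,
}

for name, value in suspect_fields.items():
    print(f"{name:15} {value:>10} = {as_hex(value)}")

# A microsecond field should stay below one million.
usec_in_range = suspect_fields["burstWait.usec"] < 1_000_000
print("burstWait.usec is a valid usec value:", usec_in_range)
```

Both come out as 0x08119400 and 0x081e3d00. Addresses in the 0x08......
range are plausible for the data or heap of a 32-bit i386 process, and
136199424 is far too large to be a real microsecond count, so the struct
has probably been overwritten with something else.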

> The one time arlad crashed with SIGSEGV, it was apparently due
> to a bad rx_call pointer on peer->congestionQueue:
>
[...]
> 
> Any suggestions for how one might go about debugging such lossage
> are certainly welcome :)  I'd be tempted to blame memory problems,
> but everything else appears to be running just fine without arlad.
> (Debugging is also made harder by the fact that soon after arlad
> starts looping, something appears to lock the / vnode, presumably
> trying to access /afs, and everything quickly stalls.)

You may want to try closing kernel_fd (``p close(kernel_fd)'' in gdb); this
should unlock all held vnode locks.
 
> I have the output from "arlad -n --debug=all -z /dev/xfs0" when
> it goes into an infinite loop, with some additional information
> being printed in the event code, in /afs/zepa.net/user/kolya/arla,
> if anyone is interested in taking a look.

I'll try to look at it later.

Love

PS

Index: CellServDB
===================================================================
RCS file: /afs/stacken.kth.se/src/SourceRepository/arla/conf/CellServDB,v
retrieving revision 1.47
diff -u -w -u -w -r1.47 CellServDB
--- CellServDB	2000/09/18 15:04:14	1.47
+++ CellServDB	2000/12/21 15:13:40
@@ -722,3 +722,4 @@
 >ies.auc.dk     	# Aalborg Univ., Inst. of Electronic Systems, Denmark
 130.225.51.73			#afsdb1.kom.auc.dk
 130.225.51.74			#afsdb2.kom.auc.dk
+>zepa.net		# Zepa

?




