arlad crashes/loops (memory corruption?)

Nickolai Zeldovich kolya at mit.edu
Wed Dec 20 23:26:28 CET 2000


For some reason arlad has been going into infinite loops lately
(a few times a day, on a moderately loaded machine), and on one
occasion it crashed with SIGSEGV. (This is using the current
CVS checkout as of Dec 19th, running on FreeBSD 3.1-RELEASE,
although I see the same looping problem at least with 0.34.6.)

The infinite loop appears to be in rxevent_RaiseEvents() -- it
calls rxi_Start from the event queue, but rxi_Start adds itself
back to the queue to be run immediately (time = now). Looking
at call->conn->peer in gdb, the timeout is indeed zero, but
the rest of it looks munged as well:

(gdb) bt
#0  0x808b126 in IOMGR_Cancel (pid=0x80a8200) at iomgr.c:777
#1  0x8087d3b in rxi_ReScheduleEvents () at rx_user.c:66
#2  0x8087b3f in rxevent_Post (when=0x80eeeb8, func=0x80829d8 <rxi_Start>, 
    arg=0x81e3b00, arg1=0x0) at rx_event.c:139
#3  0x808302e in rxi_Start (event=0x80a7a14, call=0x81e3b00) at rx.c:3182
#4  0x8087cf1 in rxevent_RaiseEvents (next=0x80eef34) at rx_event.c:208
#5  0x8088097 in rxi_Listener () at rx_user.c:283
[...]
(gdb) frame
#3  0x808302e in rxi_Start (event=0x80a7a14, call=0x81e3b00) at rx.c:3182
3182                call->resendEvent = rxevent_Post(&retryTime, rxi_Start, (char *) call, 0);
(gdb) p *call->conn->peer
$1 = {next = 0x0, host = 135369728, port = 63242, packetSize = 14912, 
  idleWhen = 3910018184, refCount = 0, burstSize = 0 '\000', burst = 0 '\000', 
  burstWait = {sec = 0, usec = 136199424}, congestionQueue = {prev = 0x0, 
    next = 0x0}, rtt = 0, rtt_dev = 8, timeout = {sec = 0, usec = 0}, 
  nSent = 1, reSends = 9, inPacketSkew = 8, outPacketSkew = 8, 
  rateFlag = 1472, maxWindow = 0, spare = 0}
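
To make the suspected loop concrete, here is a toy model in C. It is
only a sketch of the mechanism as I understand it, not the real rx
code: every name in it (toy_clock, toy_event, toy_start,
toy_raise_events) is made up for illustration, and the max_iter cap
exists only so that the demonstration terminates.

```c
/* Toy model of an event that re-posts itself; NOT the actual rx code. */
struct toy_clock { long sec; long usec; };

struct toy_event {
    struct toy_clock when;  /* time at which the event is due */
    int pending;            /* 1 while the event sits on the queue */
};

/* Returns 1 if a <= b. */
static int clock_le(struct toy_clock a, struct toy_clock b)
{
    if (a.sec != b.sec)
        return a.sec < b.sec;
    return a.usec <= b.usec;
}

/* Returns a + b, carrying microseconds into seconds. */
static struct toy_clock clock_add(struct toy_clock a, struct toy_clock b)
{
    struct toy_clock r;
    r.sec = a.sec + b.sec;
    r.usec = a.usec + b.usec;
    if (r.usec >= 1000000) {
        r.sec += 1;
        r.usec -= 1000000;
    }
    return r;
}

/* Hypothetical stand-in for rxi_Start: re-posts itself at now + timeout. */
static void toy_start(struct toy_event *ev, struct toy_clock now,
                      struct toy_clock timeout)
{
    ev->when = clock_add(now, timeout);
    ev->pending = 1;
}

/*
 * Hypothetical stand-in for rxevent_RaiseEvents: runs the event for as
 * long as it is due at 'now'.  Returns the number of handler calls,
 * giving up after max_iter so the demonstration cannot hang.
 */
static int toy_raise_events(struct toy_event *ev, struct toy_clock now,
                            struct toy_clock timeout, int max_iter)
{
    int calls = 0;
    while (ev->pending && clock_le(ev->when, now) && calls < max_iter) {
        ev->pending = 0;              /* take the event off the queue */
        toy_start(ev, now, timeout);  /* handler puts itself back */
        calls++;
    }
    return calls;
}
```

With a timeout of {0, 0} the re-posted event is due again at once, so
toy_raise_events() only stops when it hits max_iter; with any
nonzero timeout it runs the handler once and returns.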

The one time arlad crashed with SIGSEGV, it was apparently due
to a bad rx_call pointer on peer->congestionQueue:

(gdb) frame
#0  0x80834f6 in rxi_DecongestionEvent (event=0x80a78c4, peer=0x8119300, 
    nPackets=1) at rx.c:3405
3405        for (queue_Scan(&peer->congestionQueue, call, nxcall, rx_call)) {
(gdb) p peer->congestionQueue
$6 = {prev = 0x396c9136, next = 0x2dd286eb}
(gdb) p *peer->congestionQueue.next
Cannot access memory at address 0x2dd286eb.

Any suggestions for how one might go about debugging such lossage
are certainly welcome :)  I'd be tempted to blame memory problems,
but everything else appears to be running just fine without arlad.
(Debugging is also made harder by the fact that soon after arlad
starts looping, something appears to lock the / vnode, presumably
trying to access /afs, and everything quickly stalls.)

I have the output from "arlad -n --debug=all -z /dev/xfs0" when
it goes into an infinite loop, with some additional information
being printed in the event code, in /afs/zepa.net/user/kolya/arla,
if anyone is interested in taking a look.

-- kolya