Another debug run with the arla lockup problem

Neulinger, Nathan R. nneul at umr.edu
Wed Feb 17 18:11:17 CET 1999


I can get this particular problem to reproduce itself 100% reliably.

Sequence of actions:
	1. Boot machine
	2. Clear cache completely
	3. Start arla
	4. Enable arla and xfs debugging
	5. Start web server
(Steps 1-5 are all done from the rc startup scripts.)
	6. Log in as root and run ~websrch/bin/run-build

Symptom: run-build sits there eating CPU without doing anything.

Contents of run-build:
----------------------
#!/bin/csh
set echo
setenv LOG /local/logs/build-`date +%Y%m%d`.log
find /local/logs -mtime +4 -exec rm -f {} \;
rm -f $LOG
touch $LOG
/umr/s/websrch/bin/build >& $LOG </dev/null &
----------------------

build is another script. No output is ever written to $LOG, and it appears
that build is never started, even though run-build does exit and continue in
the background.

As indicated in the debug trace
(http://www.umr.edu/~nneul/debug-traces/arla-webindex-19990217.gz), arla
seems to get into a state where it spins doing getnode/installnode with no
xfs logging activity. That is around line 1000 of the log. It looks like the
last xfs activity taking place is an xfs_node_find.

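A rough way to check this against the trace is sketched below. It assumes
the uncompressed trace is an ordinary syslog file in which the xfs debug
lines are tagged "kernel:" and the arla lines are tagged "arla[pid]:", as in
the excerpt further down. The function names last_xfs_line and
getnodes_after are made up for this example.

```shell
#!/bin/sh
# Sketch only. Assumes a plain syslog-style trace where xfs debug output
# appears on "kernel:" lines and arla output on "arla[pid]:" lines.
# Both function names are hypothetical helpers, not part of arla.

# Print "lineno:text" for the last kernel xfs message in file $1.
last_xfs_line() {
    grep -n 'kernel: .*xfs' "$1" | tail -n 1
}

# Count getnode messages in file $1 that appear after line number $2.
getnodes_after() {
    tail -n +"$(( $2 + 1 ))" "$1" | grep -c '(getnode)'
}
```

For example, after unpacking the trace with "gunzip -c" into trace.log,
running "last_xfs_line trace.log" should show where the xfs logging stops,
and passing that line number to "getnodes_after trace.log" gives a count of
the getnode messages arla logs after that point.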
This is running the 2-13 snapshot of arla with my setgroups patch.

Oh, BTW, something else - if I leave this machine running, at some point it
seems to get into a state where it spins doing clear_all_childs. (I can't
log in to capture the debug output, and the console is scrolling too fast to
get any more details.)

993  Feb 17 10:25:15 webindex kernel: xfs_node_find
994  Feb 17 10:25:15 webindex kernel: xfs_message_installnode: dp: c35226a8
995  Feb 17 10:25:15 webindex kernel: xfs_message_installnode: fetching new node
996  Feb 17 10:25:15 webindex kernel: new_xfs_node 0.536956207.219.1050
997  Feb 17 10:25:15 webindex kernel: xfs_node_find
998  Feb 17 10:25:15 webindex arla[350]: worker 0: processing
999  Feb 17 10:25:15 webindex arla[350]: Rec message: opcode = 4 (getnode), s

Looking at the code, it looks like it never gets past new_xfs_node in the
following.

    XFSDEB(XDEBMSG, ("xfs_message_installnode: fetching new node\n"));
    n = new_xfs_node(&xfs[fd], &message->node); /* VN_HOLD's */
    XFSDEB(XDEBMSG, ("xfs_message_installnode: inode: %p aliases: ",
             XNODE_TO_VNODE(n)));

Now, if xfs_node_find were to crash or never return - would arlad get
confused?

I'm thinking of adding a bunch of debug statements scattered around that
area of the code to see if I find anything.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul at umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216 




