Bonnie confuses arla

Friedrich Delgado Friedrichs delgado at dfn-cert.de
Thu Dec 16 16:15:53 CET 2004


Hi!

Excuse the long delay; other matters are more pressing for me
right now. However, I'm definitely interested in resolving this matter,
even if it takes some time.

Tomas Olsson wrote:
> So bonnie basically writes directly to the cache file, which is
> fast. Then the cache volume is filled, before arlad is informed of
> the file size change or has a chance to write back the data to the
> file server. Thus the error message and full partition.

Your explanation is interesting. If I understand you correctly, this
situation will occur every time a program tries to write a file that is
larger than the cache. However, I was not able to reproduce exactly
that problem via "cat /dev/zero > /afs/[...]/toolarge". The cat
process hangs in iowait for a while, but is eventually released and
leaves the "toolarge" file in AFS space, where it can be deleted by the
user or flushed.
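
A bounded variant of the cat test may be easier to repeat, since it
stops at a known size and reports the first error it sees. The sketch
below is only an illustration: the function name write_zeros, the chunk
size and the target path are placeholders of mine, not anything that
arla or bonnie provides.

```python
import errno
import os


def write_zeros(path, total_bytes, chunk_size=1 << 20):
    """Write total_bytes zero bytes to path (hypothetical helper).

    Returns (bytes_accepted, error_name). error_name is None on success,
    otherwise the symbolic errno (for example 'ENOSPC') of the first
    failure. bytes_accepted counts bytes handed to the file object
    before the failure, which may be more than what reached the disk.
    """
    chunk = b"\0" * chunk_size
    accepted = 0
    try:
        # The whole open/write/close sequence is inside the try block,
        # because some errors only show up at flush or close time.
        with open(path, "wb") as f:
            while accepted < total_bytes:
                n = min(chunk_size, total_bytes - accepted)
                f.write(chunk[:n])
                accepted += n
            f.flush()
            os.fsync(f.fileno())
    except OSError as exc:
        return accepted, errno.errorcode.get(exc.errno, str(exc.errno))
    return accepted, None
```

Pointing this at a file under /afs with total_bytes somewhat larger
than the configured cache size should exercise the same path as the
cat test, with the advantage that the size and the error are recorded.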

Bonnie, however, leaves no files behind, yet the partition is still
full, and arla seems to be unaware of that.

Is there some way for the user to mitigate the full partition effect?

fs flush, fs flushv and sync have no effect: the partition remains full.
There are no files, so the user can't delete anything to make room.
The only thing that works for me is restarting arla. I suspect that this
is a bug. The user should not be able to screw the client up in such a
way that a reboot (or a restart of a vital daemon) is needed.
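
To put numbers on "the partition remains full", the usage of the cache
partition can be sampled before and after each attempted fix. This is a
minimal sketch using os.statvfs; the helper name partition_usage is
hypothetical, and the path passed to it has to be whatever directory
the local arla cache actually lives in, which is not stated here.

```python
import os


def partition_usage(path):
    """Return (total_bytes, free_bytes, used_fraction) for the
    filesystem that holds path (hypothetical helper)."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    # f_bavail is the space available to unprivileged users.
    free = st.f_bavail * st.f_frsize
    used_fraction = 1.0 - (free / total) if total else 0.0
    return total, free, used_fraction
```

Calling it once after the bonnie run, once after fs flush and once
after restarting arla would show whether any of the steps releases
space.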

> The "No such device" usually means that arlad has died. But I would
> expect the sockets to be released when arlad crashes. If they don't,
> we should probably figure out why. Is it easy to reproduce? Does it
> always happen when arlad dies?

It's apparently very difficult to reproduce. I just tried to reproduce
the oops and the arlad death, but I only managed to reproduce the
partition full effect (as above).

I'm not able to devote much time to this problem, but if I manage to
get some more information, I will notify this list. Maybe the general
instability of the "stable" 2.6 Linux kernel series is responsible for
the oops, and not arla.

> Arla comes with its own test suite, try going into your build tree,
> cd tests, set $WORKDIR and run ./run-tests -all -fast. Skip -fast if
> you want it to take some time, or select individual tests for
> narrowing down.

I tried this. Since we don't (and won't) mount /afs/stacken.kth.se,
all the tests that try to access this cell fail.

Other tests that fail are:

2004-12-13 18:05:11 - Running strange-other-characters
Test strange-other-characters FAILED
2004-12-13 18:05:15 - Running checkpwd
Test checkpwd FAILED

I don't know what this signifies.
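
For longer runs, the failing test names can be pulled out of the saved
output mechanically. The sketch below assumes that every failure is
reported on a line of the form "Test NAME FAILED", as in the two
excerpts above; whether run-tests prints other formats is not known
here, and the function name failed_tests is a placeholder.

```python
import re

# Matches lines such as "Test checkpwd FAILED" (format assumed from
# the excerpt above).
FAILED_RE = re.compile(r"^Test (\S+) FAILED$")


def failed_tests(log_lines):
    """Return the names of failed tests, in the order they appear."""
    failed = []
    for line in log_lines:
        match = FAILED_RE.match(line.strip())
        if match:
            failed.append(match.group(1))
    return failed
```

Feeding it the lines of a saved run-tests log gives a list that can be
compared with the tests that touch the stacken.kth.se cell, leaving
only the unexplained failures.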

Kind regards
     Friedel
--
Friedrich Delgado Friedrichs (IT-Services), DFN-CERT Services GmbH
https://www.dfn-cert.de/, +49 40 808077-555 (Hotline)
12th DFN-CERT Workshop and Tutorials, CCH Hamburg, March 2-3, 2005
Information and registration at: https://www.dfn-cert.de/events/ws/2005/
