Bonnie confuses arla

Friedrich Delgado Friedrichs delgado at dfn-cert.de
Tue Dec 7 19:51:09 CET 2004


Hi!

Issuing the command:

bonnie -s 2000 -v 3

in /afs/dfn-cert.de/public/TEST

Leads to the following situation:

delgado at count:/afs/dfn-cert.de/public/TEST$ bonnie -s 2000 -v 3
Bonnie 1.4: File './Bonnie.12312', size: 2097152000, volumes: 3
Writing with putc()...         Bonnie: drastic I/O error (putc): No space left on device

(sometimes also: no such file or directory)

Afterwards, there's very little or no space to write to afs anymore.

delgado at count:/afs/dfn-cert.de/public/TEST$ cat /dev/zero > foo
cat: write error: No space left on device

delgado at count:/afs/dfn-cert.de/public/TEST$ ls -l foo
-rw-r--r--  1 delgado users 0 Dec  7 18:44 foo

The cache partition is full, but arla reports low cache use (with fs
getcacheparms). There are no bonnie files in the test directory.

/dev/sda12           1012M  960M  168K 100% /var/cache/afs

delgado at count:/afs/dfn-cert.de/public/TEST$ fs getcache
Arla is using 57 of the cache's available 1024000 1K byte blocks
(and 67 of the cache's available 40000 vnodes)

Also a different error occurs sometimes:

Bonnie 1.4: File './Bonnie.13536', size: 2097152000, volumes: 3
Writing with putc()...         Bonnie: drastic I/O error (putc): No such device

delgado at count:/afs/dfn-cert.de/public/TEST$ ls /afs
ls: /afs: No such device

arlad dies in this case and can't be restarted, since the listening
socket is still in use. Reboot is the only thing that helps.

Afterwards the Bonnie files are present in the TEST directory, but
have 0 byte length.

Unfortunately it's not possible to backtrace arlad with gdb:

count:~ # gdb /usr/local/arla/libexec/arlad
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i586-suse-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run -z -n
Starting program: /usr/local/arla/libexec/arlad -z -n
[Thread debugging using libthread_db enabled]
[New Thread -1212078112 (LWP 12471)]
2004-12-07 18:46:00 CET: arla: suidcell: cell  doesn't exist in the db
Couldn't get registers: No such process.
(gdb) bt
Cannot fetch general-purpose registers for thread -1212078112: generic error
(gdb)

I was also able to elicit a kernel oops this way:

Dec  7 18:29:34 count kernel: Unable to handle kernel paging request at virtual address e02b8b24
Dec  7 18:29:34 count kernel:  printing eip:
Dec  7 18:29:34 count kernel: c0170bd4
Dec  7 18:29:34 count kernel: *pde = 01566067
Dec  7 18:29:34 count kernel: Oops: 0000 [#1]
Dec  7 18:29:34 count kernel: SMP
Dec  7 18:29:34 count kernel: Modules linked in: nnpfs edd joydev sg st sr_mod nvram thermal processor fan snd_pcm_oss button snd_mixer_oss battery ac snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd_page_alloc gameport snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore af_packet usbhid ipv6 tg3 intel_agp agpgart ehci_hcd uhci_hcd evdev usbcore binfmt_misc subfs dm_mod ide_cd cdrom ext3 jbd ata_piix libata sd_mod scsi_mod
Dec  7 18:29:34 count kernel: CPU:    1
Dec  7 18:29:34 count kernel: EIP:    0060:[prune_dcache+180/416]    Tainted: GF  VLI
Dec  7 18:29:34 count kernel: EIP:    0060:[<c0170bd4>]    Tainted: GF  VLI
Dec  7 18:29:34 count kernel: EFLAGS: 00010282   (2.6.9-1.1_dpath-smp 20041109140641)
Dec  7 18:29:34 count kernel: EIP is at prune_dcache+0xb4/0x1a0
Dec  7 18:29:34 count kernel: eax: e02b8b10   ebx: c60efb70   ecx: c60efb80   edx: c60efb00
Dec  7 18:29:34 count kernel: esi: c670f680   edi: c670f688   ebp: 00000009   esp: c1789f0c
Dec  7 18:29:34 count kernel: ds: 007b   es: 007b   ss: 0068
Dec  7 18:29:34 count kernel: Process kswapd0 (pid: 45, threadinfo=c1788000 task=c17851d0)
Dec  7 18:29:34 count kernel: Stack: c0422c20 00000086 c1788000 df684a20 c0171052 c014893c 0005c300 00000000
Dec  7 18:29:34 count kernel:        00000bb8 0000007d 00000000 000000d0 00000020 00000001 c037a680 c037a680
Dec  7 18:29:34 count kernel:        00000bb7 c0149f36 c1789f9c 00000000 0000000c 00000000 00000000 00000020
Dec  7 18:29:34 count kernel: Call Trace:
Dec  7 18:29:34 count kernel:  [shrink_dcache_memory+18/64] shrink_dcache_memory+0x12/0x40
Dec  7 18:29:34 count kernel:  [<c0171052>] shrink_dcache_memory+0x12/0x40
Dec  7 18:29:34 count kernel:  [shrink_slab+268/400] shrink_slab+0x10c/0x190
Dec  7 18:29:34 count kernel:  [<c014893c>] shrink_slab+0x10c/0x190
Dec  7 18:29:34 count kernel:  [balance_pgdat+534/768] balance_pgdat+0x216/0x300
Dec  7 18:29:34 count kernel:  [<c0149f36>] balance_pgdat+0x216/0x300
Dec  7 18:29:34 count kernel:  [kswapd+197/240] kswapd+0xc5/0xf0
Dec  7 18:29:34 count kernel:  [<c014a0e5>] kswapd+0xc5/0xf0
Dec  7 18:29:34 count kernel:  [autoremove_wake_function+0/48] autoremove_wake_function+0x0/0x30
Dec  7 18:29:34 count kernel:  [<c011fa20>] autoremove_wake_function+0x0/0x30
Dec  7 18:29:34 count kernel:  [ret_from_fork+6/32] ret_from_fork+0x6/0x20
Dec  7 18:29:34 count kernel:  [<c0106d16>] ret_from_fork+0x6/0x20
Dec  7 18:29:34 count kernel:  [autoremove_wake_function+0/48] autoremove_wake_function+0x0/0x30
Dec  7 18:29:34 count kernel:  [<c011fa20>] autoremove_wake_function+0x0/0x30
Dec  7 18:29:34 count kernel:  [kswapd+0/240] kswapd+0x0/0xf0
Dec  7 18:29:34 count kernel:  [<c014a020>] kswapd+0x0/0xf0
Dec  7 18:29:34 count kernel:  [kernel_thread_helper+5/12] kernel_thread_helper+0x5/0xc
Dec  7 18:29:34 count kernel:  [<c0105279>] kernel_thread_helper+0x5/0xc
Dec  7 18:29:34 count kernel: Code: 00 8b 56 3c 8b 48 04 89 4a 04 89 40 04 89 11 89 46 3c 8d 46 08 e8 dd 5e 1a 00 b8 80 4b 3e c0 e8 d3 5e 1a 00 8b 46 48 85 c0 74 0b <8b> 48 14 85 c9 0f 85 b8 00 00 00 89 d8 e8 5a 27 00 00 8b 5e 10
Dec  7 18:29:41 count kernel:  <1>Unable to handle kernel paging request at virtual address e02b8b24
Dec  7 18:29:41 count kernel:  printing eip:
Dec  7 18:29:41 count kernel: c0170bd4
Dec  7 18:29:41 count kernel: *pde = 01566067
Dec  7 18:29:41 count kernel: Oops: 0000 [#2]
Dec  7 18:29:41 count kernel: SMP
Dec  7 18:29:41 count kernel: Modules linked in: nnpfs edd joydev sg st sr_mod nvram thermal processor fan snd_pcm_oss button snd_mixer_oss battery ac snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd_page_alloc gameport snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore af_packet usbhid ipv6 tg3 intel_agp agpgart ehci_hcd uhci_hcd evdev usbcore binfmt_misc subfs dm_mod ide_cd cdrom ext3 jbd ata_piix libata sd_mod scsi_mod
Dec  7 18:29:41 count kernel: CPU:    1
Dec  7 18:29:41 count kernel: EIP:    0060:[prune_dcache+180/416]    Tainted: GF  VLI
Dec  7 18:29:41 count kernel: EIP:    0060:[<c0170bd4>]    Tainted: GF  VLI
Dec  7 18:29:41 count kernel: EFLAGS: 00010282   (2.6.9-1.1_dpath-smp 20041109140641)
Dec  7 18:29:41 count kernel: EIP is at prune_dcache+0xb4/0x1a0
Dec  7 18:29:41 count kernel: eax: e02b8b10   ebx: deb0d2b0   ecx: deb0d2c0   edx: deb0d200
Dec  7 18:29:41 count kernel: esi: dacc0560   edi: dacc0568   ebp: 00000080   esp: c2ab7e10
Dec  7 18:29:41 count kernel: ds: 007b   es: 007b   ss: 0068
Dec  7 18:29:41 count kernel: Process arlad (pid: 13475, threadinfo=c2ab6000 task=da1badf0)
Dec  7 18:29:41 count kernel: Stack: c0422c20 00000083 c2ab6000 df684a20 c0171052 c014893c 000f5500 00000000
Dec  7 18:29:41 count kernel:        0001c753 00000008 00000000 000000d0 00000020 0000000c c2ab7eb4 c037e280
Dec  7 18:29:41 count kernel:        00000000 c0149c54 0001c752 00000000 000000d0 00000020 00000020 00000020
Dec  7 18:29:41 count kernel: Call Trace:
Dec  7 18:29:41 count kernel:  [shrink_dcache_memory+18/64] shrink_dcache_memory+0x12/0x40
Dec  7 18:29:41 count kernel:  [<c0171052>] shrink_dcache_memory+0x12/0x40
Dec  7 18:29:41 count kernel:  [shrink_slab+268/400] shrink_slab+0x10c/0x190
Dec  7 18:29:41 count kernel:  [<c014893c>] shrink_slab+0x10c/0x190
Dec  7 18:29:41 count kernel:  [try_to_free_pages+228/432] try_to_free_pages+0xe4/0x1b0
Dec  7 18:29:41 count kernel:  [<c0149c54>] try_to_free_pages+0xe4/0x1b0
Dec  7 18:29:41 count kernel:  [__alloc_pages+543/832] __alloc_pages+0x21f/0x340
Dec  7 18:29:41 count kernel:  [<c014254f>] __alloc_pages+0x21f/0x340
Dec  7 18:29:41 count kernel:  [__get_free_pages+24/48] __get_free_pages+0x18/0x30
Dec  7 18:29:41 count kernel:  [<c0142688>] __get_free_pages+0x18/0x30
Dec  7 18:29:41 count kernel:  [__pollwait+100/160] __pollwait+0x64/0xa0
Dec  7 18:29:41 count kernel:  [<c016c3c4>] __pollwait+0x64/0xa0
Dec  7 18:29:41 count kernel:  [datagram_poll+18/192] datagram_poll+0x12/0xc0
Dec  7 18:29:41 count kernel:  [<c029e262>] datagram_poll+0x12/0xc0
Dec  7 18:29:41 count kernel:  [sock_poll+18/32] sock_poll+0x12/0x20
Dec  7 18:29:41 count kernel:  [<c02987b2>] sock_poll+0x12/0x20
Dec  7 18:29:41 count kernel:  [do_select+482/800] do_select+0x1e2/0x320
Dec  7 18:29:41 count kernel:  [<c016c6d2>] do_select+0x1e2/0x320
Dec  7 18:29:41 count kernel:  [__pollwait+0/160] __pollwait+0x0/0xa0
Dec  7 18:29:41 count kernel:  [<c016c360>] __pollwait+0x0/0xa0
Dec  7 18:29:41 count kernel:  [sys_select+760/1232] sys_select+0x2f8/0x4d0
Dec  7 18:29:41 count kernel:  [<c016cb28>] sys_select+0x2f8/0x4d0
Dec  7 18:29:41 count kernel:  [sysenter_past_esp+82/121] sysenter_past_esp+0x52/0x79
Dec  7 18:29:41 count kernel:  [<c0106dc9>] sysenter_past_esp+0x52/0x79
Dec  7 18:29:41 count kernel: Code: 00 8b 56 3c 8b 48 04 89 4a 04 89 40 04 89 11 89 46 3c 8d 46 08 e8 dd 5e 1a 00 b8 80 4b 3e c0 e8 d3 5e 1a 00 8b 46 48 85 c0 74 0b <8b> 48 14 85 c9 0f 85 b8 00 00 00 89 d8 e8 5a 27 00 00 8b 5e 10
Dec  7 18:31:54 count arla[13577]: rx_init: Unknown error 4294967289
Dec  7 18:31:57 count kernel:  VFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice day...
Dec  7 18:31:59 count kernel: Found sys_call_table at c0375a40
Dec  7 18:32:00 count arla[13595]: rx_init: Unknown error 4294967289
Dec  7 18:32:24 count kernel: Found sys_call_table at c0375a40
Dec  7 18:32:25 count arla[13616]: rx_init: Unknown error 4294967289
Dec  7 18:32:29 count arla[13623]: rx_init: Unknown error 4294967289
Dec  7 18:32:45 count arla[13640]: rx_init: Unknown error 4294967289
Dec  7 18:33:34 count arla[13666]: rx_init: Unknown error 4294967289
Dec  7 18:33:46 count arla[13671]: rx_init: Unknown error 4294967289

I suspect this bug might be triggered under different circumstances as
well (i.e. when not using bonnie) since there are occasional oopses on
shutdown.

We're using arla 0.37 (the same bug was present in 0.36.2) and Vanilla
kernel 2.6.9 (with a small patch from andrew morton to circumvent and
trace a problem with dpath, see
http://gossamer-threads.com/lists/linux/kernel/496626?do=post_view_threaded)
on a SuSE 9.1 installation. We're using the openafs 1.2.11 and heimdal
0.6.1rc3 rpms from the SuSE distribution on the server and the
clients.

If you need any further information, debug output, whatever, please
feel free to ask.

Kind regards and thanks in advance
     FDF
--
Friedrich Delgado Friedrichs (IT-Services), DFN-CERT Services GmbH
https://www.dfn-cert.de/, +49 40 808077-555 (Hotline)
12. DFN-CERT Workshop und Tutorien, CCH Hamburg, 2-3. Maerz 2005
Infos/Anmeldung unter: https://www.dfn-cert.de/events/ws/2005/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgp00001.pgp
Type: application/octet-stream
Size: 480 bytes
Desc: "PGP signature"
Url : http://lists.stacken.kth.se/pipermail/arla-drinkers/attachments/00000000/26af2a94/pgp00001.obj


More information about the Arla-drinkers mailing list