Revisited: CPU Utilization by the mogilefsd process
Jared Klett
jared at blip.tv
Thu Feb 22 23:33:06 UTC 2007
hi all,
I, too, have experienced this issue while running svn740. I
tried restarting mogilefsd, made sure it was calm, and then started
inserting a few large files. I inserted a 981 MB file, and mogilefsd
started taking 100% CPU again.
my versions of Danga::Socket and other dependencies are the
latest as of a couple weeks ago.
in an effort to make this a more easily digestible e-mail, I
pieced through the strace output and prepared a file without the
duplicated lines:
http://pokkari.com/mogile/file4-strace-snip.txt
here is the raw, uncut strace (about 3 MB of text):
http://pokkari.com/mogile/file4-strace-raw.txt
and the output of lsof -p 28553:
http://pokkari.com/mogile/file4-lsof.txt
here's the relevant info from uname, top, and lsof:
uname -a
Linux file4 2.6.18-gentoo-r6 #1 SMP Thu Jan 4 16:48:58 EST 2007 x86_64
Intel(R) Pentium(R) D CPU 2.80GHz GNU/Linux
top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28553 mogile 25 0 44004 17m 2696 R 99 0.9 5:10.58 mogilefsd
28553 mogile 25 0 44004 17m 2696 R 99 0.9 5:25.47 mogilefsd
28572 mogile 15 0 51148 19m 1928 S 5 1.0 2:58.78 mogilefsd
16190 root 16 0 490m 466m 1904 S 1 23.2 551:25.01 mogstored
28565 mogile 15 0 43796 16m 1720 S 0 0.8 0:30.05 mogilefsd
28568 mogile 15 0 43596 16m 1740 S 0 0.8 0:10.44 mogilefsd
28570 mogile 15 0 43452 15m 1740 S 0 0.8 0:10.29 mogilefsd
28571 mogile 15 0 43600 16m 1740 S 0 0.8 0:13.31 mogilefsd
28573 mogile 15 0 42500 14m 1468 S 0 0.7 0:14.71 mogilefsd
cheers,
- Jared
--
Jared Klett
Co-founder, Blip.tv
646.526.8948 (cell)
JaredAtWrok (aim)
http://blog.blip.tv
-----Original Message-----
From: mogilefs-bounces at lists.danga.com
[mailto:mogilefs-bounces at lists.danga.com] On Behalf Of mischa.z
Sent: Wednesday, February 21, 2007 11:37 AM
To: mogilefs at lists.danga.com
Subject: Revisited: CPU Utilization by the mogilefsd process
Hi,
I am still experiencing the CPU utilization problem that was mentioned
on the list and got patched last week with svn740 (mogilefsd busywait
bug). I'm starting to wonder if my problem is unrelated to the reported
one. I've collected some debug info that hopefully gives you guys a clue
of what's going on. The tracker I am using is running on a test machine.
No client was using the tracker at the point when the CPU load went up
(which basically happened shortly after mogilefsd got started).
Our live system is using svn518 which is fine but we run into the same
behaviour there when using the newest version.
Any ideas?
Thanks,
Mischa
uname -a
Linux 2.6.10 #2 SMP Wed Feb 16 19:18:05 UTC 2005 i686 GNU/Linux
svn info
Path: .
URL: http://code.sixapart.com/svn/mogilefs/trunk
Repository UUID: f67b2e87-0811-0410-a7e0-dd94e48410d6
Revision: 740
top:
6825 mog 21 0 13288 10m 2288 R 90.7 0.9 1:49.45 mogilefsd
6998 mog 16 0 14876 12m 2388 R 10.9 1.0 0:15.66 mogilefsd
6827 mog 15 0 14940 12m 2312 S 3.6 1.0 0:05.59 mogilefsd
6840 mog 15 0 13288 10m 2296 S 1.6 0.9 0:02.23 mogilefsd
6826 mog 15 0 13156 10m 2280 S 1.3 0.8 0:01.80 mogilefsd
6835 mog 15 0 13156 10m 2280 S 1.3 0.8 0:01.30 mogilefsd
6836 mog 15 0 13156 10m 2280 S 1.3 0.8 0:01.31 mogilefsd
...
log:
beginning run
Job checker has only 0, wants 1, making 1.
Job replicate has only 0, wants 1, making 1.
Job delete has only 0, wants 1, making 1.
Job queryworker has only 0, wants 10, making 10.
Job monitor has only 0, wants 1, making 1.
Job reaper has only 0, wants 1, making 1.
[replicate(6827)] no_source: Source is no longer available replicating
1294283
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 44: took
2.00 seconds out of 2 allowed [monitor(6839)] Timeout contacting machine
10.0.10.24 for dev 43: took 2.00 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 50: took
2.00 seconds out of 2 allowed [monitor(6839)] Timeout contacting machine
10.0.10.24 for dev 45: took
2.01 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 46: took
2.01 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 44: took
2.00 seconds out of 2 allowed [monitor(6839)] Timeout contacting machine
10.0.10.24 for dev 43: took 2.00 seconds out of 2 allowed crash log:
Error writing: Resource temporarily unavailable at
/usr/local/share/perl/5.8.4/MogileFS/Worker.pm line 89.
Child 6839 (monitor) died: 256 (UNEXPECTED) Use of uninitialized value
in hash element at /usr/local/share/perl/5.8.4/MogileFS/ProcManager.pm
line 663.
Job monitor has only 0, wants 1, making 1.
[replicate(6827)] no_source: Source is no longer available replicating
1294283
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 45: took
2.00 seconds out of 2 allowed [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 46: took 2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 46: took
2.00 seconds out of 2 allowed [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 47: took 2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 45: took
2.00 seconds out of 2 allowed [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 45: took
2.01 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 50: took
2.00 seconds out of 2 allowed [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 51: took 2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 49: took
2.01 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 48: took
2.00 seconds out of 2 allowed [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 49: took 2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 44: took
2.00 seconds out of 2 allowed [replicate(6827)] no_source: Source is no
longer available replicating
1294283
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 51: took
2.00 seconds out of 2 allowed [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 43: took 2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 52: took
2.00 seconds out of 2 allowed [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 43: took 2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 51: took
2.00 seconds out of 2 allowed [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 45: took 2.00 seconds out of 2 allowed
[replicate(6827)] no_source: Source is no longer available replicating
1294283
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 49: took
2.00 seconds out of 2 allowed [replicate(6827)] Fid 1361265 unreachable
while replicating: Error:
Resource http://10.0.10.24:7500/dev51/0/001/361/0001361265.fid failed:
HTTP 404
[replicate(6827)] Failed copying fid 1361265 from devid 51 to devid 36
(error type: src_error) [replicate(6827)] source_down: No devices
available replicating 1361265 [monitor(6998)] Timeout contacting machine
10.0.10.24 for dev 53: took 2.00 seconds out of 2 allowed
lsof -p 6825
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
mogilefsd 6825 mog cwd DIR 3,1 4096 916492
/etc/mogilefs
mogilefsd 6825 mog rtd DIR 3,1 4096 2 /
mogilefsd 6825 mog txt REG 3,1 1057324 605198
/usr/bin/perl
mogilefsd 6825 mog mem REG 3,1 692456 1161022
/usr/lib/libdb3.so.3.0.2
mogilefsd 6825 mog mem REG 3,1 34748 1602542
/lib/tls/libnss_files-2.3.2.so
mogilefsd 6825 mog mem REG 3,1 17860 3090541
/lib/libnss_db-2.2.so
mogilefsd 6825 mog mem REG 3,1 73304 1602539
/lib/tls/libnsl-2.3.2.so
mogilefsd 6825 mog mem REG 3,1 67468 1161241
/usr/lib/libz.so.1.2.2
mogilefsd 6825 mog mem REG 3,1 243592 1161395
/usr/lib/libmysqlclient.so.12.0.0
mogilefsd 6825 mog mem REG 3,1 94800 1618856
/usr/lib/perl5/auto/DBD/mysql/mysql.so
mogilefsd 6825 mog mem REG 3,1 139892 1622559
/usr/lib/perl5/auto/DBI/DBI.so
mogilefsd 6825 mog mem REG 3,1 24220 3548791
/usr/local/lib/perl/5.8.4/auto/BSD/Resource/Resource.so
mogilefsd 6825 mog mem REG 3,1 31548 3908282
/usr/lib/perl/5.8.4/auto/List/Util/Util.so
mogilefsd 6825 mog mem REG 3,1 15556 3909133
/usr/lib/perl/5.8.4/auto/Time/HiRes/HiRes.so
mogilefsd 6825 mog mem REG 3,1 15864 3909131
/usr/lib/perl/5.8.4/auto/Sys/Syslog/Syslog.so
mogilefsd 6825 mog mem REG 3,1 6428 3909129
/usr/lib/perl/5.8.4/auto/Sys/Hostname/Hostname.so
mogilefsd 6825 mog mem REG 3,1 9688 3911175
/usr/lib/perl/5.8.4/auto/Cwd/Cwd.so
mogilefsd 6825 mog mem REG 3,1 114896 3911192
/usr/lib/perl/5.8.4/auto/POSIX/POSIX.so
mogilefsd 6825 mog mem REG 3,1 22284 3911195
/usr/lib/perl/5.8.4/auto/Socket/Socket.so
mogilefsd 6825 mog mem REG 3,1 17920 3911189
/usr/lib/perl/5.8.4/auto/IO/IO.so
mogilefsd 6825 mog mem REG 3,1 18876 1602535
/lib/tls/libcrypt-2.3.2.so
mogilefsd 6825 mog mem REG 3,1 1254468 1602534
/lib/tls/libc-2.3.2.so
mogilefsd 6825 mog mem REG 3,1 78233 1602547
/lib/tls/libpthread-0.60.so
mogilefsd 6825 mog mem REG 3,1 134496 1602537
/lib/tls/libm-2.3.2.so
mogilefsd 6825 mog mem REG 3,1 9872 1602536
/lib/tls/libdl-2.3.2.so
mogilefsd 6825 mog mem REG 3,1 90248 310751
/lib/ld-2.3.2.so
mogilefsd 6825 mog 0u CHR 136,17 19 /dev/pts/17
mogilefsd 6825 mog 1u CHR 136,17 19 /dev/pts/17
mogilefsd 6825 mog 2u CHR 136,17 19 /dev/pts/17
mogilefsd 6825 mog 3r REG 3,1 2131 4300841
/usr/bin/mogilefsd
mogilefsd 6825 mog 4u sock 0,4 514924715 can't
identify protocol
mogilefsd 6825 mog 5u IPv4 514924716 TCP
*:afs3-callback (LISTEN)
mogilefsd 6825 mog 6r 0000 0,8 0 514924717 unknown
inode type
mogilefsd 6825 mog 7u unix 0xcee9e700 514924718 socket
mogilefsd 6825 mog 8u unix 0xdc632b00 514924720 socket
mogilefsd 6825 mog 9u unix 0xf5050580 514924722 socket
mogilefsd 6825 mog 10u unix 0xdc632700 514924724 socket
mogilefsd 6825 mog 11u unix 0xce17f580 514924726 socket
mogilefsd 6825 mog 12u unix 0xce17fd80 514924728 socket
mogilefsd 6825 mog 13u unix 0xdc632300 514924730 socket
mogilefsd 6825 mog 14u unix 0xdc632d00 514924732 socket
mogilefsd 6825 mog 15u unix 0xce17f980 514924734 socket
mogilefsd 6825 mog 16u unix 0xecb24d00 514924736 socket
mogilefsd 6825 mog 17u unix 0xf5050d80 514924738 socket
mogilefsd 6825 mog 18u unix 0xecb24900 514924740 socket
mogilefsd 6825 mog 19u unix 0xc477f580 514924742 socket
mogilefsd 6825 mog 20u unix 0xebfd6680 514933899 socket
mogilefsd 6825 mog 21u unix 0xeabcf480 514924746 socket
strace -p 6825
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL) = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000,
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
(goes on forever)
More information about the mogilefs
mailing list