Revisited: CPU Utilization by the mogilefsd process

mischa.z mischa.z at gmail.com
Wed Feb 21 16:37:07 UTC 2007


Hi,

I am still experiencing the CPU utilization problem that was mentioned 
on the list and got patched last week with svn740 (mogilefsd busywait 
bug). I'm starting to wonder if my problem is unrelated to the reported 
one. I've collected some debug info that hopefully gives you guys a clue 
of what's going on. The tracker I am using is running on a test machine. 
No client was using the tracker at the point when the CPU load went up 
(which basically happened shortly after mogilefsd got started).
Our live system is using svn518 which is fine but we run into the same 
behaviour there when using the newest version.
Any ideas?
Thanks,
Mischa



uname -a
Linux 2.6.10 #2 SMP Wed Feb 16 19:18:05 UTC 2005 i686 GNU/Linux

svn info
Path: .
URL: http://code.sixapart.com/svn/mogilefs/trunk
Repository UUID: f67b2e87-0811-0410-a7e0-dd94e48410d6
Revision: 740


top:
  6825 mog       21   0 13288  10m 2288 R 90.7  0.9   1:49.45 mogilefsd 
 

  6998 mog       16   0 14876  12m 2388 R 10.9  1.0   0:15.66 mogilefsd 
 
 
 

  6827 mog       15   0 14940  12m 2312 S  3.6  1.0   0:05.59 mogilefsd 
 

  6840 mog       15   0 13288  10m 2296 S  1.6  0.9   0:02.23 mogilefsd 
 

  6826 mog       15   0 13156  10m 2280 S  1.3  0.8   0:01.80 mogilefsd 
 

  6835 mog       15   0 13156  10m 2280 S  1.3  0.8   0:01.30 mogilefsd 
 

  6836 mog       15   0 13156  10m 2280 S  1.3  0.8   0:01.31 mogilefsd 
 

...

log:
beginning run
Job checker has only 0, wants 1, making 1.
Job replicate has only 0, wants 1, making 1.
Job delete has only 0, wants 1, making 1.
Job queryworker has only 0, wants 10, making 10.
Job monitor has only 0, wants 1, making 1.
Job reaper has only 0, wants 1, making 1.
[replicate(6827)] no_source: Source is no longer available replicating 
1294283
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 44:  took 
2.00 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 43:  took 
2.00 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 50:  took 
2.00 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 45:  took 
2.01 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 46:  took 
2.01 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 44:  took 
2.00 seconds out of 2 allowed
[monitor(6839)] Timeout contacting machine 10.0.10.24 for dev 43:  took 
2.00 seconds out of 2 allowed
crash log: Error writing: Resource temporarily unavailable at 
/usr/local/share/perl/5.8.4/MogileFS/Worker.pm line 89.

Child 6839 (monitor) died: 256 (UNEXPECTED)
Use of uninitialized value in hash element at 
/usr/local/share/perl/5.8.4/MogileFS/ProcManager.pm line 663.
Job monitor has only 0, wants 1, making 1.
[replicate(6827)] no_source: Source is no longer available replicating 
1294283
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 45:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 46:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 46:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 47:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 45:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 45:  took 
2.01 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 50:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 51:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 49:  took 
2.01 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 48:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 49:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 44:  took 
2.00 seconds out of 2 allowed
[replicate(6827)] no_source: Source is no longer available replicating 
1294283
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 51:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 43:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 52:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 43:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 51:  took 
2.00 seconds out of 2 allowed
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 45:  took 
2.00 seconds out of 2 allowed
[replicate(6827)] no_source: Source is no longer available replicating 
1294283
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 49:  took 
2.00 seconds out of 2 allowed
[replicate(6827)] Fid 1361265 unreachable while replicating: Error: 
Resource http://10.0.10.24:7500/dev51/0/001/361/0001361265.fid failed: 
HTTP 404
[replicate(6827)] Failed copying fid 1361265 from devid 51 to devid 36 
(error type: src_error)
[replicate(6827)] source_down: No devices available replicating 1361265
[monitor(6998)] Timeout contacting machine 10.0.10.24 for dev 53:  took 
2.00 seconds out of 2 allowed


lsof -p 6825
COMMAND    PID USER   FD   TYPE     DEVICE    SIZE      NODE NAME
mogilefsd 6825  mog  cwd    DIR        3,1    4096    916492 /etc/mogilefs
mogilefsd 6825  mog  rtd    DIR        3,1    4096         2 /
mogilefsd 6825  mog  txt    REG        3,1 1057324    605198 /usr/bin/perl
mogilefsd 6825  mog  mem    REG        3,1  692456   1161022 
/usr/lib/libdb3.so.3.0.2
mogilefsd 6825  mog  mem    REG        3,1   34748   1602542 
/lib/tls/libnss_files-2.3.2.so
mogilefsd 6825  mog  mem    REG        3,1   17860   3090541 
/lib/libnss_db-2.2.so
mogilefsd 6825  mog  mem    REG        3,1   73304   1602539 
/lib/tls/libnsl-2.3.2.so
mogilefsd 6825  mog  mem    REG        3,1   67468   1161241 
/usr/lib/libz.so.1.2.2
mogilefsd 6825  mog  mem    REG        3,1  243592   1161395 
/usr/lib/libmysqlclient.so.12.0.0
mogilefsd 6825  mog  mem    REG        3,1   94800   1618856 
/usr/lib/perl5/auto/DBD/mysql/mysql.so
mogilefsd 6825  mog  mem    REG        3,1  139892   1622559 
/usr/lib/perl5/auto/DBI/DBI.so
mogilefsd 6825  mog  mem    REG        3,1   24220   3548791 
/usr/local/lib/perl/5.8.4/auto/BSD/Resource/Resource.so
mogilefsd 6825  mog  mem    REG        3,1   31548   3908282 
/usr/lib/perl/5.8.4/auto/List/Util/Util.so
mogilefsd 6825  mog  mem    REG        3,1   15556   3909133 
/usr/lib/perl/5.8.4/auto/Time/HiRes/HiRes.so
mogilefsd 6825  mog  mem    REG        3,1   15864   3909131 
/usr/lib/perl/5.8.4/auto/Sys/Syslog/Syslog.so
mogilefsd 6825  mog  mem    REG        3,1    6428   3909129 
/usr/lib/perl/5.8.4/auto/Sys/Hostname/Hostname.so
mogilefsd 6825  mog  mem    REG        3,1    9688   3911175 
/usr/lib/perl/5.8.4/auto/Cwd/Cwd.so
mogilefsd 6825  mog  mem    REG        3,1  114896   3911192 
/usr/lib/perl/5.8.4/auto/POSIX/POSIX.so
mogilefsd 6825  mog  mem    REG        3,1   22284   3911195 
/usr/lib/perl/5.8.4/auto/Socket/Socket.so
mogilefsd 6825  mog  mem    REG        3,1   17920   3911189 
/usr/lib/perl/5.8.4/auto/IO/IO.so
mogilefsd 6825  mog  mem    REG        3,1   18876   1602535 
/lib/tls/libcrypt-2.3.2.so
mogilefsd 6825  mog  mem    REG        3,1 1254468   1602534 
/lib/tls/libc-2.3.2.so
mogilefsd 6825  mog  mem    REG        3,1   78233   1602547 
/lib/tls/libpthread-0.60.so
mogilefsd 6825  mog  mem    REG        3,1  134496   1602537 
/lib/tls/libm-2.3.2.so
mogilefsd 6825  mog  mem    REG        3,1    9872   1602536 
/lib/tls/libdl-2.3.2.so
mogilefsd 6825  mog  mem    REG        3,1   90248    310751 
/lib/ld-2.3.2.so
mogilefsd 6825  mog    0u   CHR     136,17                19 /dev/pts/17
mogilefsd 6825  mog    1u   CHR     136,17                19 /dev/pts/17
mogilefsd 6825  mog    2u   CHR     136,17                19 /dev/pts/17
mogilefsd 6825  mog    3r   REG        3,1    2131   4300841 
/usr/bin/mogilefsd
mogilefsd 6825  mog    4u  sock        0,4         514924715 can't 
identify protocol
mogilefsd 6825  mog    5u  IPv4  514924716               TCP 
*:afs3-callback (LISTEN)
mogilefsd 6825  mog    6r  0000        0,8       0 514924717 unknown 
inode type
mogilefsd 6825  mog    7u  unix 0xcee9e700         514924718 socket
mogilefsd 6825  mog    8u  unix 0xdc632b00         514924720 socket
mogilefsd 6825  mog    9u  unix 0xf5050580         514924722 socket
mogilefsd 6825  mog   10u  unix 0xdc632700         514924724 socket
mogilefsd 6825  mog   11u  unix 0xce17f580         514924726 socket
mogilefsd 6825  mog   12u  unix 0xce17fd80         514924728 socket
mogilefsd 6825  mog   13u  unix 0xdc632300         514924730 socket
mogilefsd 6825  mog   14u  unix 0xdc632d00         514924732 socket
mogilefsd 6825  mog   15u  unix 0xce17f980         514924734 socket
mogilefsd 6825  mog   16u  unix 0xecb24d00         514924736 socket
mogilefsd 6825  mog   17u  unix 0xf5050d80         514924738 socket
mogilefsd 6825  mog   18u  unix 0xecb24900         514924740 socket
mogilefsd 6825  mog   19u  unix 0xc477f580         514924742 socket
mogilefsd 6825  mog   20u  unix 0xebfd6680         514933899 socket
mogilefsd 6825  mog   21u  unix 0xeabcf480         514924746 socket


strace -p 6825
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=8, u64=8}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
time(NULL)                              = 1172074616
epoll_wait(6, {{EPOLLOUT, {u32=9, u64=9}}, {EPOLLERR|EPOLLHUP|0x807a000, 
{u32=134717464, u64=29313286512664}}}, 1000, 250) = 2
  (goes on forever)


More information about the mogilefs mailing list