memcached hiccups?

Mon Mar 26 17:02:44 UTC 2007

Something strange is going on that I can't figure out: one of our 
memcached boxes periodically chokes.

"Periodically" means roughly every 6-8 days, with some variation.

"Choke" usually means sock_to_host fails from lots of clients, so we 
take the box out of rotation, check it later, and put it back in.
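
For anyone who wants to reproduce the check, here is a minimal sketch of a probe that could be run against each instance while the box is out of rotation. It only tests whether a TCP connection can be opened within a timeout, which is an assumption about what the client-side failure amounts to. The names probe and probe_all are hypothetical helpers, not part of any memcached client.

```python
import socket


def probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and connects, honouring timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_all(host, ports, timeout=1.0):
    """Probe several instances on one host; returns {port: reachable}."""
    return {port: probe(host, port, timeout) for port in ports}
```

Calling probe_all("poseidon", [11211, 11212, 11213, 11214]) would return one True/False entry per instance. The port numbers are placeholders, since the real ports of the four instances aren't listed here.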

It's memcached-1.2.1 from the rpm in dag's repository, using libevent 
1.3b from that same repo.

I'm running 4 instances on this box (4 Opteron cores; I haven't had the 
guts to try the multithreaded one yet), all of which fail 
simultaneously. One of the four instances logs: "Failed to write, 
and not due to blocking: Connection reset by peer". The others don't 
log anything (they're only running with -v).

Finally, the syslog fills up with lots and lots of these:

"Mar 26 09:50:20 poseidon kernel: printk: 13382 messages suppressed."

But there's no indication of what message it was that was suppressed.
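
To line the suppression bursts up against the outages, a small parser along these lines could pull the timestamp and count out of each such syslog line. This is a rough sketch that assumes the line format shown above. SUPPRESSED and suppressed_counts are hypothetical names.

```python
import re

# Matches lines like:
#   Mar 26 09:50:20 poseidon kernel: printk: 13382 messages suppressed.
SUPPRESSED = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"printk:\s+(?P<count>\d+)\s+messages suppressed\."
)


def suppressed_counts(lines):
    """Return a list of (timestamp, count) for each 'messages suppressed' line."""
    results = []
    for line in lines:
        # Tolerate surrounding whitespace and quotes around a pasted line.
        match = SUPPRESSED.match(line.strip().strip('"'))
        if match:
            results.append((match.group("stamp"), int(match.group("count"))))
    return results
```

Feeding it the lines of the syslog file (for example via open(path), where the path depends on the distribution) gives a list that can be summed or compared against the times the box was pulled from rotation.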

There's nothing else running on the box, there's no swap, it's not 
running out of RAM, etc.

Has anyone else seen this or anything like it?

Thanks,

Don