memcached crashing

Jon Valvatne jon@valvatne.com
Wed, 30 Jun 2004 06:51:05 +0200


No core file. I've attached the last part of the strace output; my
mailer wasn't being nice with the wrapping.

Jon

----------------------------------------------------------------------
On Tue, 29 Jun 2004 21:37:00 -0700 (PDT)
Brad Fitzpatrick <brad@danga.com> wrote:

> Scary.
> 
> Run it with -r to increase core file size, and make sure the user you
> run it as has permission to write to the directory you start it from.
> (with -r it won't chdir to /)
> 
> Then with the core file, we can inspect it with gdb.
> 
> But maybe it's not crashing and just quitting, like the event loop is
> ending.
> 
> In that case, run it in the foreground but with strace in front of it:
> 
> strace ./memcached .....
> 
> Then paste what you see as its final output.
> 
> 
> 
> On Wed, 30 Jun 2004, Jon Valvatne wrote:
> 
> > Ok; thanks for the heads-up. I recompiled libevent without rtsig
> > support, but that doesn't seem to have changed anything at all.
> > Still random crashes and refused connections.
> >
> > Is there any way to get any sort of debug information out of
> > memcached when it crashes?
> >
> > Jon
> >
> > ----------------------------------------------------------------------
> > On Tue, 29 Jun 2004 21:10:53 -0700 (PDT)
> > Brad Fitzpatrick <brad@danga.com> wrote:
> >
> > > Do *not* use libevent's rtsig support.  I thought he removed that
> > > given how buggy it was.  Three really smart people worked on it
> > > for quite some time without getting it anywhere near reliable.
> > > It's just a crap interface and it was never made to work with
> > > libevent.
> > >
> > > Use poll if you must, but epoll's really the best.
> > >
> > > - Brad
> > >
> > >
> > > On Wed, 30 Jun 2004, Jon Valvatne wrote:
> > >
> > > > Hello,
> > > >
> > > > I've been using memcached to add some caching to a production
> > > > system to speed things up. Everything worked smoothly on my test
> > > > box, but I ran into nothing but problems when trying to go live
> > > > with the changes: Memcached would just die randomly, without any
> > > > error message whatsoever, within minutes of startup. And even
> > > > while it was running and accepting some connections, other
> > > > connections appeared to be randomly refused.
> > > >
> > > > The only difference between the test box and the production
> > > > system is that one is running Fedora Core 2, and the other
> > > > Redhat 9. Before I try to debug the situation more, I would like
> > > > to ask: Does anyone here have any experience running memcached
> > > > with Redhat 9? There's obviously no epoll support, so I compiled
> > > > the latest libevent with --with-rtsig, and I'm assuming that's
> > > > what memcached is using. Is this just inherently buggy, or so
> > > > poor-performing that my system with about a hundred connections
> > > > and several operations per second will cause the problem I'm
> > > > seeing?
> > > >
> > > > One thing that worried me was the test results when compiling
> > > > libevent:
> > > >
> > > > Running tests:
> > > > KQUEUE
> > > > Skipping test
> > > > POLL
> > > >  test-eof: OKAY
> > > >  test-weof: OKAY
> > > >  test-time: OKAY
> > > >  regress: FAILED
> > > > SELECT
> > > >  test-eof: OKAY
> > > >  test-weof: OKAY
> > > >  test-time: OKAY
> > > >  regress: FAILED
> > > > RTSIG
> > > >  test-eof: OKAY
> > > >  test-weof: OKAY
> > > >  test-time: OKAY
> > > >  regress: FAILED
> > > > EPOLL
> > > > Skipping test
> > > >
> > > > What are these regress tests, and what would cause them to fail?
> > > >
> > > > By the way: Is there any way to ask memcached or libevent which
> > > > polling mechanism is being used?
> > > >
> > > > Thanks in advance,
> > > >
> > > > Jon Valvatne
> > > >
> > > >
> >
> >

Attachment: strace.txt


getpid()                                = 23479
fcntl64(6, F_SETOWN, 23479)             = 0
fcntl64(6, F_SETFL, O_RDWR|O_NONBLOCK|O_ASYNC) = 0
accept(3, {sa_family=AF_INET, sin_port=htons(51627), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
fcntl64(7, F_GETFL)                     = 0x2 (flags O_RDWR)
fcntl64(7, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
fstat64(7, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fcntl64(7, F_GETFL)                     = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl64(7, F_SETSIG, 0x22)              = 0
getpid()                                = 23479
fcntl64(7, F_SETOWN, 23479)             = 0
fcntl64(7, F_SETFL, O_RDWR|O_NONBLOCK|O_ASYNC) = 0
accept(3, 0xbfffdc90, [16])             = -1 EAGAIN (Resource temporarily unavailable)
read(4, "version\r\n", 16384)           = 9
read(4, 0x8056f41, 16375)               = -1 EAGAIN (Resource temporarily unavailable)
setsockopt(4, SOL_TCP, TCP_CORK, [1], 4) = 0
write(4, "VERSION 1.1.11\r\n", 16)      = 16
setsockopt(4, SOL_TCP, TCP_CORK, [0], 4) = 0
read(4, "version\r\nquit\r\n", 16384)   = 15
read(4, "", 16369)                      = 0
close(4)                                = 0
read(12, "get tp1|index.html.1|1088570644."..., 2048) = 36
read(12, 0x8061f04, 2012)               = -1 EAGAIN (Resource temporarily unavailable)
setsockopt(12, SOL_TCP, TCP_CORK, [1], 4) = 0
time(NULL)                              = 1088570656
time(NULL)                              = 1088570656
write(12, "VALUE tp1|index.html.1|108857064"..., 47) = 47
write(12, "O:8:\"template\":22:{s:9:\"start_ta"..., 304477) = 65485
write(12, ":{s:4:\"name\";a:1:{i:0;a:1:{s:9:\""..., 238992) = 114689
write(12, "{s:8:\"elements\";a:1:{s:7:\"regula"..., 124303) = -1 EAGAIN (Resource temporarily unavailable)
fstat64(12, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fcntl64(12, F_GETFL)                    = 0x2802 (flags O_RDWR|O_NONBLOCK|O_ASYNC)
gettimeofday({1088570656, 291247}, NULL) = 0
gettimeofday({1088570656, 291314}, NULL) = 0
rt_sigtimedwait([IO 34], {si_signo=SIGRT_2, si_code=0x1, si_pid=65, si_uid=13, si_value={int=1, ptr=0x1}}, 0xbfffdc58, 8) = 34
rt_sigtimedwait([IO 34], {si_signo=SIGRT_2, si_code=0x1, si_pid=65, si_uid=4, si_value={int=1, ptr=0x1}}, 0xbfffdc58, 8) = 34
fcntl64(4, F_GETFL)                     = -1 EBADF (Bad file descriptor)
exit_group(0)                           = ?
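
A note on reading the trace above: descriptor 4 is closed by close(4),
and the last call before exit_group(0) is fcntl64(4, F_GETFL) failing
with EBADF. exit_group(0) is a normal exit rather than death by a
signal, which would be consistent with no core file being written. The
trace also still shows F_SETSIG and rt_sigtimedwait calls, so the traced
binary may still have been using the rtsig build of libevent. Whether
the stale descriptor is what makes the event loop give up is an
assumption; the trace alone does not prove it.

For spotting the same pattern in a longer strace log, a minimal Python
sketch follows. The names find_stale_fd_calls and SAMPLE are
hypothetical, and the regular expression only covers the line shapes
seen in this trace.

```python
import re

# One strace line: syscall name, optional numeric first argument, result text.
CALL_RE = re.compile(r'^(\w+)\((?:(\d+)(?=[,)]))?.*\)\s+=\s+(.+)$')

def find_stale_fd_calls(lines):
    """Return (syscall, fd) pairs that failed with EBADF after fd was closed."""
    closed = set()
    stale = []
    for line in lines:
        m = CALL_RE.match(line.strip())
        if not m:
            continue
        name, arg, result = m.groups()
        fd = int(arg) if arg is not None else None
        if name == "close" and result.startswith("0"):
            closed.add(fd)
        elif name == "accept" and not result.startswith("-1"):
            # accept() returns a new descriptor, which may reuse a closed number.
            closed.discard(int(result.split()[0]))
        elif fd in closed and "EBADF" in result:
            stale.append((name, fd))
    return stale

# A few lines taken from the trace above.
SAMPLE = [
    'close(4)                                = 0',
    'read(12, 0x8061f04, 2012)               = -1 EAGAIN (Resource temporarily unavailable)',
    'fcntl64(4, F_GETFL)                     = -1 EBADF (Bad file descriptor)',
    'exit_group(0)                           = ?',
]

if __name__ == "__main__":
    print(find_stale_fd_calls(SAMPLE))  # prints [('fcntl64', 4)]
```

The sketch only tracks close() and accept(); a fuller version would
also have to follow socket(), open(), dup() and similar calls that hand
out descriptor numbers.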
