TCP_NOPUSH and Mac OS X
Richard Cameron
camster at citeulike.org
Sat Mar 5 12:20:42 PST 2005
There was some discussion on this list last year about some fairly
serious performance problems on Mac OS X. I was seeing these too. I
think I've isolated the problem to the TCP_NOPUSH option, and there's a
one-line hack which seems to solve it.
On OS X 10.3.8, running memcached locally and connecting to it on
localhost, the symptom was a latency of about 0.2 seconds between
sending a command down the socket to the server and getting a reply.
A tcpdump showed that the delay was *exactly* 200ms on every request.
A kdump, however, showed that memcached was actually writing its
response to the socket pretty much instantaneously.
The relevant hack which seemed to get things working again was to
simply comment out the line in memcached.c which set TCP_NOPUSH:
#ifdef TCP_NOPUSH
// setsockopt(c->sfd, IPPROTO_TCP, TCP_NOPUSH, &val, sizeof(val));
#endif
It doesn't seem to be well known (at least, Google doesn't know) that
TCP_NOPUSH is simply broken on OS X, and there was some evidence on the
list that some people managed to get memcached running "out of the box"
without this sort of latency. I'd be interested to know if that's still
the case as it might shed a little more light on the problem.
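As an alternative to commenting the line out, the call could be compiled out on Mac OS X only. The following is just a sketch: set_nopush_if_safe is a made-up helper name, not anything in memcached, and it assumes the compiler predefines __APPLE__ on OS X.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of memcached.
 * Returns 1 if TCP_NOPUSH was set, 0 if it was skipped, -1 on error. */
int set_nopush_if_safe(int sfd)
{
#if defined(TCP_NOPUSH) && !defined(__APPLE__)
    int val = 1;
    if (setsockopt(sfd, IPPROTO_TCP, TCP_NOPUSH, &val, sizeof(val)) < 0)
        return -1;
    return 1;
#else
    (void)sfd;  /* option unavailable here, or skipped on Mac OS X */
    return 0;
#endif
}
```

Note that this leaves TCP_NOPUSH defined, so any code guarded by a test on that macro compiles exactly as it did before.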
However, I'm quite willing to conclude there is some underlying problem
with the operating system, as things continue to get even stranger:
As I couldn't use TCP_NOPUSH, I put a "#undef TCP_NOPUSH" at the top of
the file, which has the effect of making the code set TCP_NODELAY on
the socket. This is exactly what I wanted:
#if !defined(TCP_NOPUSH)
setsockopt(sfd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof(flags));
#endif
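To confirm that the option really ends up set on the socket, it can be read back with getsockopt. A small sketch, where enable_nodelay is a hypothetical helper name and not something in memcached:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enables TCP_NODELAY and reads the option back.
 * Returns 1 if the option is on afterwards, 0 if it is off, -1 on error. */
int enable_nodelay(int sfd)
{
    int flags = 1;
    int current = 0;
    socklen_t len = sizeof(current);

    if (setsockopt(sfd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof(flags)) < 0)
        return -1;
    if (getsockopt(sfd, IPPROTO_TCP, TCP_NODELAY, &current, &len) < 0)
        return -1;
    return current != 0;
}
```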
With TCP_NODELAY set, things worked quite nicely (about a factor of 3
speedup over the lo interface), but when I load tested it for an
extended period (about 5 minutes) it fairly reliably caused a kernel
panic (stack trace attached below for interest). Dropping the
TCP_NODELAY option again seemed to "fix" things, but I've got no idea
whether that's simply because it slows things down enough that whatever
race condition in the kernel is causing the panic no longer gets
triggered.
Does anyone else see this, or is it just a (rather annoying) quirk of
my machine?
Richard
*********
Sat Mar 5 19:33:12 2005
Unresolved kernel trap(cpu 0): 0x300 - Data access
DAR=0x0000000000000014 PC=0x000000000020C8F4
Latest crash info for cpu 0:
Exception state (sv=0x31747C80)
PC=0x0020C8F4; MSR=0x00009030; DAR=0x00000014; DSISR=0x40000000;
LR=0x0020C800; R1=0x12213C20; XCP=0x0000000C (0x300 - Data access)
Backtrace:
0x40471D84 0x0020C330 0x002463E4 0x00094160 0x01C465A0
Proceeding back via exception chain:
Exception state (sv=0x31747C80)
previously dumped as "Latest" state. skipping...
Exception state (sv=0x28307000)
PC=0x9002E1CC; MSR=0x0000F030; DAR=0x1C3EB004; DSISR=0x40000000;
LR=0x00007B38; R1=0xBFFFF910; XCP=0x00000030 (0xC00 - System call)
Kernel version:
Darwin Kernel Version 7.8.0:
Wed Dec 22 14:26:17 PST 2004; root:xnu/xnu-517.11.1.obj~1/RELEASE_PPC
panic(cpu 0): 0x300 - Data access
Latest stack backtrace for cpu 0:
Backtrace:
0x000835F8 0x00083ADC 0x0001EDA4 0x00090BD8 0x00093FCC
Proceeding back via exception chain:
Exception state (sv=0x31747C80)
PC=0x0020C8F4; MSR=0x00009030; DAR=0x00000014; DSISR=0x40000000;
LR=0x0020C800; R1=0x12213C20; XCP=0x0000000C (0x300 - Data access)
Backtrace:
0x40471D84 0x0020C330 0x002463E4 0x00094160 0x01C465A0
Exception state (sv=0x28307000)
PC=0x9002E1CC; MSR=0x0000F030; DAR=0x1C3EB004; DSISR=0x40000000;
LR=0x00007B38; R1=0xBFFFF910; XCP=0x00000030 (0xC00 - System call)
Kernel version:
Darwin Kernel Version 7.8.0:
Wed Dec 22 14:26:17 PST 2004; root:xnu/xnu-517.11.1.obj~1/RELEASE_PPC
*********