TCP_NOPUSH and Mac OS X

Sat Mar 5 12:20:42 PST 2005

There was some discussion on this list last year about fairly serious 
performance problems on Mac OS X. I was seeing these too, and I think 
I've isolated the problem to the TCP_NOPUSH option. There's a one-line 
hack which seems to solve it.

On OS X 10.3.8, running memcached locally and connecting to it on 
localhost, the symptom was a latency of about 0.2 seconds between 
sending a command down the socket to the server and getting a reply. 
A tcpdump showed that the delay was *exactly* 200ms on every request. 
However, a kdump showed that memcached was actually writing its 
response to the socket pretty much instantaneously.
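
For anyone who wants to check for the same symptom, a round trip can be 
timed from the client side with something like the following. This is 
only a rough sketch: time_round_trip() and elapsed_ms() are names made 
up for illustration, and the socket is assumed to be already connected 
to the server.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Milliseconds elapsed between two gettimeofday() samples. */
static double elapsed_ms(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000.0 +
           (end->tv_usec - start->tv_usec) / 1000.0;
}

/*
 * Hypothetical helper: write one command to an already connected socket,
 * block until the first bytes of the reply arrive, and return the round
 * trip in milliseconds, or -1.0 on error.
 */
double time_round_trip(int fd, const char *cmd, size_t len)
{
    char reply[1024];
    struct timeval start, end;

    assert(cmd != NULL);

    gettimeofday(&start, NULL);
    if (write(fd, cmd, len) != (ssize_t)len)
        return -1.0;
    if (read(fd, reply, sizeof(reply)) <= 0)
        return -1.0;
    gettimeofday(&end, NULL);

    return elapsed_ms(&start, &end);
}
```

Calling time_round_trip(fd, "version\r\n", 9) in a loop and printing the 
results should make a fixed delay on every request easy to spot.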

The relevant hack which seemed to get things working again was to 
simply comment out the line in memcached.c which set TCP_NOPUSH:

#ifdef TCP_NOPUSH
//    setsockopt(c->sfd, IPPROTO_TCP, TCP_NOPUSH, &val, sizeof(val));
#endif

It doesn't seem to be well known (at least, Google doesn't know) that 
TCP_NOPUSH is simply broken on OS X, and there was some evidence on the 
list that some people managed to get memcached running "out of the box" 
without this sort of latency. I'd be interested to know if that's still 
the case as it might shed a little more light on the problem.
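
If the option really is only a problem on OS X, a narrower alternative 
to commenting the line out might be a platform guard along these lines. 
This is a sketch, not what memcached.c does: the helper name is 
invented, and it assumes the compiler predefines __APPLE__ on OS X.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: enable TCP_NOPUSH where the platform defines it,
 * except on Apple systems. Returns 0 on success or when the option is
 * skipped, -1 if setsockopt() fails.
 */
int enable_nopush_except_on_apple(int fd)
{
#if defined(TCP_NOPUSH) && !defined(__APPLE__)
    int val = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NOPUSH, &val, sizeof(val));
#else
    (void)fd;   /* option unavailable or deliberately skipped */
    return 0;
#endif
}
```

That would leave the behaviour on other BSDs untouched while skipping 
the option on the one platform where I see the delay.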

However, I'm quite willing to conclude there is some underlying problem 
with the operating system, as things continue to get even stranger:

As I couldn't use TCP_NOPUSH, I put a "#undef TCP_NOPUSH" at the top of 
the file, which has the effect of making the code set TCP_NODELAY on 
the socket. This is exactly what I wanted:

#if !defined(TCP_NOPUSH)
     setsockopt(sfd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof(flags));
#endif

This worked quite nicely (about a factor of 3 speedup over the lo 
interface), but when I load tested it for an extended period (about 5 
minutes) it fairly reliably caused a kernel panic (stack trace 
attached below for interest). Dropping the TCP_NODELAY option again 
seemed to "fix" things, but I can't tell whether that is simply 
because it slows things down enough that whatever race condition in 
the kernel is causing the panic no longer triggers. 
Does anyone else see this, or is it just a (rather annoying) quirk of 
my machine?
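
For anyone trying to reproduce this, it may help to toggle TCP_NODELAY 
at run time and read it back, so the same build can be load tested both 
ways. A rough sketch follows; set_nodelay() and get_nodelay() are 
made-up names and not part of memcached.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Hypothetical helper: turn TCP_NODELAY on or off; 0 on success, -1 on error. */
int set_nodelay(int fd, int on)
{
    int flag = on ? 1 : 0;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}

/* Returns 1 if TCP_NODELAY is set, 0 if it is clear, -1 on error. */
int get_nodelay(int fd)
{
    int flag = 0;
    socklen_t len = sizeof(flag);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) != 0)
        return -1;
    return flag != 0;
}
```

Checking the return value also rules out the possibility that the 
setsockopt call is silently failing.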

Richard

*********

Sat Mar  5 19:33:12 2005

Unresolved kernel trap(cpu 0): 0x300 - Data access DAR=0x0000000000000014 PC=0x000000000020C8F4
Latest crash info for cpu 0:
    Exception state (sv=0x31747C80)
       PC=0x0020C8F4; MSR=0x00009030; DAR=0x00000014; DSISR=0x40000000; LR=0x0020C800; R1=0x12213C20; XCP=0x0000000C (0x300 - Data access)
       Backtrace:
          0x40471D84 0x0020C330 0x002463E4 0x00094160 0x01C465A0
Proceeding back via exception chain:
    Exception state (sv=0x31747C80)
       previously dumped as "Latest" state. skipping...
    Exception state (sv=0x28307000)
       PC=0x9002E1CC; MSR=0x0000F030; DAR=0x1C3EB004; DSISR=0x40000000; LR=0x00007B38; R1=0xBFFFF910; XCP=0x00000030 (0xC00 - System call)

Kernel version:
Darwin Kernel Version 7.8.0:
Wed Dec 22 14:26:17 PST 2004; root:xnu/xnu-517.11.1.obj~1/RELEASE_PPC

panic(cpu 0): 0x300 - Data access
Latest stack backtrace for cpu 0:
       Backtrace:
          0x000835F8 0x00083ADC 0x0001EDA4 0x00090BD8 0x00093FCC
Proceeding back via exception chain:
    Exception state (sv=0x31747C80)
       PC=0x0020C8F4; MSR=0x00009030; DAR=0x00000014; DSISR=0x40000000; LR=0x0020C800; R1=0x12213C20; XCP=0x0000000C (0x300 - Data access)
       Backtrace:
          0x40471D84 0x0020C330 0x002463E4 0x00094160 0x01C465A0
    Exception state (sv=0x28307000)
       PC=0x9002E1CC; MSR=0x0000F030; DAR=0x1C3EB004; DSISR=0x40000000; LR=0x00007B38; R1=0xBFFFF910; XCP=0x00000030 (0xC00 - System call)

Kernel version:
Darwin Kernel Version 7.8.0:
Wed Dec 22 14:26:17 PST 2004; root:xnu/xnu-517.11.1.obj~1/RELEASE_PPC

*********