[patch] edge-triggered memcached

Anatoly Vorobey mellon@pobox.com
Wed, 3 Sep 2003 02:14:27 +0300


--6TrnltStXW4iwmi0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

I'm experimenting with memcached using edge-triggered epoll, trying
to see if this would work. With the patch I'm attaching to this
message, it seems to work pretty well, so anyone who feels like
trying it out is welcome to do so; please let me know, if you do,
whether it works for you.

A few words of explanation. "Edge-triggered" signalling means that
the waiting thread is signalled when the status of the IO channel is
changed from "not ready" to "ready"; for example, if it's an input
channel - from "no data to read" to "has some data to read". The catch
is that if you're reading from the channel, you have to read and read
until there's nothing left - if there's something left and you try
to wait on the channel again, its status will never change from
"no data" to "has data" (because it's already in the "has data" mode)
and you'll hang. epoll was originally edge-triggered, but the author
made it level-triggered (in this model, the waiting thread is kept
being signalled as long as the channel status is "ready") by 
default about half a year ago because too many programmers found 
edge-triggered confusing. It's still possible to use it edge-triggered 
by setting a flag on the event.

Since our persistent "runaway event" bug (in which some event starts
signalling all the time, apparently w/o being handled by us, and eating
lots of CPU time) keeps eluding us, and is almost certainly an epoll bug
rather than ours or libevent's  (and it's damn hard to study or debug - 
it happens only on live LiveJournal data, and very rarely at that) I 
thought we might try to work around it by using edge-triggered events;
in the edge-triggered model, an event simply *can't* keep signalling all 
the time w/o being handled - it'll only signal once, when the status
of the IO channel changes. And memcached already *is* written
so that it should conform to the edge-triggered model, that is,
all input is read until there's nothing left and only then we return
to the system to wait for more events.

We currently use epoll through libevent, not directly, and it doesn't
abstract or export changing level-triggered events to edge-triggered. 
To try out memcached with edge-triggered epoll, you'd have to patch
and recompile libevent with the patch to epoll.c I'm also attaching
to this message; then build memcached (patched with the other patch
I'm attaching, or updated from CVS after it makes it there) with 
something like: 
gcc -g -static -o memcached-et *.c ../path-to-libevent-sources/libevent.a

If this experiment proves successful and desirable, we may either
try to get the libevent maintainer to add the possibility of using
edge-triggered epoll, or ditch libevent and use epoll/kqueue (on
Linux/FreeBSD respectively) ourselves.

All comments, reviews, etc. are welcome.

Anatoly.

P.S. Explanation of the memcached patch: the socket that listen()'s
to new connections is also handled through epoll, and "ready" for it
means there are new connections to accept(). However, currently we
do one accept() and return to waiting for the next event; this
means that if several new connections arrive nearly simultaneously,
we'll return without "reading the whole data" (accept()ing all the
connections) and this socket will never get signalled again, so
no new connections will be accepted. That's the only violation of the
edge-triggered "rule" I could find in our sources so far.

-- 
avva

--6TrnltStXW4iwmi0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff_epoll

--- ../libevent-0.7a/epoll.c	Wed Apr  9 11:06:32 2003
+++ epoll.c	Thu Aug 28 02:47:26 2003
@@ -252,6 +252,7 @@
 	evep = &epollop->fds[fd];
 	op = EPOLL_CTL_ADD;
 	events = 0;
+        events |= EPOLLET;
 	if (evep->evread != NULL) {
 		events |= EPOLLIN;
 		op = EPOLL_CTL_MOD;
@@ -299,6 +300,7 @@
 
 	op = EPOLL_CTL_DEL;
 	events = 0;
+        events |= EPOLLET;
 
 	if (ev->ev_events & EV_READ)
 		events |= EPOLLIN;
@@ -308,11 +310,11 @@
 	if ((events & (EPOLLIN|EPOLLOUT)) != (EPOLLIN|EPOLLOUT)) {
 		if ((events & EPOLLIN) && evep->evwrite != NULL) {
 			needwritedelete = 0;
-			events = EPOLLOUT;
+			events = EPOLLOUT | EPOLLET;
 			op = EPOLL_CTL_MOD;
 		} else if ((events & EPOLLOUT) && evep->evread != NULL) {
 			needreaddelete = 0;
-			events = EPOLLIN;
+			events = EPOLLIN | EPOLLET;
 			op = EPOLL_CTL_MOD;
 		}
 	}

--6TrnltStXW4iwmi0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff_memcached

main -> wcmtools          src/memcached/memcached.c
--- cvs/wcmtools/memcached/memcached.c	Thu Aug 28 02:05:07 2003
+++ src/memcached/memcached.c	Tue Sep  2 16:09:12 2003
@@ -794,26 +794,27 @@
             addrlen = sizeof(addr);
             if ((sfd = accept(c->sfd, &addr, &addrlen)) == -1) {
                 if (errno == EAGAIN || errno == EWOULDBLOCK) {
-                    perror("accept() shouldn't block");
+                    exit = 1;
+                    break;
                 } else {
                     perror("accept()");
                 }
-                return;
+                break;
             }
             if ((flags = fcntl(sfd, F_GETFL, 0)) < 0 ||
                 fcntl(sfd, F_SETFL, flags | O_NONBLOCK) < 0) {
                 perror("setting O_NONBLOCK");
                 close(sfd);
-                return;
+                break;
             }            
             newc = conn_new(sfd, conn_read, EV_READ | EV_PERSIST);
             if (!newc) {
                 if (settings.verbose > 0)
                     fprintf(stderr, "couldn't create new connection\n");
                 close(sfd);
-                return;
+                break;
             }
-            exit = 1;
+
             break;
 
         case conn_read:

--6TrnltStXW4iwmi0--