GNU malloc vs. slab alloc

Brad Fitzpatrick brad at danga.com
Thu Apr 27 23:53:41 UTC 2006


Back in the day, memcached used GNU malloc.  After a week of usage in 
production on LiveJournal, CPU usage would start growing, and then 
growing really rapidly.  After day 8, it'd be pretty much unusable, as 
memcached would be at 99.9% CPU usage the whole time.

So we switched to Judy.... same thing.

Then we wrote the slab allocator and haven't had problems in years. 
(and literally, we've had memcached instances running for nearly a year 
at a time....)

So I'm a little skeptical.  I'm not sure your test is harsh enough.


Torsten Foertsch wrote:
> Hi,
> 
> what is the best way to submit patches to memcached and get them accepted?
> 
> I have a patch for revision 260 ready that implements:
> 
> - item expiry relative to the least recent access
>   If a negative expiry timeout is set the actual expiry time is
>   it->time-it->exptime.
> - a "touch" operation
>   adjusts it->time. This is actually a "get" without transferring data.
> - active expiry
>   expired items are removed by delete_handler() a function that is called
>   every 5 seconds.
> - using system malloc instead of the slab allocator is turned from a compile
>   time define into a command line option, see below.
> - a new command line option that emulates a full cache, i.e. all set
>   operations return error (for testing clients)
> 
> 
> Situations when GNU malloc isn't bad at all
> 
> After reflecting on the fragmentation problem it occurred to me that with 
> active expiry and most of the items having an expiry time assigned using GNU 
> malloc wouldn't be that bad. I wrote then a test program that stresses the 
> server with a set operation with a random key followed by 2 get operations 
> with keys randomly selected from the 200 recently generated keys. Between 
> these operations the program sleeps for 0.15 sec. Thus, the server is 
> stressed with approx. 2 set and 4 get operations per second. Item sizes vary 
> randomly between 1 byte and 1 megabyte. Expiry times vary from 300 to 900 
> seconds after last usage.
> The server was started with the -M option and the test program counts failed 
> and succeeded set operations.
> 
> The program ran for over a week against 3 servers, GNU malloc with 256 and 320 
> MB limit and slab alloc with 384 MB limit. Each server processed over a 
> million set operations. The out of memory rate for the slab allocator with 
> 384 MB was more than 6 times higher than that for GNU malloc with 256 MB. GNU 
> malloc with 320 produced only 0.05% of the out of mem errors of the slab 
> allocator, in fact only 29 errors out of 1,286,612 set operations.
> 
> To estimate the effect of fragmentation on the error rate I drew a chart 
> showing the number of errors depending on the number of set operations. For 
> the slab allocator I expected a very straight line for there is no 
> fragmentation in principle. If fragmentation was an issue for GNU malloc I 
> would have expected a nonlinear curve with a lower slope a t start and a 
> higher slope at the end. But that did not happen. The curve is very straight, 
> as well.
> 
> To draw a conclusion I would say that for this usage pattern GNU malloc is to 
> be preferred because it utilizes the memory much better than the slab 
> allocator and does not suffer from fragmentation problems.
> 
> 
> What is wrong with these considerations?
> 
> Torsten
> 
> Attachments:
> memcached.png: the chart
> bigpatch: the patch
> x.pl: the test program 
> 
> 
> ------------------------------------------------------------------------
> 
> 
> ------------------------------------------------------------------------
> 
> Index: slabs.c
> ===================================================================
> --- slabs.c	(revision 260)
> +++ slabs.c	(working copy)
> @@ -23,7 +23,8 @@
>  
>  #include "memcached.h"
>  
> -#define POWER_SMALLEST 3
> +#define POWER_SMALLEST 6        /* sizeof(item)>=32 plus some data. */
> +                                /* hence, minimal block == 64 bytes */
>  #define POWER_LARGEST  20
>  #define POWER_BLOCK 1048576
>  
> @@ -49,8 +50,9 @@
>  } slabclass_t;
>  
>  static slabclass_t slabclass[POWER_LARGEST+1];
> -static unsigned int mem_limit = 0;
> -static unsigned int mem_malloced = 0;
> +static unsigned long mem_limit = 0;
> +static unsigned long mem_malloced = 0;
> +static int use_system_malloc=0;
>  
>  unsigned int slabs_clsid(unsigned int size) {
>      int res = 1;
> @@ -67,24 +69,29 @@
>      return res;
>  }
>  
> -void slabs_init(unsigned int limit) {
> +void slabs_init(unsigned int limit, int use_malloc) {
>      int i;
>      int size=1;
>  
>      mem_limit = limit;
> -    for(i=0; i<=POWER_LARGEST; i++, size*=2) {
> -        slabclass[i].size = size;
> -        slabclass[i].perslab = POWER_BLOCK / size;
> -        slabclass[i].slots = 0;
> -        slabclass[i].sl_curr = slabclass[i].sl_total = slabclass[i].slabs = 0;
> -        slabclass[i].end_page_ptr = 0;
> -        slabclass[i].end_page_free = 0;
> -        slabclass[i].slab_list = 0;
> -        slabclass[i].list_size = 0;
> -        slabclass[i].killing = 0;
> +    use_system_malloc = use_malloc;
> +
> +    if( !use_system_malloc ) {
> +        for(i=0; i<=POWER_LARGEST; i++, size*=2) {
> +            slabclass[i].size = size;
> +            slabclass[i].perslab = POWER_BLOCK / size;
> +            slabclass[i].slots = 0;
> +            slabclass[i].sl_curr = slabclass[i].sl_total
> +                                 = slabclass[i].slabs = 0;
> +            slabclass[i].end_page_ptr = 0;
> +            slabclass[i].end_page_free = 0;
> +            slabclass[i].slab_list = 0;
> +            slabclass[i].list_size = 0;
> +            slabclass[i].killing = 0;
> +        }
> +
> +        slabs_preallocate(limit / POWER_BLOCK);
>      }
> -
> -    slabs_preallocate(limit / POWER_BLOCK);
>  }
>  
>  void slabs_preallocate (unsigned int maxslabs) {
> @@ -147,16 +154,20 @@
>      if (id < POWER_SMALLEST || id > POWER_LARGEST)
>          return 0;
>  
> +    if (use_system_malloc) {
> +        void *vp;
> +
> +        if (mem_limit && mem_malloced + size > mem_limit)
> +            return 0;
> +
> +        vp=malloc(size);
> +        if( vp ) mem_malloced += size;
> +        return vp;
> +    }
> +    
>      p = &slabclass[id];
>      assert(p->sl_curr == 0 || ((item*)p->slots[p->sl_curr-1])->slabs_clsid == 0);
>  
> -#ifdef USE_SYSTEM_MALLOC
> -    if (mem_limit && mem_malloced + size > mem_limit)
> -        return 0;
> -    mem_malloced += size;
> -    return malloc(size);
> -#endif
> -    
>      /* fail unless we have space at the end of a recently allocated page,
>         we have something on our freelist, or we could allocate a new page */
>      if (! (p->end_page_ptr || p->sl_curr || slabs_newslab(id)))
> @@ -189,14 +200,14 @@
>      if (id < POWER_SMALLEST || id > POWER_LARGEST)
>          return;
>  
> +    if (use_system_malloc) {
> +        mem_malloced -= size;
> +        free(ptr);
> +        return;
> +    }
> +
>      p = &slabclass[id];
>  
> -#ifdef USE_SYSTEM_MALLOC
> -    mem_malloced -= size;
> -    free(ptr);
> -    return;
> -#endif
> -
>      if (p->sl_curr == p->sl_total) { /* need more space on the free list */
>          int new_size = p->sl_total ? p->sl_total*2 : 16;  /* 16 is arbitrary */
>          void **new_slots = realloc(p->slots, new_size*sizeof(void *));
> @@ -308,3 +319,8 @@
>  
>      return 1;
>  }
> +
> +int
> +slabs_incr_memlimit( unsigned long limit ) {
> +    return mem_limit+=limit;
> +}
> Index: memcached.c
> ===================================================================
> --- memcached.c	(revision 260)
> +++ memcached.c	(working copy)
> @@ -64,7 +64,8 @@
>      /* no. of seconds in 30 days - largest possible delta exptime */
>      #define REALTIME_MAXDELTA 60*60*24*30
>  
> -    if (exptime == 0) return 0; /* 0 means never expire */
> +    if (exptime <= 0) return exptime; /* 0 means never expire */
> +                                      /* <0 relative to it->time */
>  
>      if (exptime > REALTIME_MAXDELTA)
>          return exptime;
> @@ -292,7 +293,9 @@
>              old_it = 0;
>          }
>  
> -        if (old_it && old_it->exptime && old_it->exptime < now) {
> +        if (old_it && old_it->exptime &&
> +            ((old_it->exptime < 0 && old_it->time-old_it->exptime < now) ||
> +             (old_it->exptime > 0 && old_it->exptime < now))) {
>              item_unlink(old_it);
>              old_it = 0;
>          }
> @@ -357,6 +360,8 @@
>          pos += sprintf(pos, "STAT bytes_read %llu\r\n", stats.bytes_read);
>          pos += sprintf(pos, "STAT bytes_written %llu\r\n", stats.bytes_written);
>          pos += sprintf(pos, "STAT limit_maxbytes %u\r\n", settings.maxbytes);
> +        pos += sprintf(pos, "STAT use_malloc %u\r\n", settings.use_malloc);
> +        pos += sprintf(pos, "STAT default_expiry %d\r\n", settings.default_exptime);
>          pos += sprintf(pos, "END");
>          out_string(c, temp);
>          return;
> @@ -493,6 +498,22 @@
>          return;
>      }
>  
> +    if (strcmp(command, "stats expiry")==0) {
> +        int bytes = 0;
> +        char *buf = item_exp_stats(&bytes);
> +        if (! buf) {
> +            out_string(c, "SERVER_ERROR out of memory");
> +            return;
> +        }
> +
> +        c->write_and_free = buf;
> +        c->wcurr = buf;
> +        c->wbytes = bytes;
> +        c->state = conn_write;
> +        c->write_and_go = conn_read;
> +        return;
> +    }
> +
>      out_string(c, "ERROR");
>  }
>  
> @@ -594,7 +615,10 @@
>          if (it && (it->it_flags & ITEM_DELETED)) {
>              it = 0;
>          }
> -        if (it && it->exptime && it->exptime < now) {
> +
> +        if (it && it->exptime &&
> +            ((it->exptime < 0 && it->time-it->exptime < now) ||
> +             (it->exptime > 0 && it->exptime < now))) {
>              item_unlink(it);
>              it = 0;
>          }
> @@ -675,7 +699,10 @@
>                  item_unlink(it);
>                  it = 0;
>              }
> -            if (it && it->exptime && it->exptime < now) {
> +
> +            if (it && it->exptime &&
> +                ((it->exptime < 0 && it->time-it->exptime < now) ||
> +                 (it->exptime > 0 && it->exptime < now))) {
>                  item_unlink(it);
>                  it = 0;
>              }
> @@ -708,6 +735,56 @@
>          }
>      }
>  
> +    if (strncmp(command, "touch ", 6) == 0) {
> +
> +        char *start = command + 6;
> +        char key[251];
> +        int next;
> +        item *it;
> +        time_t now = time(0);
> +
> +        if (settings.managed) {
> +            int bucket = c->bucket;
> +            if (bucket == -1) {
> +                out_string(c, "CLIENT_ERROR no BG data in managed mode");
> +                return;
> +            }
> +            c->bucket = -1;
> +            if (buckets[bucket] != c->gen) {
> +                out_string(c, "ERROR_NOT_OWNER");
> +                return;
> +            }
> +        }
> +
> +        while(sscanf(start, " %250s%n", key, &next) >= 1) {
> +            start+=next;
> +            stats.get_cmds++;
> +            it = assoc_find(key);
> +            if (it && (it->it_flags & ITEM_DELETED)) {
> +                it = 0;
> +            }
> +            if (settings.oldest_live && settings.oldest_live <= now &&
> +                it && it->time <= settings.oldest_live) {
> +                item_unlink(it);
> +                it = 0;
> +            }
> +
> +            if (it && it->exptime &&
> +                ((it->exptime < 0 && it->time-it->exptime < now) ||
> +                 (it->exptime > 0 && it->exptime < now))) {
> +                item_unlink(it);
> +                it = 0;
> +            }
> +
> +            if (it) {
> +                stats.get_hits++;
> +                item_update(it);
> +            } else stats.get_misses++;
> +        }
> +        out_string(c, "END");
> +        return;
> +    }
> +
>      if (strncmp(command, "delete ", 7) == 0) {
>          char key[251];
>          item *it;
> @@ -740,6 +817,11 @@
>              return;
>          }
>  
> +        if (exptime < 0) {
> +            out_string(c, "CLIENT_ERROR invalid expiration time");
> +            return;
> +        }
> +
>          if (delcurr >= deltotal) {
>              item **new_delete = realloc(todelete, sizeof(item *) * deltotal * 2);
>              if (new_delete) {
> @@ -855,6 +937,11 @@
>              return;
>          }
>  
> +        if (exptime < 0) {
> +            out_string(c, "ERROR invalid expiration time");
> +            return;
> +        }
> +
>          settings.oldest_live = realtime(exptime);
>          out_string(c, "OK");
>          return;
> @@ -892,6 +979,37 @@
>          return;
>      }
>      
> +    if (strncmp(command, "increment memlimit ", 19) == 0) {
> +        unsigned long mbytes;
> +        char *start = command+19;
> +        if (sscanf(start, "%u\r\n", &mbytes) == 1) {
> +            char *buf;
> +            int bytes;
> +
> +            buf = malloc(1024);
> +            if (!buf) {
> +                out_string(c, "SERVER_ERROR out of memory");
> +                return;
> +            }
> +            settings.maxbytes=slabs_incr_memlimit( mbytes*1024*1024 );
> +            bytes=snprintf( buf, 1024, "new limit: %ld bytes\r\nEND\r\n",
> +                            settings.maxbytes );
> +            if( bytes<0 || bytes>1024 ) {
> +                free(buf);
> +                out_string(c, "SERVER_ERROR out of memory");
> +                return;
> +            }
> +            c->write_and_free = buf;
> +            c->wcurr = buf;
> +            c->wbytes = bytes;
> +            c->state = conn_write;
> +            c->write_and_go = conn_read;
> +            return;
> +        }
> +        out_string(c, "CLIENT_ERROR bogus command");
> +        return;
> +    }
> +    
>      out_string(c, "ERROR");
>      return;
>  }
> @@ -931,6 +1049,8 @@
>  int try_read_network(conn *c) {
>      int gotdata = 0;
>      int res;
> +    char *cp;
> +    int len;
>  
>      if (c->rcurr != c->rbuf) {
>          if (c->rbytes != 0) /* otherwise there's nothing to copy */
> @@ -952,11 +1072,30 @@
>              c->rcurr  = c->rbuf = new_rbuf;
>              c->rsize *= 2;
>          }
> -        res = read(c->sfd, c->rbuf + c->rbytes, c->rsize - c->rbytes);
> +
> +        cp = c->rbuf + c->rbytes;
> +        len = c->rsize - c->rbytes;
> +        res = read(c->sfd, cp, len);
> +
>          if (res > 0) {
>              stats.bytes_read += res;
>              gotdata = 1;
>              c->rbytes += res;
> +
> +            /* check if a complete command line has been read */
> +            cp = memchr(cp, '\n', res);
> +            if( cp && cp - c->rcurr > 1 && *(cp - 1) =='\r' ) {
> +                cp++;
> +                *(cp - 2) = '\0'; /* replace \r */
> +
> +                process_command(c, c->rcurr);
> +
> +                c->rbytes -= (cp - c->rcurr);
> +                c->rcurr = cp;
> +
> +                break;
> +            }
> +
>              continue;
>          }
>          if (res == 0) {
> @@ -992,7 +1131,19 @@
>      int res;
>  
>      while (!exit) {
> -        /* printf("state %d\n", c->state);*/
> +        # if 0
> +        static char *statenames[]={
> +            "listening",  /* the socket which listens for connections */
> +            "read",       /* reading in a command line */
> +            "write",      /* writing out a simple response */
> +            "nread",      /* reading in a fixed number of bytes */
> +            "swallow",    /* swallowing unnecessary bytes w/o storing */
> +            "closing",    /* closing this connection */
> +            "mwrite"      /* writing out many items sequentially */
> +        };
> +        if (settings.verbose > 1)
> +            fprintf(stderr, "(%d) state %s, buf='%.30s...'\n", c->sfd, statenames[c->state], c->rbuf);
> +        # endif
>          switch(c->state) {
>          case conn_listening:
>              addrlen = sizeof(addr);
> @@ -1022,9 +1173,11 @@
>              break;
>  
>          case conn_read:
> +            /* is there a complete command line left over in the buffer? */
>              if (try_read_command(c)) {
>                  continue;
>              }
> +            /* no. read it from network */
>              if (try_read_network(c)) {
>                  continue;
>              }
> @@ -1206,7 +1359,7 @@
>                      if (c->binary) {
>                          c->ibytes = it->nbytes - 2;
>                      } else {
> -                    c->ibytes = it->nbytes;
> +                        c->ibytes = it->nbytes;
>                      }
>                      c->ipart = 2;
>                      break;
> @@ -1236,7 +1389,7 @@
>                          memcpy(c->ibuf + 1 + 4 + it->nkey - 3, (char *)&value_size, sizeof(value_size));
>                          c->ibytes =      1 + 4 + it->nkey - 3 + 4;
>                      } else {
> -                    c->ibytes = sprintf(c->ibuf, "VALUE %s %u %u\r\n", ITEM_key(it), it->flags, it->nbytes - 2);
> +                        c->ibytes = sprintf(c->ibuf, "VALUE %s %u %u\r\n", ITEM_key(it), it->flags, it->nbytes - 2);
>                      }
>                      if (settings.verbose > 1)
>                          fprintf(stderr, ">%d sending key %s\n", c->sfd, ITEM_key(it));
> @@ -1255,7 +1408,7 @@
>                          c->write_and_go = conn_read;
>  
>                      } else {
> -                    out_string(c, "END");
> +                        out_string(c, "END");
>                      }
>                      break;
>                  }
> @@ -1394,6 +1547,8 @@
>          }
>          delcurr = j;
>      }
> +
> +    item_expire();
>                  
>      return;
>  }
> @@ -1407,6 +1562,8 @@
>      printf("-u <username> assume identity of <username> (only when run as root)\n");
>      printf("-m <num>      max memory to use for items in megabytes, default is 64 MB\n");
>      printf("-M            return error on memory exhausted (rather than removing items)\n");
> +    printf("-s            use malloc(3) instead of our slab allocator, implies -M\n");
> +    printf("-t <num>      default expiration interval\n");
>      printf("-c <num>      max simultaneous connections, default is 1024\n");
>      printf("-k            lock down all paged memory\n");
>      printf("-v            verbose (print errors/warnings while in event loop)\n");
> @@ -1415,6 +1572,7 @@
>      printf("-i            print memcached and libevent license\n");
>      printf("-b            run a managed instanced (mnemonic: buckets)\n");
>      printf("-P <file>     save PID in <file>, only used with -d option\n");
> +    printf("-F            emulate full cache, i.e. set will never succeed\n");
>      return;
>  }
>  
> @@ -1538,7 +1696,7 @@
>      setbuf(stderr, NULL);
>  
>      /* process arguments */
> -    while ((c = getopt(argc, argv, "bp:m:Mc:khirvdl:u:P:")) != -1) {
> +    while ((c = getopt(argc, argv, "bp:m:Mst:c:khirvdl:u:P:F")) != -1) {
>          switch (c) {
>          case 'b':
>              settings.managed = 1;
> @@ -1549,9 +1707,15 @@
>          case 'm':
>              settings.maxbytes = atoi(optarg)*1024*1024;
>              break;
> +        case 's':
> +            settings.use_malloc = 1;
> +            /* FALL THROUGH */
>          case 'M':
>              settings.evict_to_free = 0;
>              break;
> +        case 't':
> +            settings.default_exptime = atoi(optarg);
> +            break;
>          case 'c':
>              settings.maxconns = atoi(optarg);
>              break;
> @@ -1587,6 +1751,9 @@
>          case 'P':
>              pid_file = optarg;
>              break;
> +        case 'F':
> +            settings.full++;
> +            break;
>          default:
>              fprintf(stderr, "Illegal argument \"%c\"\n", c);
>              return 1;
> @@ -1688,7 +1855,7 @@
>      stats_init();
>      assoc_init();
>      conn_init();
> -    slabs_init(settings.maxbytes);
> +    slabs_init(settings.maxbytes, settings.use_malloc);
>  
>      /* managed instance? alloc and zero a bucket array */
>      if (settings.managed) {
> Index: memcached.h
> ===================================================================
> --- memcached.h	(revision 260)
> +++ memcached.h	(working copy)
> @@ -24,7 +24,7 @@
>  };
>  
>  struct settings {
> -    unsigned int maxbytes;
> +    unsigned long maxbytes;
>      int maxconns;
>      int port;
>      struct in_addr interface;
> @@ -32,6 +32,9 @@
>      int managed;          /* if 1, a tracker manages virtual buckets */
>      time_t oldest_live;   /* ignore existing items older than this */
>      int evict_to_free;
> +    int use_malloc;
> +    int default_exptime;
> +    int full;
>  };
>  
>  extern struct stats stats;
> @@ -51,7 +54,10 @@
>      unsigned short  flags;
>      int    nbytes;  /* size of data */
>      time_t time;    /* least recent access */
> -    time_t exptime; /* expire time */
> +    time_t exptime; /* expire time: =0 - never expire */
> +                    /*              >0 - absolute time */
> +                    /*              <0 - relative to least recent access */
> +    struct exp_cls *exp_cls;    /* expiring class */
>      unsigned char it_flags;     /* ITEM_* above */
>      unsigned char slabs_clsid;
>      unsigned char nkey;         /* key length, with terminating null and padding */
> @@ -59,6 +65,18 @@
>      void * end[0];
>  } item;
>  
> +struct exp_cls {
> +    union {
> +        struct {                /* only valid when used */
> +            time_t time;
> +            unsigned int items_current;
> +        } s;
> +        struct exp_cls *next;   /* only valid when free */
> +    } u;
> +    item **items;
> +    unsigned int items_total;
> +};
> +
>  #define ITEM_key(item) ((char*)&((item)->end[0]))
>  
>  /* warning: don't use these macros with a function, as it evals its arg twice */
> @@ -157,7 +175,7 @@
>  /* slabs memory allocation */
>  
>  /* Init the subsystem. The argument is the limit on no. of bytes to allocate, 0 if no limit */
> -void slabs_init(unsigned int limit);
> +void slabs_init(unsigned int limit, int use_malloc);
>  
>  /* Preallocate as many slab pages as possible (called from slabs_init)
>     on start-up, so users don't get confused out-of-memory errors when
> @@ -185,6 +203,7 @@
>     0 = fail
>     -1 = tried. busy. send again shortly. */
>  int slabs_reassign(unsigned char srcid, unsigned char dstid);
> +int slabs_incr_memlimit( unsigned long limit );
>      
>  /* event handling, network IO */
>  void event_handler(int fd, short which, void *arg);
> @@ -217,6 +236,7 @@
>  void item_init(void);
>  item *item_alloc(char *key, int flags, time_t exptime, int nbytes);
>  void item_free(item *it);
> +void item_expire(void);     /* remove expired items, called from delete_handler() every 5 sec. */
>  
>  int item_link(item *it);    /* may fail if transgresses limits */
>  void item_unlink(item *it);
> @@ -227,3 +247,4 @@
>  char *item_cachedump(unsigned int slabs_clsid, unsigned int limit, unsigned int *bytes);
>  char *item_stats_sizes(int *bytes);
>  void item_stats(char *buffer, int buflen);
> +char* item_exp_stats(int *bytes);
> Index: items.c
> ===================================================================
> --- items.c	(revision 260)
> +++ items.c	(working copy)
> @@ -17,6 +17,7 @@
>  #include <time.h>
>  #include <event.h>
>  #include <assert.h>
> +#include <stdarg.h>
>  
>  #include "memcached.h"
>  
> @@ -31,8 +32,30 @@
>  static item *tails[LARGEST_ID];
>  unsigned int sizes[LARGEST_ID];
>  
> +# define e_next u.next
> +# define e_time u.s.time
> +# define e_curr u.s.items_current
> +
> +# define INITIAL_EXPIRING_SIZE 1024
> +static struct exp_cls **expiring;
> +static int exp_total;
> +static int exp_curr;
> +static struct exp_cls *free_exp_cls;
> +
> +# if 0
> +# define D(...) fprintf(stderr, __VA_ARGS__)
> +# else
> +# define D(...)
> +# endif
> +
>  void item_init(void) {
>      int i;
> +
> +    expiring=malloc( INITIAL_EXPIRING_SIZE*sizeof(*expiring) );
> +    exp_total=INITIAL_EXPIRING_SIZE;
> +    exp_curr=0;
> +    free_exp_cls=0;
> +
>      for(i=0; i<LARGEST_ID; i++) {
>          heads[i]=0;
>          tails[i]=0;
> @@ -40,12 +63,159 @@
>      }
>  }
>  
> +/* fetch a new expiring class and insert it into expiring before insert_before */
> +static struct exp_cls*
> +new_exp_cls( time_t time, int insert_before ) {
> +    struct exp_cls *cls;
> +    assert( insert_before<=exp_curr );
>  
> +    if( free_exp_cls ) {
> +        cls=free_exp_cls;
> +        free_exp_cls=free_exp_cls->e_next;
> +    } else {
> +        cls=malloc(sizeof(*cls));
> +        if( !cls ) return 0;
> +        cls->items=malloc(1000*sizeof(item*));
> +        if( !cls->items ) {
> +            free( cls );
> +            return 0;
> +        }
> +        cls->items_total=1000;
> +    }
> +
> +    cls->e_curr=0;
> +    cls->e_time=time;
> +    if( exp_curr>=exp_total ) {
> +        struct exp_cls **new_exp=realloc( expiring,
> +                                          2*exp_total*sizeof(*expiring) );
> +        if( !new_exp ) {
> +            cls->e_next=free_exp_cls;
> +            free_exp_cls=cls;
> +            return 0;
> +        }
> +        expiring=new_exp;
> +        exp_total*=2;
> +    }
> +
> +    memmove( &expiring[insert_before+1], &expiring[insert_before],
> +             (exp_curr-insert_before)*sizeof(*expiring) );
> +    expiring[insert_before]=cls;
> +
> +    exp_curr++;
> +
> +    return cls;
> +}
> +
> +static int
> +insert_exp( item *it ) {
> +    int high, low, middle;
> +    time_t time;
> +    struct exp_cls *cls;
> +
> +    /* remove it from an expiring class it may be in */
> +    if( it->exp_cls ) {
> +        cls=it->exp_cls;
> +        int i;
> +        for( i=0; i<cls->e_curr; i++ ) {
> +            if( cls->items[i]==it ) {
> +                cls->items[i]=0;
> +                it->refcount--;
> +                break;
> +            }
> +        }
> +        it->exp_cls=0;
> +    }
> +
> +    if( it->exptime==0 ) return 1;              /* success */
> +
> +    /* find new expiring class */
> +    time=it->exptime;
> +    if( time<0 ) time=it->time-it->exptime;
> +    if( time%5 ) time+=5-time%5; /* round up to next divisible by 5 */
> +
> +    if( exp_curr==0 ) {
> +        cls=new_exp_cls( time, 0 );
> +    } else {
> +        high=exp_curr;
> +        low=0;
> +
> +        while(high-low>1) {
> +            middle=(high+low)/2;
> +            if (time < expiring[middle]->e_time) {
> +                high=middle;
> +            } else {
> +                low=middle;
> +            }
> +        }
> +        if( time == expiring[low]->e_time ) {
> +            cls=expiring[low];
> +        } else if( high < exp_curr && time == expiring[high]->e_time ) {
> +            cls=expiring[high];
> +        } else if( time <  expiring[low]->e_time ) {
> +            cls=new_exp_cls( time, low );
> +        } else {                    /* time < expiring[high]->e_time */
> +            cls=new_exp_cls( time, high );
> +        }
> +    }
> +
> +    if( !cls ) return 0;
> +
> +    /* cls now points to the right class. so, append the item to it, */
> +    /* increment item's refcount and update the exp_cls pointer */
> +    if( cls->e_curr>=cls->items_total ) {
> +        item **new_items=realloc( cls->items,
> +                                  2*cls->items_total*sizeof(*new_items) );
> +        if( !new_items ) {
> +            return 0;
> +        }
> +        cls->items=new_items;
> +        cls->items_total*=2;
> +    }
> +    cls->items[cls->e_curr++]=it;
> +
> +    it->exp_cls=cls;
> +    it->refcount++;
> +    
> +    return 1;
> +}
> +
> +void
> +item_expire( void ) {
> +    time_t now=time(0);
> +    int i, j;
> +    struct exp_cls *cls;
> +    item *it;
> +
> +    for( i=0; i<exp_curr; i++ ) {
> +        cls=expiring[i];
> +        if( cls->e_time >= now ) break;
> +
> +        expiring[i]=0;
> +        for( j=0; j<cls->e_curr; j++ ) {
> +            it=cls->items[j];
> +            if( !it ) continue; /* this happens if an item's expiry time */
> +                                /* has been adjusted */
> +            /* no need to decrement refcount. item_unlink() calls insert_exp()
> +             * to remove the item from the exp_cls. It also decrements the
> +             * refcount */
> +            item_unlink(it);
> +        }
> +
> +        cls->e_next=free_exp_cls;
> +        free_exp_cls=cls;
> +    }
> +    exp_curr-=i;
> +    memmove( expiring, &expiring[i], exp_curr*sizeof(*expiring) );
> +}
> +
>  item *item_alloc(char *key, int flags, time_t exptime, int nbytes) {
>      int ntotal, len;
>      item *it;
>      unsigned int id;
> +    time_t now=time(0);
>  
> +    if( settings.full ) return 0;
> +
>      len = strlen(key) + 1; if(len % 4) len += 4 - (len % 4);
>      ntotal = sizeof(item) + len + nbytes;
>      
> @@ -58,33 +228,50 @@
>          int tries = 50;
>          item *search;
>  
> -        /* If requested to not push old items out of cache when memory runs out,
> -         * we're out of luck at this point...
> +        /* If requested to not push old items out of cache when memory runs
> +         * out, we're out of luck at this point...
>           */
>  
>          if (!settings.evict_to_free) return 0;
>  
>          /* 
>           * try to get one off the right LRU 
> -         * don't necessariuly unlink the tail because it may be locked: refcount>0
> -         * search up from tail an item with refcount==0 and unlink it; give up after 50
> -         * tries
> +         * don't necessarily unlink the tail because it may be locked:
> +         *   refcount>0
> +         * search up from tail an item with refcount==0 and unlink it;
> +         * give up after 50 tries
>           */
>  
>          if (id > LARGEST_ID) return 0;
>          if (tails[id]==0) return 0;
>  
> -        for (search = tails[id]; tries>0 && search; tries--, search=search->prev) {
> +        for (search = tails[id];
> +             tries>0 && search;
> +             tries--, search=search->prev) {
>              if (search->refcount==0) {
>                  item_unlink(search);
>                  break;
>              }
> +            if (search->refcount==1 && search->exp_cls) {
> +                assert( search->exptime!=0 );
> +                if( search->exptime<0 && search->time-search->exptime>now) {
> +                    fprintf(stderr, "prematurely unlinking ITEM %s "
> +                            "scheduled expiry in %d sec\n", ITEM_key(search),
> +                            search->time - search->exptime - now);
> +                } else if( search->exptime>now ) {
> +                    fprintf(stderr, "prematurely unlinking ITEM %s "
> +                            "scheduled expiry in %d sec\n", ITEM_key(search),
> +                            search->exptime - now);
> +                } /* else found an already expired item */
> +                item_unlink(search);
> +                break;
> +            }
>          }
>          it = slabs_alloc(ntotal);
>          if (it==0) return 0;
>      }
>  
> -    assert(it->slabs_clsid == 0);
> +    assert(settings.use_malloc || it->slabs_clsid == 0);
>  
>      it->slabs_clsid = id;
>  
> @@ -96,7 +283,10 @@
>      it->nkey = len;
>      it->nbytes = nbytes;
>      strcpy(ITEM_key(it), key);
> -    it->exptime = exptime;
> +    it->exptime = exptime ? exptime :
> +                  settings.default_exptime>0 ? now+settings.default_exptime :
> +                  settings.default_exptime;
> +    it->exp_cls = 0;
>      it->flags = flags;
>      return it;
>  }
> @@ -161,6 +351,7 @@
>      it->it_flags |= ITEM_LINKED;
>      it->time = time(0);
>      assoc_insert(ITEM_key(it), it);
> +    insert_exp( it );
>  
>      stats.curr_bytes += ITEM_ntotal(it);
>      stats.curr_items += 1;
> @@ -178,6 +369,8 @@
>          stats.curr_items -= 1;
>          assoc_delete(ITEM_key(it));
>          item_unlink_q(it);
> +        it->exptime=0;         /* if exp_time==0 */
> +        insert_exp( it );      /* then this is a remove_exp */
>      }
>      if (it->refcount == 0) item_free(it);
>  }
> @@ -196,6 +389,7 @@
>  
>      item_unlink_q(it);
>      it->time = time(0);
> +    insert_exp( it );
>      item_link_q(it);
>  }
>  
> @@ -206,6 +400,60 @@
>      return item_link(new_it);
>  }
>  
> +static int
> +bp( char **buf, int *len, int *curr, int extend, const char *fmt, ... ) {
> +    int n;
> +    va_list ap;
> +    char *cp;
> +
> +    while(1) {
> +        va_start( ap, fmt );
> +        n=vsnprintf( *buf + *curr, *len - *curr-1, fmt, ap );
> +        va_end( ap );
> +
> +        if( n < 0 || n > *len-*curr-1 ) {
> +            if( !extend ) return n;
> +            cp=realloc( *buf, *len*2 );
> +            if( cp ) {
> +                *buf=cp;
> +                *len*=2;
> +            } else {
> +                free( *buf );
> +                *buf=0;
> +                *len=0;
> +                *curr=0;
> +                return -1;
> +            }
> +        } else {
> +            *curr+=n;
> +            return 0;
> +        }
> +    }
> +}
> +
> +static int
> +fmt_item( char **buf, int *bsz, int *curr, int extend, item *it, time_t now ) {
> +    if( it->exptime ) {
> +        /* ITEM key [SIZE b; CREATION_TIME s; EXPIRE_RELATIVE_TO_NOW s] */
> +        if( it->exptime<0 ) {
> +            return bp(buf, bsz, curr, extend,
> +                      "ITEM %s [%u b; %lu s; %ld s]\r\n",
> +                      ITEM_key(it), it->nbytes - 2, it->time,
> +                      it->time-it->exptime-now);
> +        } else {
> +            return bp(buf, bsz, curr, extend,
> +                      "ITEM %s [%u b; %lu s; %ld s]\r\n",
> +                      ITEM_key(it), it->nbytes - 2, it->time,
> +                      it->exptime-now);
> +        }
> +    } else {
> +        /* ITEM key [SIZE b; CREATION_TIME s; never] */
> +        return bp(buf, bsz, curr, extend,
> +                  "ITEM %s [%u b; %lu s; NEVER]\r\n",
> +                  ITEM_key(it), it->nbytes - 2, it->time);
> +    }
> +}
> +
>  char *item_cachedump(unsigned int slabs_clsid, unsigned int limit, unsigned int *bytes) {
>      
>      int memlimit = 2*1024*1024;
> @@ -214,7 +462,7 @@
>      item *it;
>      int len;
>      int shown = 0;
> -    char temp[512];
> +    time_t now=time(0);
>      
>      if (slabs_clsid > LARGEST_ID) return 0;
>      it = heads[slabs_clsid];
> @@ -223,11 +471,11 @@
>      if (buffer == 0) return 0;
>      bufcurr = 0;
>  
> +    memlimit -= 6;              /* save 6 bytes for "END\r\n\0" */
>      while (it && (!limit || shown < limit)) {
> -        len = sprintf(temp, "ITEM %s [%u b; %lu s]\r\n", ITEM_key(it), it->nbytes - 2, it->time);
> -        if (bufcurr + len + 6 > memlimit)  /* 6 is END\r\n\0 */
> +        len = fmt_item( &buffer, &memlimit, &bufcurr, 0, it, now );
> +        if (len < 0 || bufcurr + len > memlimit)  
>              break;
> -        strcpy(buffer + bufcurr, temp);
>          bufcurr+=len;
>          shown++;
>          it = it->next;
> @@ -262,7 +510,7 @@
>  /* dumps out a list of objects of each size, with granularity of 32 bytes */
>  char* item_stats_sizes(int *bytes) {
>      int num_buckets = 32768;   /* max 1MB object, divided into 32 bytes size buckets */
> -    unsigned int *histogram = (int*) malloc(num_buckets * sizeof(int));
> +    unsigned *histogram = (unsigned*) malloc(num_buckets * sizeof(unsigned));
>      char *buf = (char*) malloc(1024*1024*2*sizeof(char));
>      int i;
>  
> @@ -297,3 +545,45 @@
>      free(histogram);
>      return buf;
>  }
> +
> +#define P(...) do {if( bp(&buf, &buflen, &bufcurr, 1, __VA_ARGS__) ) return 0;} while(0)
> +
> +char*
> +item_exp_stats(int *bytes) {
> +    int i, j, n;
> +    char *buf = malloc( 4096 );
> +    int buflen = 4096;
> +    int bufcurr;
> +    struct exp_cls *cls;
> +    time_t now=time(0);
> +    item *it;
> +
> +    if( !buf ) return 0;
> +
> +    bufcurr=0;
> +
> +    for( i=0, cls=free_exp_cls; cls; cls=cls->e_next ) i++;
> +    P( "# of free exp structs: %d\r\n", i );
> +    P( "# of used exp structs: %d\r\n", exp_curr );
> +
> +    for( i=0; i<exp_curr; i++ ) {
> +        cls=expiring[i];
> +        for( n=0, j=0; j<cls->e_curr; j++ ) {
> +            if( cls->items[j] ) n++;
> +        }
> +        P( "expiry scheduled for %d/%d/%d items in %d seconds\r\n",
> +           n, cls->e_curr, cls->items_total, cls->e_time-now );
> +        for( j=0; j<cls->e_curr; j++ ) {
> +            it=cls->items[j];
> +            if( !it ) continue; /* this may happen if an item's expiry time */
> +                                /* has been adjusted */
> +            if( fmt_item(&buf, &buflen, &bufcurr, 1, it, now) ) return 0;
> +        }
> +    }
> +
> +    P( "END\r\n" );
> +
> +    if( bytes ) *bytes=bufcurr;
> +    return buf;
> +}
> +



More information about the memcached mailing list