Concurrency Bug in memcached

Brian Aker brian at tangent.org
Tue Feb 5 22:34:36 UTC 2008


Hi!

On Feb 5, 2008, at 12:00 AM, Tomash Brechko wrote:

> On Mon, Feb 04, 2008 at 17:42:54 -0800, Brian Aker wrote:
>> /* time-sensitive callers can call it by hand with this, outside the
>> normal ever-1-second timer */
>> static void set_current_time(void) {
>> -    current_time = (rel_time_t) (time(0) - stats.started);
>> +    volatile rel_time_t new_time;
>> +    new_time = (rel_time_t) (time(0) - stats.started);
>> +    while (!(__sync_bool_compare_and_swap(&current_time, current_time, new_time)));
>> }
>>
>> The problem will be experienced when you are on a dual-processor
>> machine (aka not a dual core (for most vendors), but a true dual
>> processor).
>
> Could you be more specific on the essence of the problem?  I tried to
> think of several scenarios, but don't have any clue.
>
> As for your fix, you don't have to make new_time volatile: it's local
> to this function, you never update the value, so why would you want to
> re-read it from memory every time?
>
> Second, the while(! __sync_bool_compare_and_swap()) construct is very
> suspicious.  It may either become an infinite loop, if the second
> argument is read only once.  Or, since current_time is volatile, the
> chance that *&current_time and the re-read current_time would differ
> is very tiny.  But suppose we hit that case.  It would mean
> current_time is being updated concurrently.  What difference would it
> make then?  The race hasn't gone anywhere: you may still overwrite a
> future value with a past value.  And the point is that current_time
> doesn't really have to be precise.  Updating it once a second is very
> imprecise anyway.
>
>
> current_time is a volatile unsigned int, so you can be pretty sure
> that it will be updated atomically.  Even if the update is not atomic,

volatile just tells the compiler not to cache the value in a register;
it does nothing to protect the value itself. Worst case on any common
architecture will be a missed expiration on deletions, which will cause
a stressed LRU to purge a value before it should.
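
To make that concrete, here is a minimal standalone sketch (test code I
am making up, nothing from the memcached tree): two threads increment a
volatile counter, and updates are still lost, because volatile buys you
no atomicity:

#include <pthread.h>
#include <stdio.h>

static volatile unsigned int counter = 0;

static void *bump(void *arg) {
    int i;
    (void) arg;
    for (i = 0; i < 1000000; i++)
        counter = counter + 1;   /* load, add, store: three steps, not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%u\n", counter);     /* usually prints less than 2000000 on SMP */
    return 0;
}

Compile with -lpthread and run it on a multiprocessor box to watch the
lost updates.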

In other words: you have multiple processors (not cores, I believe,
though I am fuzzy on whether this is true for all architectures). One
processor will update this value while another is reading it, and the
reader can miss the update. Get enough processors on a board and this
will happen regularly.
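
The sharper version of the hazard shows up on weakly ordered
architectures. Here is a hypothetical sketch (again, not memcached
code): the writer's two stores can become visible to another processor
out of order, because volatile constrains the compiler, not the
hardware:

volatile int data  = 0;
volatile int ready = 0;

void writer(void) {      /* runs on CPU 0 */
    data  = 42;
    ready = 1;           /* may become visible to CPU 1 before data does */
}

int reader(void) {       /* runs on CPU 1 */
    while (!ready)
        ;                /* spin until the flag is seen */
    return data;         /* on a weakly ordered machine, may still be 0 */
}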

The rule of thumb with threads is that if multiple threads can see a
value that is changing, you need a lock.
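
Taken literally, that rule would give you something like this sketch
(rel_time_t and the stats stand-in mirror memcached's definitions;
clock_lock and get_current_time are names I am inventing here):

#include <pthread.h>
#include <time.h>

typedef unsigned int rel_time_t;            /* as in memcached */
static struct { time_t started; } stats;    /* stand-in for memcached's stats */

static pthread_mutex_t clock_lock = PTHREAD_MUTEX_INITIALIZER;
static rel_time_t current_time = 0;

static void set_current_time(void) {
    pthread_mutex_lock(&clock_lock);
    current_time = (rel_time_t) (time(0) - stats.started);
    pthread_mutex_unlock(&clock_lock);
}

static rel_time_t get_current_time(void) {
    rel_time_t t;
    pthread_mutex_lock(&clock_lock);
    t = current_time;
    pthread_mutex_unlock(&clock_lock);
    return t;
}

Of course, taking a mutex on every time lookup is exactly the overhead
we want to avoid on the hot path, which is why the CAS route is
tempting.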

Now as far as the fix goes, there are two options:
1) Loop until we can change the value (since only one thread runs this
function, there is no race on the update itself).
2) On failure, just don't update current_time and wait for the next
clock tick (currently every second; we should optionally allow setting
this). A sketch of this follows below.
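
Using the same rel_time_t and stats stand-ins as above, option 2 would
look roughly like this (a sketch; observed is a name I am making up):

static void set_current_time(void) {
    rel_time_t observed = current_time;
    rel_time_t new_time = (rel_time_t) (time(0) - stats.started);

    /* One attempt, no loop.  Reading the expected value into observed
     * first makes the compare explicit; if another processor raced us,
     * we simply keep the old value until the next tick a second later. */
    (void) __sync_bool_compare_and_swap(&current_time, observed, new_time);
}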

From your argument I take it that you would go with number 2? I
thought it was sloppy not to catch the failure, which is why I added
the loop, but I agree with you that it should be fine to miss a clock
tick.

Cheers,
	-Brian

--
_______________________________________________________
Brian "Krow" Aker, brian at tangent.org
Seattle, Washington
http://krow.net/                     <-- Me
http://tangent.org/                <-- Software
http://exploitseattle.com/    <-- Fun
_______________________________________________________
You can't grep a dead tree.