Few queries on atomicity of requests

Ryan LeCompte lecompte at gmail.com
Fri Jun 20 18:44:04 UTC 2008


Microsoft's answer to memcached:

http://www.25hoursaday.com/weblog/2008/06/06/VelocityADistributedInMemoryCacheFromMicrosoft.aspx

Ryan


On Fri, Jun 20, 2008 at 2:42 PM, Daniel <memcached at puptv.com> wrote:
> Hi
>
> Thanks Dean and Josef for your responses...
>
> Josef, what's the name of the Microsoft caching software? I'm not
> familiar with it.
>
> I know what you are saying is true to a degree, but I think it is worth
> doing, so that every application that uses a database could gain the
> benefits of memcached without requiring changes to the application
> code.
>
> Wouldn't it be nice to get the speed boost of caching in all parts of
> your application without needing to complicate your code with memcached
> requests AND database requests?
>
> I'm not aware of any open source database that is set up with a memory
> caching system that can be as large, as fast, or as distributed as
> memcached... It truly is a brilliant solution as is.
>
>> Integrating memcached into a database server API wouldn't be hard but
>> I'm not sure it wouldn't cause a lot more problems than writing a
>> database caching system from scratch. What you're talking about would
>> require a great deal of work on the server's part to track which
>> memcached server the data is stored on, to make sure that data is
>> replaced when an update to the database is made, etc.
>>
>
> Why would it take so much work on the server's part to track which
> memcached server the data is stored on?  Could not the core database
> just use a hash? In fact, couldn't the hashing space be dynamically
> controlled by the core database, so hash ranges could be moved from
> one caching database daemon (CDD) to another?
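>
> Here's a rough Python sketch of the mapping I have in mind (the server
> names and key format are made up purely for illustration):
>
>   import hashlib
>
>   CDD_SERVERS = ["cdd1:11211", "cdd2:11211", "cdd3:11211"]
>
>   def row_key(table, pk):
>       # One cache entry per database row: table name + primary key.
>       return "%s:%s" % (table, pk)
>
>   def cdd_for(key):
>       # The core database could own this mapping and rebalance it
>       # by handing hash ranges from one CDD to another.
>       h = int(hashlib.md5(key.encode()).hexdigest(), 16)
>       return CDD_SERVERS[h % len(CDD_SERVERS)]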
>
> Of course, this solution should include all of the current memcached
> api, to support the current users, and to allow fast caching/sharing of
> application data that doesn't need to be backed by a database.
>
>> From what I understand of what you're asking, you basically want the
>> database to be able to cache the result of a given query with a given
>> set of parameters, so that if the query is made a second time with the
>> exact same parameters it can just "look it up" in its cache and return
>> the results directly.
>
> No, that's the dumb way of caching. Surprisingly, even that way of
> caching can provide incredible performance gains at times, but I'm going
> to describe what I believe to be a much smarter way.
>
> In essence, every row from every table is cached as its own element.
> The CDD has enough database smarts to process the SQL, joins, etc.
> across tables and indexes. The difference is that rather than going
> through the core database for every read, far more data that is known
> to be good will be available in the distributed cache.
>
>
>> Then, the database would have some mechanism so that it would "know"
>> when those cached queries are made invalid by *other* queries and
>> automatically evict the invalidated results from the cache.
>
> There's the beauty and the challenge...  By having the core database
> communicate back to the cache when an update occurs, the cached data
> stays current.
>
>> If it were really that simple , believe me, they'd all be doing it.
>> That'd kill your TPC-C scores!
>>
> By "kill", I assume you mean show a factor of 10 improvement or more???
> I don't know, so I'll detail my ideas for an implementation.
>
> For discussion, let's only talk about the database type of queries. For
> the record, I believe the caching database daemon (CDD) should
> essentially implement the current memcached API, if for no other reason
> than that is how I see the different CDDs communicating amongst
> themselves without bothering the core database.
>
> The amount of memcached support implemented in the CDD is completely
> variable. Although I see it evolving to the point where the cached data
> includes a timestamp in order to work smoothly with transactions, it
> could provide a significant database performance boost by just
> pre-processing the SQL requests and caching non-transaction-related
> data.
>
> Let's look at how I see simple interactions working:
>
> On the database side, imagine a key/value pair is hashed and distributed
> for every database row.
>
> Suppose an SQL query requests the most recent data from a single row,
> not within a transaction. The CDD decodes the SQL, determines the
> "hash" for that row, and returns the data from a memcached get. If the
> get fails, the CDD controlling that hash requests the row from the
> core database, so there are no issues with duplicate requests or race
> conditions.
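>
> As a sketch, that read path might look like this in Python ("mc"
> stands in for a memcached client, core_db_fetch() is a hypothetical
> call into the core database, and row_key() is from the earlier
> sketch):
>
>   def read_row(mc, table, pk):
>       key = row_key(table, pk)
>       row = mc.get(key)
>       if row is None:
>           # Miss: the CDD that owns this hash loads from the core
>           # database, so concurrent requests for the same row are
>           # funneled through one place; no duplicate loads or races.
>           row = core_db_fetch(table, pk)
>           mc.set(key, row)
>       return row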
>
> In the case of an SQL query involving multiple rows from one or more
> tables, the CDD gets the list of all rows from the table (or from an
> index, if appropriate) and processes each row as above.
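>
> Sketched the same way, a multi-row query could fan out as one
> multi-get and fill the misses from the core database (again, the
> helper names are hypothetical):
>
>   def read_rows(mc, table, pks):
>       keys = [row_key(table, pk) for pk in pks]
>       found = mc.get_multi(keys)   # one round trip for the hits
>       results = []
>       for pk, key in zip(pks, keys):
>           row = found.get(key)
>           if row is None:
>               row = core_db_fetch(table, pk)
>               mc.set(key, row)
>           results.append(row)
>       return results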
>
> Simple row updates not involved in a transaction are again passed
> through the appropriate CDD, to avoid race conditions.
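>
> Sketched, the write path through the owning CDD might be as simple as
> this (core_db_update() is another hypothetical core-database call):
>
>   def write_row(mc, table, pk, new_row):
>       # The owning CDD applies the update to the core database and
>       # the cache in one place, so readers never see them disagree.
>       core_db_update(table, pk, new_row)
>       mc.set(row_key(table, pk), new_row)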
>
> Now for some magic...
>
> During a transaction things can get much more complicated, but we can
> start by handling the simple cases. In the simplest transaction, when
> it is committed, the core database sends each CDD a list of its rows
> that were modified, on a priority channel. The priority channel is used
> so the CDD will expire the affected rows, and optionally add the new
> rows, before handling any more normal requests.
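>
> In Python terms, each CDD's commit handler might look something like
> this (the message format is invented for illustration):
>
>   def on_commit_message(mc, modified_rows):
>       # modified_rows: list of (table, pk, new_row); new_row is None
>       # for deletes, or the fresh row if the core database pushes it.
>       for table, pk, new_row in modified_rows:
>           key = row_key(table, pk)
>           if new_row is None:
>               mc.delete(key)        # expire the stale entry
>           else:
>               mc.set(key, new_row)  # optionally add the new row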
>
> Within a transaction, row data requests can include a timestamp or
> revision index in an attempt to get the data that was current at an
> earlier point in time. I believe this will then allow the caching
> system to duplicate the functionality of an MVCC database.
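>
> One possible key layout for that, sketched in Python (entirely
> hypothetical; finding the right revision for a given snapshot is the
> hard part, and core_db_fetch_asof() stands in for the core database's
> own MVCC machinery):
>
>   def versioned_key(table, pk, ts):
>       # Older revisions live under timestamped keys; the latest row
>       # stays under the plain table:pk key.
>       return "%s:%s@%d" % (table, pk, ts)
>
>   def read_row_asof(mc, table, pk, snapshot_ts):
>       row = mc.get(versioned_key(table, pk, snapshot_ts))
>       if row is None:
>           row = core_db_fetch_asof(table, pk, snapshot_ts)
>       return row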
>
> Transactional updates will be passed to the core database. The core
> database will be modified so that it, too, can take advantage of the
> cache: instead of going to the disk drive to fetch a row, it may
> request the data from the appropriate CDD.
>
> And now, when things go wrong...
>
> In my understanding of this, things become the most complicated when
> communication between the nodes fails. My current best idea involves
> heartbeat-based sharing of which nodes are disabled. The goal is that
> when any node cannot talk to another, it marks that node as disabled
> and tells every other node about the problem on a priority channel.
> The calling node then falls back on the core database to handle all
> requests that would have gone to that node.
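>
> A sketch of that failure path in Python (the broadcast call and the
> core-database fallback are made-up names, and I'm assuming a client
> that raises on connection failure):
>
>   DISABLED = set()
>
>   def get_with_failover(client_for, key):
>       server = cdd_for(key)    # cdd_for() from the earlier sketch
>       if server in DISABLED:
>           return core_db_fetch_by_key(key)   # node is known bad
>       try:
>           return client_for(server).get(key)
>       except ConnectionError:
>           DISABLED.add(server)
>           broadcast_disabled(server)   # priority-channel message
>           return core_db_fetch_by_key(key)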
>
> When the connection is restored, that node gets updated or reset by
> the core database before restarting.
>
> So, in conclusion, the end goal of this is to provide memcached-style
> caching to the database in such a way that the data it returns is
> always accurate. I'm not saying this would be easy, but it does seem
> to be well worth the effort.
>
> Thanks
>
> Daniel

