telling code to pull from the database when the data in a table changes

Donelow, Steve (ATS, IT) Steve.Donelow at
Tue May 22 14:24:47 UTC 2007

Nathan Nobbe,
> what i dont understand is when the data profile of a particular table
> changes, and the variable in the memcache that represents that
> table becomes stale, how can the variable be updated efficiently?
This post is rather lengthy, so I will answer your question directly
first, then provide details below; that way you don't have to read the
whole thing unless you want to.  
First, let me warn you not to use triggers for this, they will hammer
your performance.   Some database systems, like Oracle, write all
changes to log files that can be read outside of running queries against
the database. You can have an object in a separate thread in a
background process which reads the log files.   Or, you can put
operating system level events on the log files and implement event
handlers so that your cache managers are notified when the log file
changes; this is how I do it for Oracle.  The other option is to poll
the database on some period using queries.  Polling will only work if
you take steps to cut down what has to be looked at, and if the polling
query only gets the data which changed and nothing else.  For this to be
efficient you have to have a way to first get the primary keys of what
changed, then run a second query to get the changes.  The query which
gets the primary keys has to be simple.  
I describe what I think is required to write such an intelligent polling
system below.  I call this intelligent caching and cache forwarding.
Please note, that even when using cache forwarding, there still must
exist at the point where the cache touches the database a process which
is examining the database (polling) looking for changes.  The key is to
make this polling mechanism very efficient.  The cache managers have to
run in separate threads in a background process.
There are a couple of concepts that help with managing caches.  The way
I do it is I use the cache as a buffering mechanism so that application
code never goes directly to the RDBMS, it always goes to the cache.  I
have cache managers which run in a separate thread in a background
process that sync the caches with the RDBMS and with other caches.  The
cache managers add an overhead to the entire system but the end user
does not feel this overhead because the end user is never waiting on the
RDBMS, the end user always goes directly to the cache.
Building a caching system which ensures that the data is only stale
within acceptable limits is not a trivial task.   It requires modeling
the characteristics of the data at a level that no modeling tool that I
know of supports.  You then have to build processors that can move the
data around depending on the characteristics, liveness being the most
important characteristic.
The characteristics which are important are:  liveness, concurrency,
distribution, notification, and mutability.  Describing these
characteristics for each data element makes it possible to build a very
efficient polling mechanism, because the poller will not have to look at
everything in the database and will not have to move everything.
Liveness describes how stale the data can be and still be acceptable.
Liveness is expressed in milliseconds; a value of 0 means the data has
to be realtime.  By stale I mean that changes to the cache have not yet
been committed to the underlying database, or changes to the underlying
database have not yet been reflected in this cache.  I call this process
of reconciling the cache with the database synchronization.
Synchronization also describes updating one cache with the changes in
another cache.  I have seen very few data elements which had a liveness
of 0 (meaning the data had to be realtime).
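As an illustration of how a liveness budget might be enforced, here is a
minimal Python sketch; the class and its names are hypothetical, not from
memcached or any library:

```python
import time

class CacheEntry:
    """A cached value tagged with a liveness budget in milliseconds.

    liveness_ms = 0 means the data must be realtime (never served from
    cache); a larger value means the entry may be served until it is
    that many milliseconds stale.  Purely illustrative.
    """
    def __init__(self, value, liveness_ms):
        self.value = value
        self.liveness_ms = liveness_ms
        self.loaded_at = time.monotonic()

    def is_acceptable(self):
        if self.liveness_ms == 0:
            return False  # realtime data: always go to the source
        age_ms = (time.monotonic() - self.loaded_at) * 1000
        return age_ms <= self.liveness_ms

entry = CacheEntry({"balance": 100}, liveness_ms=5000)
print(entry.is_acceptable())  # True: fresh entry within its 5-second budget
```

The cache manager in the background thread would use the same check to
decide which entries are due for synchronization.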
Concurrency describes the likelihood that two or more users are updating
this specific data at the same time.  Concurrency applies at the table,
row, and column levels.  I added the column level after seeing many
cases where, for a given row, only certain columns were ever updated.
Concurrency can be further described by listing the user roles whose
members are likely to be concurrent with each other.
Let's say that you have a bank account and you are updating your home
address online. The likelihood that another user is updating your home
address is zero (well, it better be zero or the bank has a big problem).
The likelihood that a bank employee is updating your home address is
small, but not zero.  And, I don't have to worry about all bank
employees because the bank employees authorized to make this change will
be playing certain roles.   The likelihood that another person who is
authorized to use your account is updating your home address is also
small, but not zero.  Knowing not just that concurrency is likely, but
also which users are likely to be performing actions that are concurrent
with each other opens up some interesting possibilities.  If I have
marked a table as being concurrent and then I define the roles which are
concurrent, I can show the users which have these roles what all the
users in their role are doing in the cases where there are potential
concurrency conflicts (deadlocks).  I can also design my applications so
that changes made by one user can be broadcast to another user without
needing to broadcast these changes to everyone.  Back to the bank
example, I only have to worry about concurrency if certain bank
employees are online or if another user authorized to use your account
is online.
Distribution describes the necessity of pushing the contents of one
cache to caches in other tiers in the N-tier architecture.  Distribution
also describes the tiers, what caches or persistent stores live on each
tier, and how the caches and persistent stores relate to each other.
Notification answers the question, "if something does change, when will
a concurrent user care?"  The answer is either while committing updates
to the underlying database, or while viewing the data.  I have to
use an example to explain this.  Say you are on a web page where results
of a query are being displayed.  This web page is not realtime because
at no point can you claim that what is on the web page is a 100%
reflection of what is in the database.  Now, let's say that you update
one row of data on the web page.  You will care if that same row was
updated previously by another user when committing your changes.  Given
two users A and B, if A runs a query,  then B comes along and updates
one of the rows A is looking at and commits the changes, then A updates
that same row, A will care about B's update at the time that A commits
A's changes.  In this case what I do is display a page which says
"...saving this data will overwrite changes made by B", I then show the
original data, the changes made by B, and the changes made by A and ask
A what to do; the options are to revert to the original, keep B's
changes and discard A's, or keep A's changes and discard B's.  In this
case A cared about B's updates when A was committing A's
changes.  A database level transaction does not help here because no
programmer in their right mind would begin a transaction on a query
because the user MIGHT do an update at some undetermined time in the
future.   You would be surprised how many people will say that starting
a transaction when A executes the first query is the answer to this
problem.
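The commit-time check described above is essentially optimistic
concurrency control.  A minimal sketch, assuming a hypothetical version
column on the row (Python with SQLite for brevity; the idea itself is
database-agnostic):

```python
import sqlite3

# Each row carries a version number; an update succeeds only if the
# version the user originally read is still current.  Zero rows updated
# means another user committed first, so we can show the merge page.
# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account "
             "(id INTEGER PRIMARY KEY, addr TEXT, version INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 'Old Street', 1)")

def commit_address(conn, account_id, new_addr, version_read):
    cur = conn.execute(
        "UPDATE account SET addr = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_addr, account_id, version_read))
    return cur.rowcount == 1  # False: a concurrent update won the race

# Users A and B both read version 1; B commits first.
print(commit_address(conn, 1, "B Street", 1))  # True
print(commit_address(conn, 1, "A Street", 1))  # False -> ask A what to do
```

Note that no long-lived database transaction is held between A's query
and A's commit, which is exactly the point.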
The second case, that a user cares about changes to the underlying data
while viewing it, is far more problematic.  Let's say that your job
is to ship products that sales people are selling on the phone.  A
salesperson creates an order, you ship the order.  In front of you is a
screen showing all the orders that need to be shipped.  Let's say
further that this product is very time sensitive, not shipping a placed
order is a big deal, let's say it will cost you your job.  In this case
you care if the Order table is updated at the time you are viewing the
Order data.  So, the screen you are looking at needs to be refreshed as
updates to the underlying Order table occur.  In this case the
programmer has no choice but to use a push model.  Ouch.  I have seen
this case in logistics management systems and order fulfillment systems.
But, I think this is the exception rather than the rule.
In all my years of building business applications I have encountered
data which truly had to be live very rarely.   Usually a lag time is
acceptable.  The length of the lag time depends on the business
processes that the data is used in.
Mutability describes the lifecycle of the data.  Mutability applies to
both tables and rows.  Some tables once populated never change.  Some
tables are only changed by inserts, once inserted the row never changes.
Some tables have both inserts and row updates.  Mutability describes not
just what kind of changes are made to tables and rows, but how often the
changes occur and how likely the changes are to occur.  Knowing this
information is crucial in building an intelligent caching system.
Based on having the characteristics of the underlying database's tables
fully defined it is quite easy to build an intelligent cache system
based on an efficient polling mechanism. 
I use a cache forwarding mechanism.  I always have a cache at each tier
in the N-tier architecture.  There is one cache on the same machine
where the RDBMS lives, this cache is always a copy of what is in the
database.  A cache manager runs queries in a separate thread in a
background process looking for changes.  The queries are always run
against a special timestamp table I created to find what changed, then
if there are any changes, a second query is run to actually get the
changes.  This cache manager also updates the RDBMS with changes pushed
to this cache.   
There are also caches at the app server tier and the web server tier.
For each cache,  a cache manager runs in the background which ensures
that all changes to the cache are pushed to all other caches.   
I also have a cache in the web browser (not memcached, just javascript
arrays) which is populated with data that is related to the user.  Once
a user logs in and I have the user ID, I prepopulate the cache with all
rows related to that ID.  For example, on a banking application I
worked on, once the customer logged in I used the customer's social
security number to select all rows which were related either directly or
indirectly to that social security number.  This query returned all the
customer info plus all the account info for that customer.  The cache
then fulfills queries without needing to go to the database.
Triggers are not a good idea because they will significantly degrade the
performance of your database.  You can help yourself a lot by the design
of your database.  Two rules that I always follow in designing a
database - 1) all tables have a sequential integer as a primary key, a
number that doesn't mean anything to anyone but me.  I usually call this
column tablename_id; and, 2) I always have an intersection table named
timestamp_table_row_user which has five columns - table name, primary
key, username, timestamp, and timestamp_action.  Timestamp_action can be
one of update, insert, or delete.   I also create an archive table for
this table and periodically sweep everything over a certain age into the
archive table.  
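The two-step polling against such a timestamp table might look like this
(a hypothetical sketch in Python/SQLite; the column names follow the
five listed above, with ts standing in for the timestamp column):

```python
import sqlite3

# The intersection table records who changed which row of which table,
# when, and how.  Illustrative schema only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE timestamp_table_row_user (
        table_name       TEXT    NOT NULL,
        primary_key      INTEGER NOT NULL,
        username         TEXT    NOT NULL,
        ts               REAL    NOT NULL,   -- epoch seconds
        timestamp_action TEXT    NOT NULL    -- 'insert' | 'update' | 'delete'
    )
""")
conn.execute("INSERT INTO timestamp_table_row_user "
             "VALUES ('orders', 7, 'alice', 100.0, 'insert')")
conn.execute("INSERT INTO timestamp_table_row_user "
             "VALUES ('orders', 7, 'bob', 200.0, 'update')")

def changed_keys_since(conn, table, last_poll_ts):
    """Step one, the cheap query: which primary keys changed since
    the last poll.  Step two (not shown) fetches only those rows."""
    rows = conn.execute(
        "SELECT DISTINCT primary_key FROM timestamp_table_row_user "
        "WHERE table_name = ? AND ts > ?", (table, last_poll_ts)).fetchall()
    return [r[0] for r in rows]

print(changed_keys_since(conn, "orders", 150.0))  # [7]
```

The poller only ever scans this one narrow table; the base tables are
touched a second time only when something actually changed.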
Some people prefer to add a timestamp column to existing tables.  I
don't like this idea because you need multiple timestamp columns (one
each for insert, update, and delete), and you cannot re-create a change
history because the update timestamp column gets overwritten every time.
Also, I don't like having to modify every table schema with what is
really housekeeping metadata.  Using an intersection table and custom
drivers makes it
possible to add this approach to existing databases and existing
applications without changing them.
Using the timestamp table makes it very simple to determine what has
changed in the database and more importantly how it changed.  Instead of
using triggers to modify the timestamp table I wrote my own jdbc drivers
which take any SQL statement being executed and create a second SQL
statement which makes an entry into the timestamp table to describe what
the original SQL statement did.
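A toy version of that driver-level shadow-logging idea (a hypothetical
Python sketch; a real JDBC driver would parse SQL properly rather than
use a regex, and would handle UPDATE/DELETE keys correctly):

```python
import re
import sqlite3
import time

# Wrapper that executes the caller's SQL and, for INSERT/UPDATE/DELETE,
# writes a shadow row into the timestamp table.  The regex only handles
# the simplest statements and is purely illustrative.
VERB_TABLE = re.compile(
    r"^\s*(INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+(\w+)", re.IGNORECASE)

def execute_logged(conn, sql, params=(), username="app", pk=None):
    cur = conn.execute(sql, params)
    m = VERB_TABLE.match(sql)
    if m:
        action = m.group(1).split()[0].lower()   # insert / update / delete
        table = m.group(2)
        conn.execute(
            "INSERT INTO timestamp_table_row_user VALUES (?, ?, ?, ?, ?)",
            (table, pk if pk is not None else cur.lastrowid,
             username, time.time(), action))
    return cur

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("CREATE TABLE timestamp_table_row_user "
             "(table_name TEXT, primary_key INTEGER, username TEXT, "
             "ts REAL, timestamp_action TEXT)")
execute_logged(conn, "INSERT INTO orders (item) VALUES (?)",
               ("widget",), username="alice")
print(conn.execute("SELECT table_name, timestamp_action "
                   "FROM timestamp_table_row_user").fetchall())
# [('orders', 'insert')]
```

The application code calls execute_logged instead of execute and never
knows the shadow row exists, which is what lets you retrofit this onto
existing applications.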
Using sequential integers for primary keys helps with identifying new
rows added to a table with a simpler mechanism than a timestamp.  All
you have to do is obtain max(primarykey), store it, then compare that to
the next max(primarykey).  Getting the new rows is simple, just select *
where primarykey > the first max(primarykey), or you can build a more
specific query by using the IN clause, as in SELECT * FROM tablename
WHERE primarykey IN (key1, key2, key3...).  Or you can use BETWEEN and
give the stored max(primarykey) + 1 and the latest max(primarykey).
Obtaining updated rows requires a timestamp.
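The max(primarykey) technique can be sketched as follows (hypothetical
Python/SQLite example):

```python
import sqlite3

# Detect new rows via a monotonically increasing integer primary key:
# remember the last max(pk) seen, then select only rows above it on the
# next poll.  Updated rows still need a timestamp scheme, as noted above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",)])

def poll_new_rows(conn, last_max_pk):
    """Return rows added since the previous poll plus the new high-water mark."""
    rows = conn.execute(
        "SELECT order_id, item FROM orders WHERE order_id > ? "
        "ORDER BY order_id", (last_max_pk,)).fetchall()
    new_max = rows[-1][0] if rows else last_max_pk
    return rows, new_max

rows, mark = poll_new_rows(conn, 0)       # first poll sees both rows
conn.execute("INSERT INTO orders (item) VALUES ('c')")
rows2, mark2 = poll_new_rows(conn, mark)  # second poll sees only the new row
print(rows2)  # [(3, 'c')]
```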
I hope this helps, feel free to ask any questions.
Steve Donelow
Object Works, Inc.




More information about the memcached mailing list