Distributing MogileFS' Database

Brad Fitzpatrick brad at danga.com
Sat Sep 23 07:39:42 UTC 2006


I agree we should treat scaling/load balancing and high-availability
differently.  They're different concepts.

Also, ignore all the small config tables that are only ~20 rows.  Assume
MySQL Cluster or something for those.  That, too, is an HA problem.

So let's focus on the huge tables (file and file_on).  The other
tables with 'fid' as primary key are the same as file_on, if we wanted to
partition those too.

As for scaling horizontally, one way is similar to the memcache vbucket
idea.  Imagine you have 16k virtual databases (or some other arbitrary
number, ideally a power-of-two).

When you start off, all 16k virtual database partitions are owned by your
1 physical database.  So the large file/file_on tables have their primary
key hashed onto those 16k partitions (fid % 16k, or crc32(fid) % 16k, ...),
and that's the location you write to.  On a read, you try the primary, then
try all the old owners of that virtual bucket.  (The partition owners live
in another small table in the database, alongside class/domain/etc.)  You
could even lazily migrate a row there in a race-free manner, populating it
into its new correct home (INSERT IGNORE INTO ...) if you had to read it
from a non-primary home.  Then you could delete it from the non-primary
home.
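To make the read path concrete, here's a minimal sketch of the vbucket lookup with lazy migration.  All the names (vbucket_for, read_row, the fetch/insert_ignore/delete_from callbacks) are hypothetical illustration, not MogileFS code; it just assumes an owner chain per vbucket, newest first:

```python
# Hypothetical sketch of the vbucket read path described above.
NUM_VBUCKETS = 16384  # 16k virtual partitions, ideally a power of two

def vbucket_for(fid):
    """Map a fid to its virtual bucket by simple modulo."""
    return fid % NUM_VBUCKETS

def read_row(fid, owners, fetch, insert_ignore, delete_from):
    """owners[vbucket] is a list of physical DBs, primary (newest) first.

    That ownership history would live in a small config table alongside
    class/domain/etc.  fetch/insert_ignore/delete_from stand in for the
    actual SQL against a given physical database.
    """
    chain = owners[vbucket_for(fid)]
    primary = chain[0]
    row = fetch(primary, fid)
    if row is not None:
        return row
    # Miss on the primary: walk the old owners of this vbucket.
    for old in chain[1:]:
        row = fetch(old, fid)
        if row is not None:
            # Lazily migrate, race-free: INSERT IGNORE into the new
            # home, then delete the copy from the old home.
            insert_ignore(primary, fid, row)
            delete_from(old, fid)
            return row
    return None
```

The INSERT IGNORE makes the migration idempotent, so two readers racing on the same row can't duplicate or clobber it.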

So basically: you're approaching 2 billion keys, you start to feel
uncomfortable for whatever reason, you add a new database, give it some
weight (say, 10%), and it magically starts spreading db traffic.

Again, this doesn't address HA issues.  But like you said, you can imagine
each partition being HA by itself via any of a dozen means.  (Though I
agree we need to make that easier on people who aren't database people...)

- Brad


On Sat, 23 Sep 2006, dormando wrote:

> Hey all,
>
> My girlfriend ditched me on a friday night, so naturally I'm bored and
> thinking about MogileFS. (It's more interesting than fixing my six year
> old perl scripts).
>
> There was very brief talk at the summit about how to make MogileFS's
> database store more distributed. I was curious if the ML could flesh out
> some ideas a bit more? The reason I'd bring it up now is during all the
> minor hacking/documentation, we can spot areas that need adjustment and
> start working towards the eventual goal.
>
> I know Brad has/had some ideas of how it should work ("As easy as
> possible" for starters), plus there are issues of redundancy in the dataset.
>
> Obviously there are some easy parts and some hard parts, but I'm an
> idiot so it's all probably easy. Distributing the device table, class
> table, and domain table is probably hard. Distributing anything related
> to files relative to a domain is easy. A domain could automatically be
> distributed among more databases by adding "subdomains" that point to
> different databases, plus a set of rules to resolve which subdomain a
> file is in. The most basic possibility: you can put whole domains on
> different databases.
>
> Data redundancy? I would vote for the database admin to figure out their
> own redundancy. All of the Oracle shops I've dealt with have
> "redundancy" through massive fiber SANs, backups through tapes and SAN
> snapshots, and maybe a hot-spare database server somewhere. Having data
> duplicated in multiple databases becomes very expensive for places with
> non-Free DB software, and is often not done at all.
>
> Folks who run MySQL or Postgres and worry about redundancy just buy a
> second server and DRBD or Replication slave it up.
>
> Of course there're also wacky ideas... SQLite DBs distributed among all
> of the mogstored nodes with one per so many files? How does key
> discovery work if you want your database to be completely distributed?
> Anything else?
>
> -Dormando
>
>

