Input on best practices for serving files

Thu Apr 10 05:37:54 UTC 2008

Mark Smith wrote:
> >  I've been playing with MogileFS for quite some time in testing, but
> >  I'm getting ready to actually do something with it now and have been
> >  searching for some "best practices" regarding serving the stored
> >  mogileFS files.  I've come up empty searching for this information
> >  aside from some mailing list posts, so I figure that I just need to
> >  get some input, write up my assumptions and then I'd like to post to
> >  the wiki.  If this is good enough, who do I ask for an account to
> >  update the wiki (seems I need an invite key)?  After it is up there, I
> >  hope people can just make incremental revisions to get it into better
> >  shape.
>
> I have no idea about the MogileFS wiki... we recently put something 
> for Perlbal on Google Code, might be useful to do the same thing for 
> MogileFS...  Brad?
>
> >  To begin with, it is recommended to write an "interpretation" layer
> >  that translates the end-user URL to an internal mogileFS layer.  For
> >  this example, a url like
> >  "http://static.myapp.com/users/1234567-thumb.jpg" would then be
> >  translated to a mogileFS stored key, along with the class.  To keep
> >  this simple, the key could be "1234567:thumb" and the class would be
> >  "users".
>
> Yes, typically this is in your backend webservers.  See below.
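
A minimal sketch of that translation step, assuming public URLs always
look like /<class>/<id>-<variant>.<ext> as in the example above.  The
function name url_to_key and the strict two-segment layout are
illustrative assumptions, not anything MogileFS itself defines.

```python
from urllib.parse import urlparse


def url_to_key(url):
    """Map a public URL to a (class, key) pair.

    Assumed layout (hypothetical): /<class>/<id>-<variant>.<ext>
    """
    path = urlparse(url).path
    parts = path.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError("unexpected path: %r" % path)
    mogile_class, filename = parts

    # Drop the extension; it is not part of the stored key.
    stem, _dot, _ext = filename.rpartition(".")
    if not stem:
        raise ValueError("missing extension: %r" % filename)

    # Split "1234567-thumb" into the id and the variant.
    ident, sep, variant = stem.partition("-")
    if not (sep and ident and variant):
        raise ValueError("unexpected file name: %r" % filename)

    return mogile_class, "%s:%s" % (ident, variant)


print(url_to_key("http://static.myapp.com/users/1234567-thumb.jpg"))
```

Rejecting anything that does not match the layout keeps malformed
requests from ever reaching the tracker.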
>
> >  The next step is to write some code that takes the key and class,
> >  fetches the paths from a tracker and then, in some fashion, proxies
> >  the image response.
>
> Perlbal is the recommended way of doing this, actually, as it does 
> most of the work for you.
>
> >  There are two distinct points where caching can be utilized to speed
> >  up the serving of files.  Up front, you can cache the entire URL (so
> >  that if http://static.myapp.com/users/1234567-thumb.jpg has been seen,
> >  it returns the last response, just like any other image proxy would
> >  work).  Secondarily, cache the resulting URLs returned from the
> >  tracker.  This is where memcached plays in nicely together with
> >  mogileFS.
>
> Perlbal does the second kind of caching itself, and the first kind
> can be put up front with squid or other caches, which has been done
> with success.
>
> >  I'd also like to provide some sample code for fetching the paths and
> >  reproxying, but I'm uncertain of the best way to do this and I'm
> >  hoping others can help.  If you just forward the request to the
> >  internal tracker URL, I don't quite see how to set the mime-type and
> >  other headers (perhaps this is something easy in perlbal?  I have to
> >  admit my ignorance on this.)
>
> Actually this is something done at the backend.  Perlbal doesn't know 
> what kind of file it is, neither does MogileFS.  You're expected to 
> have some sort of logic in your application that sends back the proper 
> headers for the file you want to reproxy.
>
> >  The other point that I'd like to talk about is an image manipulation
> >  layer, something like what most image hosting services do: check the
> >  incoming referrer header and, if the referrer is blank or the host
> >  is not equal to "myapp.com", affix a "stamp" on to the top of the
> >  image.  Have there been any thoughts about a mogileFS Cookbook setup?
> >  I think it would greatly help out newcomers to the product.
>
> Easily done in the traditional setup.
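
The referrer check described above could be sketched roughly as
follows.  The name needs_stamp is made up for illustration, and
treating subdomains of myapp.com as "our own" is an assumption that
goes slightly beyond the strict equality test in the original
question.  Note that the HTTP header itself is spelled "Referer".

```python
from urllib.parse import urlparse


def needs_stamp(referrer, own_domain="myapp.com"):
    """Return True when the image should get a stamp.

    referrer is the raw value of the HTTP Referer header (or None).
    """
    if not referrer:
        return True  # blank referrer
    host = urlparse(referrer).hostname
    if host is None:
        return True  # unparseable value, treat it as foreign
    host = host.lower()
    # Assumption: subdomains such as static.myapp.com count as ours.
    is_ours = host == own_domain or host.endswith("." + own_domain)
    return not is_ours


print(needs_stamp("http://myapp.com/gallery"))
print(needs_stamp("http://forum.example.org/thread/1"))
```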
>
> So, enough about that - here is how MogileFS is traditionally used.
> There are other ways of using it, but it's sort of designed around
> the idea of using Perlbal, so many of the pieces fit more neatly if
> you do.
>
>                                  /->  [ webserver ] -> [ mogilefsd ]
>                                 /                           |
> { internet } -> [ perlbal ] -> <                            /
>                                 \                          /
>                                  \->  [ mogstored ] <-----/
>
> I don't know if that will translate well.  There are graphics somewhere...
>
> Anyway.  So Perlbal answers all incoming requests.  It talks to your 
> webservers and your MogileFS storage nodes 
> (mogstored/lighttpd/whatever you use as the storage server 
> component).  Your webserver talks to the MogileFS trackers (mogilefsd 
> instances).  The trackers talk to the storage nodes.
>
> Let's walk through the request, something like:
>
> ------
> 1) GET /some/path/somefile-123.jpg arrives at Perlbal.
>
> In the very simple case, let's ignore caching.  At this point Perlbal 
> will play this request out to a backend - acting as a reverse proxy in 
> this case.  The request gets sent to a backend webserver.
>
> 2) GET /some/path/somefile-123.jpg arrives at backend webserver.
>
> This is your custom application.  Again, I'm going to simplify 
> slightly.  You get a request for that URL, you know that it's in 
> MogileFS.  So you translate it into the proper path and send the 
> request to the tracker.
>
> 3) get_paths somefile:123 arrives at MogileFS tracker
>
> The tracker does its own magic to determine where the file is.  It 
> will return some internal path.
>
> 4) Webserver now knows that the file is at 
> http://10.0.0.34:8034/dev1/0/0/0/0.fid (plus another URL or two).
>
> Your application now constructs the headers needed - presumably you 
> either store some metadata locally (the LiveJournal approach, the 
> application stores a table containing things like - file format, size, 
> etc) or you assume it from the URL the user gave you.
>
> Either way, you only need to return the Content-Type header?  There 
> might be one or two more... Perlbal will handle Content-Length and the 
> like.
>
> So you return the response to Perlbal, with an X-REPROXY-URL header
> containing the list of internal URLs.
>
> 5) Perlbal gets back the response from your webserver.
>
> This URL is now retrieved internally and Perlbal starts reproxying 
> it.  The headers returned to the user are some combination of the 
> response from your webserver and the response from the storage node.  
> (Sorry for not being very descriptive here... I don't remember exactly 
> which headers are pulled from where.)
>
> 6) User gets their file.
> ------
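
Steps 2 to 4 of the walkthrough above could look roughly like this on
the backend.  This is only a sketch: lookup_paths is a hypothetical
stub standing in for a real tracker query, the key translation is the
simplest thing that fits the example, and joining several URLs with
spaces is how I understand Perlbal expects the X-REPROXY-URL list.

```python
import mimetypes
import posixpath


def lookup_paths(key):
    """Hypothetical stub for a tracker lookup.

    A real application would ask a mogilefsd tracker for the paths of
    this key; here we return the example URL from the walkthrough.
    """
    return ["http://10.0.0.34:8034/dev1/0/0/0/0.fid"]


def app(environ, start_response):
    """WSGI backend that answers with reproxy headers, not file data."""
    path = environ.get("PATH_INFO", "")

    # Step 2: translate the URL into a key (somefile-123 -> somefile:123).
    stem, _ext = posixpath.splitext(posixpath.basename(path))
    key = stem.replace("-", ":", 1)

    # Step 3: ask the tracker where the file lives.
    urls = lookup_paths(key)
    if not urls:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]

    # Step 4: neither Perlbal nor MogileFS knows the file type, so the
    # application supplies it (guessed from the URL in this sketch).
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    start_response("200 OK", [
        ("Content-Type", content_type),
        ("X-REPROXY-URL", " ".join(urls)),
    ])
    return [b""]  # empty body; Perlbal fetches the real bytes
```

To try it locally, the app can be served with wsgiref.simple_server
from the standard library.  Without Perlbal in front you will only see
the headers, not the file.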
>
> Now, there are some obvious spots for caching.  Perlbal can do caching
> based on the URL, skipping the entire hit to your webserver.  This
> works great for resources that are public and don't need access
> control.  If you do need access control, you can cache at your own
> application layer, and skip hitting MogileFS.
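
A minimal sketch of caching at the application layer, assuming an
in-process dictionary with a time-to-live is good enough.  PathCache
and its parameters are illustrative names; memcached, mentioned
earlier in the thread, would play the same role across several
webservers.

```python
import time


class PathCache:
    """Remember tracker answers for a short time.

    lookup is any callable that maps a key to a list of paths, for
    example a function that queries a tracker.
    """

    def __init__(self, lookup, ttl=60.0, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (expires_at, paths)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the tracker
        paths = self._lookup(key)
        self._entries[key] = (now + self._ttl, paths)
        return paths


calls = []

def fake_tracker(key):
    # Stand-in for a real tracker query; records each call.
    calls.append(key)
    return ["http://10.0.0.34:8034/dev1/0/0/0/0.fid"]

cache = PathCache(fake_tracker, ttl=30.0)
cache.get("somefile:123")
cache.get("somefile:123")
print(len(calls))  # the tracker was asked once
```

Keeping the TTL short is probably wise, since a cached path can go
stale if the file is deleted or moved to another storage node.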
>
> You can also cache the file itself out front in something like squid;
> that works too.  The same access control questions apply, but you
> can get inventive about that if you want.  (.htaccess with a password
> on user directories or something?)
>
> Anyway, this is somewhat rambling.  I and others are happy to answer 
> questions, just fire away.  If you really want to work on 
> documentation, I'd be happy to buy you a beer.  It's something sorely
> lacking.
>
>
> -- 
> Mark Smith / xb95
> smitty at gmail.com
Awesome guide!
Where was this guide 6 months ago? :(

Put that online somewhere ... it would have saved me quite some time
searching and guessing.

about the walkthrough point 4)
would love to have the content-type of the file stored by mogilefsd.
otherwise i have to check the content-type of that file myself or encode 
the information about the content-type into the domain/class or the 
request url.

for example using:
domain: image
class: jpeg
-> content-type: image/jpeg

or url: /i/j/userphoto_1234
-> content-type: image/jpeg

but as mogilefs has to touch that file anyway it would be a good place 
to store the content-type (as it stores the size of the file)
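
until then, the two workarounds above could be sketched like this.
the table contents and function names are only examples i made up,
mogilefs itself knows nothing about them.

```python
DEFAULT_TYPE = "application/octet-stream"

# workaround 1: domain + class spell out the content-type
KNOWN_TYPES = {
    ("image", "jpeg"),
    ("image", "png"),
    ("image", "gif"),
}

# workaround 2: a url prefix encodes the content-type
URL_PREFIX_TYPES = {
    "/i/j/": "image/jpeg",
}


def type_from_class(domain, mogile_class):
    """domain 'image' + class 'jpeg' -> 'image/jpeg'."""
    if (domain, mogile_class) in KNOWN_TYPES:
        return "%s/%s" % (domain, mogile_class)
    return DEFAULT_TYPE


def type_from_url(path):
    """'/i/j/userphoto_1234' -> 'image/jpeg'."""
    for prefix, content_type in URL_PREFIX_TYPES.items():
        if path.startswith(prefix):
            return content_type
    return DEFAULT_TYPE


print(type_from_class("image", "jpeg"))
print(type_from_url("/i/j/userphoto_1234"))
```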