Input on best practices for serving files
sanados
sanados at failure.at
Thu Apr 10 05:37:54 UTC 2008
Mark Smith wrote:
> > I've been playing with MogileFS for quite some time in testing, but
> > I'm getting ready to actually do something with it now and have been
> > searching for some "best practices" regarding serving the stored
> > mogileFS files. I've come up empty searching for this information
> > aside from some mailing list posts, so I figure that I just need to
> > get some input, write up my assumptions and then I'd like to post to
> > the wiki. If this is good enough, who do I ask for an account to
> > update the wiki (seems I need an invite key)? After it is up there, I
> > hope people can just make incremental revisions to get it into better
> > shape.
>
> I have no idea about the MogileFS wiki... we recently put something
> for Perlbal on Google Code, might be useful to do the same thing for
> MogileFS... Brad?
>
> > To begin with, it is recommended to write an "interpretation" layer
> > that translates the end-user URL to an internal mogileFS layer. For
> > this example, a url like
> > "http://static.myapp.com/users/1234567-thumb.jpg" would then be
> > translated to a mogileFS stored key, along with the class. To keep
> > this simple, the key could be "1234567:thumb" and the class would be
> > "users".
>
> Yes, typically this is in your backend webservers. See below.
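A minimal sketch of that interpretation layer, assuming a URL layout of /<class>/<id>-<variant>.<ext> as in the example above. The function name and the regular expression are hypothetical, not part of MogileFS or Perlbal.

```python
import re
from urllib.parse import urlparse

# Hypothetical URL layout: /<class>/<id>-<variant>.<ext>
_PATH_RE = re.compile(
    r"^/(?P<cls>[a-z]+)/(?P<id>\d+)-(?P<variant>[a-z]+)\.(?P<ext>[a-z0-9]+)$"
)


def url_to_key(url):
    """Map a public URL to a (class, key) pair, or None if it does not match."""
    match = _PATH_RE.match(urlparse(url).path)
    if match is None:
        return None
    key = "{}:{}".format(match.group("id"), match.group("variant"))
    return match.group("cls"), key
```

With this layout, "http://static.myapp.com/users/1234567-thumb.jpg" maps to the class "users" and the key "1234567:thumb"; anything that does not fit the pattern returns None, so the backend can answer with a 404.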
>
> > The next step is to write some code that takes the key and class,
> > fetches the paths from a tracker and then, in some fashion, proxies
> > the image response.
>
> Perlbal is the recommended way of doing this, actually, as it does
> most of the work for you.
>
> > There are two distinct points where caching can be utilized to speed
> > up the serving of files. Up front, you can cache the entire URL (so
> > that if http://static.myapp.com/users/1234567-thumb.jpg has been seen,
> > it returns the last response, just like any other image proxy would
> > work). Secondarily, cache the resulting URLs returned from the
> > tracker. This is where memcached plays in nicely together with
> > mogileFS.
>
> Perlbal does the second kind of caching in itself, and the first kind
> can be put up front with squid/other types of caches and has been done
> with success.
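For setups that do the tracker-path caching in the application instead, here is a rough sketch of the second kind of cache. It uses an in-process dict as a stand-in for memcached, and get_paths is a hypothetical callable that wraps the tracker lookup; neither name comes from MogileFS itself.

```python
import time


class PathCache:
    """Cache tracker lookups for a short time (in-process stand-in for memcached)."""

    def __init__(self, get_paths, ttl=60.0, clock=time.monotonic):
        self._get_paths = get_paths  # hypothetical: key -> list of internal URLs
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (expiry time, list of URLs)

    def lookup(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        # Miss or expired entry: ask the tracker again and remember the answer.
        paths = self._get_paths(key)
        self._entries[key] = (now + self._ttl, paths)
        return paths
```

Cached paths can go stale when a file is moved or deleted, so a short TTL is probably the safer choice here.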
>
> > I'd also like to provide some sample code for fetching the paths and
> > reproxying, but I'm uncertain of the best way to do this and I'm
> > hoping others can help. If you just forward the request to the
> > internal tracker URL, I don't quite see how to set the mime-type and
> > other headers (perhaps this is something easy in perlbal? I have to
> > admit my ignorance on this.)
>
> Actually this is something done at the backend. Perlbal doesn't know
> what kind of file it is, neither does MogileFS. You're expected to
> have some sort of logic in your application that sends back the proper
> headers for the file you want to reproxy.
>
> > The other point that I'd like to talk about is an image manipulation
> > layer, something like what most image hosting services do: check the
> > incoming referrer header, and if the referrer is blank or the host
> > is not equal to "myapp.com", affix a "stamp" on to the top of the
> > image. Have there been any thoughts about a mogileFS Cookbook setup? I
> > think it would greatly help out newcomers to the product.
>
> Easily done in the traditional setup.
>
> So, enough about that - this is traditionally how MogileFS is used.
> There are other ways of using it, but it's sort of designed around
> the idea of using Perlbal, so many of the pieces fit more neatly if
> you do.
>
>                                  /-> [ webserver ] -> [ mogilefsd ]
>                                 /                           |
> { internet } -> [ perlbal ] -> <                           /
>                                 \                         /
>                                  \-> [ mogstored ] <-----/
>
> I don't know if that will translate well. There are graphics somewhere...
>
> Anyway. So Perlbal answers all incoming requests. It talks to your
> webservers and your MogileFS storage nodes
> (mogstored/lighttpd/whatever you use as the storage server
> component). Your webserver talks to the MogileFS trackers (mogilefsd
> instances). The trackers talk to the storage nodes.
>
> Let's walk through the request, something like:
>
> ------
> 1) GET /some/path/somefile-123.jpg arrives at Perlbal.
>
> In the very simple case, let's ignore caching. At this point Perlbal
> will play this request out to a backend - acting as a reverse proxy in
> this case. The request gets sent to a backend webserver.
>
> 2) GET /some/path/somefile-123.jpg arrives at backend webserver.
>
> This is your custom application. Again, I'm going to simplify
> slightly. You get a request for that URL, you know that it's in
> MogileFS. So you translate it into the proper path and send the
> request to the tracker.
>
> 3) get_paths somefile:123 arrives at MogileFS tracker
>
> The tracker does its own magic to determine where the file is. It
> will return some internal path.
>
> 4) Webserver now knows that the file is at
> http://10.0.0.34:8034/dev1/0/0/0/0.fid (plus another URL or two).
>
> Your application now constructs the headers needed - presumably you
> either store some metadata locally (the LiveJournal approach, the
> application stores a table containing things like - file format, size,
> etc) or you assume it from the URL the user gave you.
>
> Either way, you only need to return the Content-Type header? There
> might be one or two more... Perlbal will handle Content-Length and the
> like.
>
> So you return the response to Perlbal, with X-REPROXY-URL: internal
> URL list.
>
> 5) Perlbal gets back the response from your webserver.
>
> This URL is now retrieved internally and Perlbal starts reproxying
> it. The headers returned to the user are some combination of the
> response from your webserver and the response from the storage node.
> (Sorry for not being very descriptive here... I don't remember exactly
> which headers are pulled from where.)
>
> 6) User gets their file.
> ------
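Steps 2 to 4 of the walkthrough, seen from the backend webserver, could look roughly like this. It is a sketch under assumptions: get_paths is a hypothetical callable standing in for the tracker request, the key layout ("somefile-123.jpg" becomes "somefile:123") is taken from the example, the Content-Type is guessed from the URL, and the internal URLs are joined with spaces on the assumption that this is the list format Perlbal expects in X-REPROXY-URL.

```python
import mimetypes


def reproxy_response(path, get_paths):
    """Return (status, headers) telling Perlbal where to fetch the file."""
    # Step 2: translate the public URL into a MogileFS key
    # (hypothetical layout: "somefile-123.jpg" -> "somefile:123").
    filename = path.rsplit("/", 1)[-1]
    stem = filename.rsplit(".", 1)[0]
    key = stem.replace("-", ":", 1)

    # Step 3: ask the tracker for the internal URLs (hypothetical callable).
    urls = get_paths(key)
    if not urls:
        return "404 Not Found", [("Content-Type", "text/plain")]

    # Step 4: build the headers; here the type is assumed from the URL.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    headers = [
        ("Content-Type", content_type),
        ("X-REPROXY-URL", " ".join(urls)),
    ]
    return "200 OK", headers
```

The response body can stay empty, since Perlbal fetches the file from the storage node itself and sends that to the user.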
>
> Now, there are some obvious spots for caching. Perlbal can do caching
> based on the URL, skipping the entire hit to your webserver. This
> works great for resources that are public and don't need access
> control. If you do need access control, you can cache at your own
> application layer, and skip hitting MogileFS.
>
> You can also cache the file itself out front in something like squid;
> that works too. The same sort of access control questions apply, but you
> can get inventive about that if you want. (.htaccess with a password
> on user directories or something?)
>
> Anyway, this is somewhat rambling. I and others are happy to answer
> questions, just fire away. If you really want to work on
> documentation, I'd be happy to buy you a beer. It's something sorely
> sorely lacking.
>
>
> --
> Mark Smith / xb95
> smitty at gmail.com
Awesome guide!
Where was this guide 6 months ago? :(
Put that online somewhere ... would have saved me quite some time
searching and guessing.
About point 4) of the walkthrough:
I would love to have the content-type of the file stored by mogilefsd.
Otherwise I have to check the content-type of that file myself, or encode
the information about the content-type into the domain/class or the
request URL.
For example, using:
domain: image
class: jpeg
-> content-type: image/jpeg
or url: /i/j/userphoto_1234
-> content-type: image/jpeg
But as MogileFS has to touch that file anyway, it would be a good place
to store the content-type (just as it stores the size of the file).
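Until something like that exists, the two workarounds can be sketched as plain lookup tables. The table contents just mirror the two examples above, and all names are hypothetical.

```python
# Hypothetical lookup tables; the entries mirror the two examples above.
TYPE_BY_DOMAIN_CLASS = {("image", "jpeg"): "image/jpeg"}
TYPE_BY_URL_PREFIX = {"/i/j/": "image/jpeg"}
DEFAULT_TYPE = "application/octet-stream"


def type_from_class(domain, cls):
    """Derive the content-type from the MogileFS domain and class."""
    return TYPE_BY_DOMAIN_CLASS.get((domain, cls), DEFAULT_TYPE)


def type_from_url(path):
    """Derive the content-type from the prefix of the request URL."""
    for prefix, content_type in TYPE_BY_URL_PREFIX.items():
        if path.startswith(prefix):
            return content_type
    return DEFAULT_TYPE
```

Either function gives the backend a Content-Type without opening the file; anything unknown falls back to a generic binary type.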