Serving files securely to web clients via mogile

Thu Nov 8 15:07:21 UTC 2007

Mark Smith wrote:
>> I want to use mogile to serve files to remote clients via http. Ideally
>> the files should be accessed directly from the storage nodes by the
>> clients. Mogile seems good for this since all file access is via http,
>> and it works nicely with robust http servers (like lighty).
>>
>> Problem is that I don't want the remote clients to see the actual
>> MogileFS file path when accessing the files and I want some security not
>> letting any client access any file. So instead of providing the client
>> with something like "http://myserver.com/dev1/0/000/000/000000001.fid",
>> I want the client to access some name I generate (for example with
>> lighty's mod_secdownload).
>>
>> Question is how would I go about configuring the storage nodes to do
>> this? Is there any such built in functionality in MogileFS?
>>
>
> That's not how MogileFS is traditionally used in a web environment.
> The idea is that you run your MogileFS network internally and then
> expose to the user something else that uses the MogileFS trackers to
> translate your-names to internal-names.  The storage nodes are then
> never exposed to the end users.
>
> This setup gives you the advantage of controlling path caching in
> your application (you know best when certain items should expire, if
> ever) as well as the flexibility to properly handle fallback in the
> case of unavailable storage nodes.  It's also safer: you don't have to
> plan your storage nodes to have separate upload and download
> processes.
>
> Anyway ... apparently we don't have "best practices" setup information
> on the site or wiki... that's lame.  Well, typically MogileFS is
> combined with Perlbal as the latter does most of the heavy lifting for
> using MogileFS.  You still need your application servers to do a paths
> lookup and decide on caching policies (if you want to enable path
> caching), but you don't need to do any file serving there, Perlbal
> will do it from the storage nodes.
>
> I have a strong feeling that this is just going to be more confusing
> than it was helpful, and I know our documentation is horrendous for
> beginners, so please reply (to the list!) with questions and we'll get
> you going.  :)
>
>
The problem I'm trying to avoid by serving the files directly from the
storage nodes is overloading some dedicated machines with the work of
"proxying" data between the storage node and the end user. I also don't
want the extra traffic on my local network. Path caching can still be
done by whoever provides the clients with the URLs to the files. I can
easily avoid hitting the trackers too often by caching these paths, but
at the point when a client wants to download the file, accessing the
storage node directly seems like the most efficient solution to me.
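
To make the path-caching idea concrete, here is a minimal Python sketch of
a time-limited cache in front of a tracker lookup. Everything in it is an
assumption for illustration: PathCache, its lookup callback, and the TTL
policy are hypothetical names, and the callback stands in for whatever
MogileFS client call actually asks the trackers for a key's paths.

```python
import time
from typing import Callable, Dict, List, Tuple


class PathCache:
    """Cache tracker path lookups for a fixed time-to-live (hypothetical helper)."""

    def __init__(
        self,
        lookup: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # stand-in for a real tracker query
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get_paths(self, key: str) -> List[str]:
        """Return cached paths for key, asking the tracker only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        paths = self._lookup(key)
        self._entries[key] = (now + self._ttl, paths)
        return paths

    def invalidate(self, key: str) -> None:
        """Drop a cached entry, e.g. after a storage node turns out to be down."""
        self._entries.pop(key, None)
```

The web app that hands out URLs would call get_paths() for each request and
call invalidate() when a client reports that a node did not answer, so the
next request goes back to the trackers.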

One option is installing a second HTTP server on each storage node. It
will do whatever translation or security checks I want and then access
the local files upon request from the remote client. The remote client
will know which storage node to contact by talking to some web app that
uses the trackers to find which storage nodes have the requested file;
this app can also do path caching if required.
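
As a rough illustration of the kind of check such a second server could
perform, here is a small Python sketch of expiring, signed download URLs.
It is in the spirit of lighty's mod_secdownload but is not its actual URL
format: the HMAC-SHA256 scheme, the /get/ prefix, the SECRET value and the
function names are all assumptions made up for this example.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"change-me"  # hypothetical secret shared by the web app and the nodes


def sign_path(public_name: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Return a hex HMAC-SHA256 token binding a public name to an expiry time."""
    message = f"{public_name}:{expires_at}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def make_url(
    node_host: str,
    public_name: str,
    ttl_seconds: int,
    now: Optional[int] = None,
    secret: bytes = SECRET,
) -> str:
    """Build the URL the web app hands to the client (URL layout is made up)."""
    issued = int(time.time()) if now is None else now
    expires_at = issued + ttl_seconds
    token = sign_path(public_name, expires_at, secret)
    return f"http://{node_host}/get/{token}/{expires_at}/{public_name}"


def verify(
    token: str,
    expires_at: int,
    public_name: str,
    now: Optional[int] = None,
    secret: bytes = SECRET,
) -> bool:
    """Check on the storage node that the token matches and has not expired."""
    current = int(time.time()) if now is None else now
    if current > expires_at:
        return False
    expected = sign_path(public_name, expires_at, secret)
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

After verify() succeeds, the node-side server would still have to map the
public name to the real file on disk; that mapping is not shown here.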

Alternatively, I can do everything from the already existing web server
on each storage node by hacking my way through the MogileFS sources to
add some file name security options when it configures the storage
node's web server. If this seems logical and a better solution than
running two web servers on each storage node, then I might actually do
it (assuming some support from "the experts" is available).

Finally, I'd like to ask whether there's any preference as to which web
server to run on the storage nodes (lighttpd/apache/perlbal), and what
the original intention behind this flexibility was.