Serving files securely to web clients via mogile

dormando dormando at rydia.net
Thu Nov 8 18:11:55 UTC 2007


> The problem I'm trying to avoid by serving the files directly from the 
> storage nodes is overloading some dedicated machines with the work of 
> "proxying" data between the storage node and the end user. Also I don't 
> want the extra traffic on my local network. Path caching can still be 
> done by whoever provides the clients with the url's to the files. I can 
> easily avoid accessing the trackers too much by caching these paths, but 
> at the point when a client wants to d/l the file accessing the storage 
> node directly seems like the most optimized solution to me.

You're prematurely optimizing here. Nodes that do the "proxying" can 
handle many hundreds (or thousands) of requests per second, which is 
more than enough to overwhelm the hard drives of all of your storage 
nodes. When I've tried optimizing this path in the past, the best 
results came from putting the dumbest, fastest possible load balancing 
in front of _perlbal_, enabling path caching in perlbal, and having 
modules or lightweight backend nodes handle one more layer of path 
caching with memcached, roughly like the sketch below.
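Something like the following is roughly what I mean by that extra 
memcached layer. It's a minimal sketch in Python, not anything MogileFS 
ships; the tracker lookup (get_paths_from_tracker) and the cache key 
scheme are assumptions you'd replace with your own client code:

    # Sketch: cache MogileFS path lookups in memcached so most requests
    # never hit the trackers. Assumes a local memcached on port 11211.
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))
    CACHE_TTL = 60  # seconds; keep it short so rebalanced files don't stay stale

    def get_paths_from_tracker(key):
        # Placeholder for a real tracker lookup (a MogileFS client
        # library's get_paths call); should return storage-node URLs.
        raise NotImplementedError

    def get_paths(key):
        cached = cache.get("paths:" + key)
        if cached:
            return cached.decode().split("\n")
        paths = get_paths_from_tracker(key)
        if paths:
            cache.set("paths:" + key, "\n".join(paths).encode(), expire=CACHE_TTL)
        return paths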

It will be a big loss if your backend is no longer tolerant of mogilefs 
reorganizing files on its own. As soon as you give a path to a client 
you *must* support it until they no longer have it cached. It would be a 
disservice to clients if their HEAD requests all started 302'ing, or 
worse, not working at all. The effort it takes perlbal to handle 
spoonfeeding will be less than the overhead of any reliability and 
security scheme you'd otherwise need to support. I'll say this hoping to 
be happily surprised if proved wrong ;)
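To be concrete about what spoonfeeding buys you: the backend only hands 
perlbal the internal storage-node URLs and perlbal streams the bytes to 
the client. A minimal WSGI-style sketch, assuming a perlbal reverse 
proxy with reproxying enabled sits in front and get_paths() is the 
cached lookup from the sketch above:

    # Sketch: the backend never serves file bytes itself. It returns the
    # internal storage-node URLs in X-REPROXY-URL and perlbal fetches and
    # streams (spoonfeeds) the file to the client, so clients never talk
    # to the storage nodes directly.
    def app(environ, start_response):
        key = environ["PATH_INFO"].lstrip("/")
        paths = get_paths(key)  # cached lookup from the earlier sketch
        if not paths:
            start_response("404 Not Found", [("Content-Length", "0")])
            return [b""]
        # Perlbal tries the space-separated URLs in order until one answers.
        start_response("200 OK", [("X-REPROXY-URL", " ".join(paths)),
                                  ("Content-Length", "0")])
        return [b""]

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("127.0.0.1", 8000, app).serve_forever()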

> Alternatively I can do everything from the already existing web server 
> on each storage node by hacking my way through MogileFS sources to add 
> some file name security options when it configures the storage node's 
> web server. If this seems logical and a better solution than two web 
> servers running on each storage node, then I might actually do it 
> (assuming some support from "the experts" will be available).

If you can prove it has real tangible results, it's an open project and 
you can do whatever you want with it :)
I'd really rather never do that; why would I give clients the ability to 
DoS my storage nodes directly?

> Finally I'd like to ask if there's any preference as to what web server 
> to run on the storage nodes (lighttpd/apache/perlbal), and what was the 
> original intention behind this flexibility?

perlbal's okay for most loads now. Preferably you let perlbal handle the 
writes, and the GETs are probably best served by something multithreaded 
like apache. I've heard from numerous people that pseudo-threaded AIO 
doesn't hold up as well as expected under high IO load, though I have 
nothing empirical myself.

The motivation is freedom of choice. If something comes along that can 
store/serve the files faster, or fit better into someone's setup, they 
should be able to use it.

-Dormando

