have a hard time injecting when doing replication or deleting

Thu Jun 26 17:52:12 UTC 2008

I don't have any obvious answers...

Feels like I'd have to go look, to be honest. Something's not being 
monitored... something's running out of FDs, etc.

One tracker replicating can affect other trackers, because:

1) The tracker queries are a little DB intensive. Is your DB hitting max 
conns or any other weirdness?

2) One replicator will hit all storage nodes. Since you're pairing 
storage nodes with trackers, those are going to get hit regardless.

Can you monitor reading/writing on your storage nodes? Or watch mogadm 
more closely to see if they wander from the 'writable' status while you 
have many replicators running?

Do you have a huge replication queue? It's weird that starting 10 
replicators would cause everything to die all of a sudden, unless you 
have a pretty massive backlog of crap.

Do you have the same issue if you run one replicator and one delete per 
tracker?

Is lighttpd misconfigured anywhere? Running out of FDs, or failing 
reads during this time? Are you running 2.17 from tarballs, or SVN 
trunk? If it's the tarball, please try SVN trunk; we've fixed quite a 
bit of crap since then.
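
For the FD question, here is a rough sketch of one way to check it on a Linux box by reading /proc. The helper names (fd_usage, near_fd_limit) and the 80% threshold are made up for illustration, not part of MogileFS or lighttpd, and you'd need to run it as root or as the process owner to read another process's fd directory.

```python
import os


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a Linux process, read from /proc.

    soft_limit is None when the limit is reported as "unlimited".
    """
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))

    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Fields: "Max" "open" "files" <soft> <hard> <units>
                soft = line.split()[3]
                soft_limit = None if soft == "unlimited" else int(soft)
                break
    return open_fds, soft_limit


def near_fd_limit(open_fds, soft_limit, threshold=0.8):
    """True when open_fds has reached threshold * soft_limit (arbitrary 80% default)."""
    if soft_limit is None:
        return False
    return open_fds >= threshold * soft_limit
```

Calling fd_usage() with the PID of lighttpd, mogstored, or a tracker process while the replicators are running should show whether any of them is creeping up on its limit.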

Anyone else have easy ideas to check? :\ A create_close makes a handful 
of DB queries then hits the storage node to verify file size.
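
Since create_close ends with a size check against the storage node, one thing worth timing is how long that node takes to answer for a file path while replication is running. A minimal sketch, assuming a plain HTTP HEAD against the path from the error message is a fair stand-in for whatever the tracker does internally (timed_head is a hypothetical helper, not a MogileFS API):

```python
import time
import urllib.request


def timed_head(url, timeout=5.0):
    """Issue an HTTP HEAD and return (content_length, elapsed_seconds).

    content_length is None if the server sends no Content-Length header.
    Raises urllib.error.URLError (or a socket timeout) if the node does
    not answer within `timeout` seconds.
    """
    # Empty ProxyHandler: talk to the storage node directly, ignoring any
    # http_proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, method="HEAD")

    start = time.monotonic()
    with opener.open(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    elapsed = time.monotonic() - start

    return (int(length) if length is not None else None), elapsed
```

Running this in a loop against a few devices, with and without the replicators going, would show whether the storage nodes themselves are what slows down.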

-Dormando

> I just tested it during the quiet morning period: I set N to 10 and let 
> it run for 5 minutes.
> 
> Then I tried to inject a file and got:
> 
> colo4:/home/www# perl script/inject.pl "tempf" "tempfile" t15 /tmp/aaa 1
> MogileFS::Backend: tracker socket never became readable (local2:7001) 
> when sending command: [create_close 
> domain=tempf&fid=3462585&devid=16&path=http://192.168.11.4:7500/dev16/0/003/462/0003462585.fid&size=10000&key=t15 
> ] at /usr/local/share/perl/5.8.8/MogileFS/NewHTTPFile.pm line 335
> 
> 0colo4:/home/www# perl script/inject.pl "tempf" "tempfile" t15 /tmp/aaa 1
> MogileFS::Backend: tracker socket never became readable (local4:7001) 
> when sending command: [create_close 
> domain=tempf&fid=3462595&devid=18&path=http://192.168.11.6:7500/dev18/0/003/462/0003462595.fid&size=10000&key=t15 
> ] at /usr/local/share/perl/5.8.8/MogileFS/NewHTTPFile.pm line 335
> 
> By this time, the load averages of colo2 and colo4 were about 0.5, and 
> colo3's was 1.3. None of them had started swapping yet.
> 
> The database hadn't started swapping either; its load average was about 
> 1.0, and when I tested some queries on it, the database was still fast.
> 
> There was no problem with get_paths; it succeeded 100% of the time.
> 
> 
> So what I'm trying to find out is why replication running only on colo3 
> affected injection on the other trackers while the DB did not seem to be 
> overloaded.
> 
> I also noticed that deletion has a similar symptom. And during 
> busy times, if I enable just one replication or deletion job, I usually 
> see some failures on inject.
> 
> In the past, I tried setting up more trackers on other machines with 
> settings similar to colo4 and colo2, but it didn't help.
> 
> By the way, each of these trackers runs on its own physical machine, and 
> the database is dedicated. Each tracker machine also runs mogstored for 
> talking with the tracker and a lighttpd for serving customers.
> 
> -thank you
> kem
> 
> 2008/6/25 dormando <dormando at rydia.net>:
> 
>     Exactly how many jobs of each are you running?
> 
>     Some jobs don't scale as well as others. Running more of them would
>     increase database load...
> 
>     Are you properly monitoring your database server? Does it end up in
>     swap, are you high on IO usage already?
> 
>     The mogilefs clients don't have a very forgiving timeout, so if a
>     tracker is in swap it'll be unlikely to ever finish its work. However,
>     even if IO is loaded on a machine, the trackers usually respond in
>     decent time...
> 
>     Are you running the trackers on the database? Out of CPU? Is your
>     database actually under any load?
> 
>     There are no transactions in use in mogilefs... So your theory isn't
>     likely. The amount of stuff in the queue also has no relation to
>     load. At least not in any mogilefs service I maintain... I can insert
>     20+ million fids into file_to_replicate and it'll work fine. It'll be
>     annoyed and that's a very stupid thing to do, but it won't overload
>     anything by nature of there being work to do.
> 
>     What version of mogilefs are you running? 2.17? Latest trunk?
> 
>     -Dormando
> 
>     Komtanoo Pinpimai wrote:
>      > Hello,
>      >
>      > My env is:
>      > 2 trackers running plenty of querywork,listener,monitor,reaper jobs.
>      > 1 tracker running only a few delete and replicate jobs.
>      >
>      > I guess I didn't have this problem in the old version of
>      > MogileFS (version 1??).
>      > What's happening is that when a hard drive goes bad and I mark it
>      > as dead, it creates tons of replication jobs. If I have 5
>      > simultaneous replication jobs in a tracker,
>      > it is really hard to inject one file into the system (with the
>      > tracker busy). Or if I have only 1 replication job
>      > and it's a busy time for customers, injecting a file usually fails.
>      >
>      > What's annoying is that no matter how many trackers or
>      > querywork/listener jobs I add,
>      > if there are a few simultaneous replicate jobs with tons of work in
>      > their queue,
>      > all the trackers seem to be busy all the time. I also have retry
>      > code, which tries to inject and then sleeps, up to 15 times;
>      > it still does not work very well in this situation.
>      >
>      > Since these trackers share the same database, my guess is that it
>      > must have something to do with transactions:
>      > for example, the replication or deletion jobs might create some
>      > transactions that lock some tables, preventing
>      > other jobs from injecting files. And when there are tons of
>      > replication/deletion jobs in line, the whole system
>      > turns into read-only mode.
>      >
>      > Have you ever experienced this, and how do you deal with it? Is
>      > there a way to tell mogilefs not to use transactions but still
>      > work _pretty correct_?
>      >
>      > --
>      > I'm going to stop checking email.
>      > Let's talk in my Hi5.
> 
> 
> 
> 
> -- 
> I'm going to stop checking email.
> Let's talk in my Hi5.