Replication Oddities

dormando dormando at rydia.net
Thu May 1 06:24:33 UTC 2008


Yup, that would be it.

I assume if you had tested some of the fids in file_to_replicate you 
would've found them lacking copies...

Been bitten by that myself... We can make that better. Nice catch :)

-Dormando

Brian Lynch wrote:
> Ahh, progress... 
> 
> I updated the configuration file on our tracker hosts to include the
> following line: 
> 
> old_repl_compat 0
> 
> At a glance, we appear to be getting fewer timeouts for replicate, query,
> and monitor workers.  The row count in file_to_replicate is going down
> (despite the continuing FSCK).  We're keeping an eye on progress for
> now.  
> 
> mysql> select count(*) from file_to_replicate;
> +----------+
> | count(*) |
> +----------+
> | 58604833 |
> +----------+
> 1 row in set (14.81 sec)
> 
> mysql> select count(*) from file_to_replicate;
> +----------+
> | count(*) |
> +----------+
> | 58604490 |
> +----------+
> 1 row in set (14.96 sec)
> 
> 
> 
> 
> -----Original Message-----
> From: mogilefs-bounces at lists.danga.com
> [mailto:mogilefs-bounces at lists.danga.com] On Behalf Of Brian Lynch
> Sent: Wednesday, April 30, 2008 3:36 PM
> To: dormando
> Cc: mogilefs at lists.danga.com
> Subject: RE: Replication Oddities
> 
> I figured something out.  Old replication compatibility is enabled for
> some reason (it is not in the configuration).  This explains the large
> number of entries in file_to_replicate and the reason our database has
> been periodically hanging with the following query (SELECT fid FROM file
> WHERE dmid='1' AND classid='1' AND devcount = '1' AND length IS NOT NULL
> LIMIT 1000). Here is the configuration file on all servers: 
> 
> db_dsn DBI:mysql:mogilefs:hsqlmog00
> db_user mogile
> db_pass mogile
> conf_port 7001
> listener_jobs 5
> 
> Here is the output from mogadm settings list: 
> 
> hsv4s22cen03 /usr/lib/perl5/site_perl/5.8.8/MogileFS blynch $ mogadm
> settings list
>          enable_rebalance = 1
>            schema_version = 9
> 
> 
> Looking into the code to figure out why it is the case.  Note that this
> is still the latest version (2.17) from CPAN. 
> 
> - Brian
> 
> 
> -----Original Message-----
> From: mogilefs-bounces at lists.danga.com
> [mailto:mogilefs-bounces at lists.danga.com] On Behalf Of Brian Lynch
> Sent: Wednesday, April 30, 2008 12:37 AM
> To: dormando
> Cc: mogilefs at lists.danga.com
> Subject: RE: Replication Oddities
> 
> Dormando,
> 
>   There are roughly 58 million entries in file_to_replicate out of a
> total of 71 million files. It seems the Replication Worker is, for some
> reason, not deleting completed rows (though the code path exists).  Note
> that only 570K entries in file_to_replicate have failcount > 0. Only 9
> entries have nexttry = ENDOFTIME. 
> 
> mysql> select count(*) from file_to_replicate;
> 
> +----------+
> | count(*) |
> +----------+
> | 58395828 |
> +----------+
> 1 row in set (2 min 6.26 sec)
> 
> Best,
> Brian
> 
> -----Original Message-----
> From: dormando [mailto:dormando at rydia.net] 
> Sent: Monday, April 28, 2008 12:50 AM
> To: Brian Lynch
> Cc: mogilefs at lists.danga.com
> Subject: Re: Replication Oddities
> 
> 
> > Would it be possible to purge portions of the file_to_replicate
> > table?  I'm currently pulling out known good replications to identify
> > bogus entries. 
> 
> You should sample rows out of file_to_replicate, see if the nexttry is
> set to 2147483647 - and that all of the paths are invalid.
> 
> I've never outright removed rows from file_to_replicate, _unless_ I have
> verified that the fid is gone, i.e.:
> 
> - Has no matching 'file' entry.
> - Has no matching 'file_on' rows (odd bug, haven't fixed yet).
> - Has file row, file_on row(s), but all paths are dead. 404s.
> 
> If at least one of those conditions is met, the fid can be removed from
> file_to_replicate, and you might want to see why it disappeared to
> begin with. Otherwise, do not remove the row.
> 
> If the nexttry is off in the future but not equal to ENDOFTIME
> (2147483647), you can try UPDATE'ing those rows to UNIX_TIMESTAMP() and
> see if they get chewed through. If they don't, you should find out
> exactly what's going on. Odds are one of the three conditions listed
> above has happened. If something else happened, you should definitely
> make a best effort to figure out what it was.
> 
> Yeah, this should be way more automatic. We'll get to it someday, and
> also accept patches ;)
> 
> -Dormando
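
Editorial sketch of the checks described in the oldest message above. This
is not code from MogileFS or from anyone in the thread. It assumes only the
table and column names quoted in the thread (file, file_on,
file_to_replicate, fid, nexttry, failcount); the devid column and the
function names orphaned_fids, requeue_delayed and demo_db are made up for
illustration. It uses an in-memory SQLite database as a stand-in for the
MySQL tracker database so that it can be run anywhere. It covers the first
two conditions only; the third (all paths returning 404) needs HTTP checks
against the storage nodes and is not attempted here.

```python
import sqlite3

ENDOFTIME = 2147483647  # sentinel nexttry value quoted in the thread


def demo_db():
    """Build a tiny in-memory stand-in (assumed, simplified columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE file (fid INTEGER PRIMARY KEY);
        CREATE TABLE file_on (fid INTEGER, devid INTEGER);
        CREATE TABLE file_to_replicate (
            fid INTEGER PRIMARY KEY, nexttry INTEGER, failcount INTEGER);
        INSERT INTO file VALUES (1), (2);
        INSERT INTO file_on VALUES (1, 10);
        INSERT INTO file_to_replicate VALUES
            (1, 5000, 0), (2, 2147483647, 3), (3, 100, 1);
    """)
    return conn


def orphaned_fids(conn):
    """Return queued fids with no 'file' row or no 'file_on' rows."""
    rows = conn.execute("""
        SELECT r.fid,
               f.fid IS NULL AS no_file_row,
               NOT EXISTS (SELECT 1 FROM file_on o
                           WHERE o.fid = r.fid) AS no_file_on
        FROM file_to_replicate r
        LEFT JOIN file f ON f.fid = r.fid
        ORDER BY r.fid
    """).fetchall()
    return [fid for fid, no_file, no_on in rows if no_file or no_on]


def requeue_delayed(conn, now):
    """Set nexttry to `now` for rows scheduled in the future, leaving
    ENDOFTIME rows alone. Returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE file_to_replicate SET nexttry = ? "
        "WHERE nexttry > ? AND nexttry <> ?",
        (now, now, ENDOFTIME),
    )
    conn.commit()
    return cur.rowcount
```

Against a real tracker database the same SELECT and UPDATE would be run
through a MySQL client, with UNIX_TIMESTAMP() in place of the `now`
parameter. The list returned by orphaned_fids is only a set of candidates
to investigate, in line with the advice above not to remove rows until the
fid has been verified gone.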


