Replication Oddities
dormando
dormando at rydia.net
Thu May 1 06:24:33 UTC 2008
Yup, that would be it.
I assume if you had tested some of the fids in file_to_replicate you
would've found them lacking copies...
Been bitten by that myself... We can make that better. Nice catch :)
-Dormando
Brian Lynch wrote:
> Ahh, progress...
>
> I updated the configuration file on our tracker hosts to include the
> following line:
>
> old_repl_compat 0
>
> At a glance, we appear to be getting fewer timeouts for replicate, query,
> and monitor workers. The row count in file_to_replicate is going down
> (despite the continuing FSCK). We're keeping an eye on progress for
> now.
>
> mysql> select count(*) from file_to_replicate;
> +----------+
> | count(*) |
> +----------+
> | 58604833 |
> +----------+
> 1 row in set (14.81 sec)
>
> mysql> select count(*) from file_to_replicate;
> +----------+
> | count(*) |
> +----------+
> | 58604490 |
> +----------+
> 1 row in set (14.96 sec)
>
> -----Original Message-----
> From: mogilefs-bounces at lists.danga.com
> [mailto:mogilefs-bounces at lists.danga.com] On Behalf Of Brian Lynch
> Sent: Wednesday, April 30, 2008 3:36 PM
> To: dormando
> Cc: mogilefs at lists.danga.com
> Subject: RE: Replication Oddities
>
> I figured something out. Old replication compatibility is enabled for
> some reason (it is not in the configuration). This explains the large
> number of entries in file_to_replicate and the reason our database has
> been periodically hanging with the following query (SELECT fid FROM file
> WHERE dmid='1' AND classid='1' AND devcount = '1' AND length IS NOT NULL
> LIMIT 1000). Here is the configuration file on all servers:
>
> db_dsn DBI:mysql:mogilefs:hsqlmog00
> db_user mogile
> db_pass mogile
> conf_port 7001
> listener_jobs 5
>
> Here is the output from mogadm settings list:
>
> hsv4s22cen03 /usr/lib/perl5/site_perl/5.8.8/MogileFS blynch $ mogadm
> settings list
> enable_rebalance = 1
> schema_version = 9
>
>
> Looking into the code to figure out why it is the case. Note that this
> is still the latest version (2.17) from CPAN.
>
> - Brian
>
>
> -----Original Message-----
> From: mogilefs-bounces at lists.danga.com
> [mailto:mogilefs-bounces at lists.danga.com] On Behalf Of Brian Lynch
> Sent: Wednesday, April 30, 2008 12:37 AM
> To: dormando
> Cc: mogilefs at lists.danga.com
> Subject: RE: Replication Oddities
>
> Dormando,
>
> There are roughly 58 million entries in file_to_replicate of a total
> 71 million files. It seems like the Replication Worker is for some
> reason not deleting completed rows (though the code path exists). Note
> that only 570K entries in file_to_replicate have failcount > 0. Only 9
> entries have a nexttry = ENDOFTIME.
>
> mysql> select count(*) from file_to_replicate;
>
> +----------+
> | count(*) |
> +----------+
> | 58395828 |
> +----------+
> 1 row in set (2 min 6.26 sec)
>
> Best,
> Brian
>
> -----Original Message-----
> From: dormando [mailto:dormando at rydia.net]
> Sent: Monday, April 28, 2008 12:50 AM
> To: Brian Lynch
> Cc: mogilefs at lists.danga.com
> Subject: Re: Replication Oddities
>
>
>> Would it be possible to purge portions of the file_to_replicate
>> table? I'm currently pulling out known good replications to identify
>> bogus entries.
>
> You should sample rows out of file_to_replicate, see if the nexttry is
> set to 2147483647 - and that all of the paths are invalid.
>
> I've never outright removed rows from file_to_replicate, _unless_ I have
> verified that the fid is gone, ie:
>
> - Has no matching 'file' entry.
> - Has no matching 'file_on' rows (odd bug, haven't fixed yet).
> - Has file row, file_on row(s), but all paths are dead. 404's.
>
> If at least one of those conditions is met, the fid can be removed from
> file_to_replicate, and you might want to see why it disappeared to
> begin with. Otherwise, do not remove the row.
>
> If the nexttry is off in the future but not equal to ENDOFTIME
> (2147483647) you can try UPDATE'ing those rows to UNIX_TIMESTAMP() and
> see if they get chewed through. If not, you should find out exactly
> what's going on. Odds are one of the three conditions listed above has
> happened. If otherwise, you should definitely give a best effort in
> figuring out what it was.
>
> Yeah, this should be way more automatic. We'll get to it someday, and
> also accept patches ;)
>
> -Dormando
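
The triage rules in the quoted message above can be sketched roughly as follows. This is only an illustration, not MogileFS code: it uses an in-memory SQLite database with heavily reduced versions of the file, file_on and file_to_replicate tables (only the columns needed here), the all_paths_dead callback is a hypothetical stand-in for actually fetching each path and seeing 404s, and the labels "remove", "retry_now" and "leave" are made up for this example.

```python
import sqlite3

ENDOFTIME = 2147483647  # sentinel nexttry value mentioned in the thread


def build_demo_db():
    """Create a tiny in-memory stand-in for the three tables (assumed, reduced schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE file (fid INTEGER PRIMARY KEY);
        CREATE TABLE file_on (fid INTEGER, devid INTEGER);
        CREATE TABLE file_to_replicate (fid INTEGER PRIMARY KEY, nexttry INTEGER);
        INSERT INTO file VALUES (2), (3), (4), (5);
        INSERT INTO file_on VALUES (3, 10), (4, 11), (5, 11);
        INSERT INTO file_to_replicate VALUES
            (1, 0), (2, 0), (3, 2147483647), (4, 5000), (5, 2147483647);
    """)
    return conn


def triage(conn, fid, now, all_paths_dead):
    """Classify one file_to_replicate row according to the rules in the thread.

    all_paths_dead(devids) is a hypothetical callback that returns True when
    every copy of the fid is unreachable.
    """
    row = conn.execute(
        "SELECT nexttry FROM file_to_replicate WHERE fid = ?", (fid,)
    ).fetchone()
    if row is None:
        return "not_queued"
    nexttry = row[0]

    has_file = conn.execute(
        "SELECT 1 FROM file WHERE fid = ?", (fid,)
    ).fetchone() is not None
    devids = [r[0] for r in conn.execute(
        "SELECT devid FROM file_on WHERE fid = ?", (fid,))]

    # Any one of the three conditions means the fid is gone and the row can go.
    if not has_file or not devids or all_paths_dead(devids):
        return "remove"
    # Scheduled in the future, but not parked at ENDOFTIME: worth retrying now.
    if now < nexttry < ENDOFTIME:
        return "retry_now"
    # Everything else stays put and needs a closer look by hand.
    return "leave"


def retry_now(conn, fid, now):
    """Reset nexttry to the current time (the UPDATE ... UNIX_TIMESTAMP() suggestion)."""
    conn.execute(
        "UPDATE file_to_replicate SET nexttry = ? WHERE fid = ?", (now, fid))


if __name__ == "__main__":
    conn = build_demo_db()
    dead = lambda devids: all(d == 10 for d in devids)  # pretend device 10 is dead
    for fid in (1, 2, 3, 4, 5):
        print(fid, triage(conn, fid, now=1000, all_paths_dead=dead))
```

On a real install the same checks would be run against MySQL on a sample of rows, and nothing should be deleted until the fid has been verified as gone by hand.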