Suspicious behaviour of the delete process

Brad Fitzpatrick brad at danga.com
Wed Oct 11 16:03:36 UTC 2006


Oh, good, this is a known issue.  Dormando was also hitting this.  We
already have a bug open to fix this, by introducing a reschedule
parameter/index to the file_to_delete table.  (or more likely a
"reschedule_jobs" table which can reschedule anything.... ?)
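
To make the reschedule idea concrete, here is a rough sketch, in Python
with an in-memory SQLite table rather than the real MogileFS Perl code and
schema. The next_try column, the host column, RETRY_DELAY and the
reachable_hosts argument are all hypothetical names made up for
illustration. The point is only that rows which cannot be deleted get
pushed into the future, so the next pass no longer reads the same stuck
rows from the head of the table.

```python
import sqlite3

# Hypothetical: seconds to wait before retrying a file on an unreachable host.
RETRY_DELAY = 300


def make_queue(rows):
    """Build an in-memory stand-in for file_to_delete.

    next_try is the hypothetical reschedule column; host is a
    simplification so the sketch does not need the rest of the schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_to_delete ("
        " fid INTEGER PRIMARY KEY,"
        " host TEXT NOT NULL,"
        " next_try INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO file_to_delete (fid, host) VALUES (?, ?)", rows
    )
    return conn


def process_deletes(conn, now, reachable_hosts, limit=2):
    """Handle up to `limit` rows that are due; return (deleted, rescheduled)."""
    due = conn.execute(
        "SELECT fid, host FROM file_to_delete"
        " WHERE next_try <= ? ORDER BY fid LIMIT ?",
        (now, limit),
    ).fetchall()

    deleted = rescheduled = 0
    for fid, host in due:
        if host in reachable_hosts:
            # Host is up: the delete succeeds and the row leaves the queue.
            conn.execute("DELETE FROM file_to_delete WHERE fid = ?", (fid,))
            deleted += 1
        else:
            # Host is down: push the row into the future instead of
            # leaving it at the head of the table.
            conn.execute(
                "UPDATE file_to_delete SET next_try = ? WHERE fid = ?",
                (now + RETRY_DELAY, fid),
            )
            rescheduled += 1
    conn.commit()
    return deleted, rescheduled
```

With files 1 and 2 on a down host and files 3 and 4 on a live one, the
first pass reschedules 1 and 2, and the second pass gets on with 3 and 4
instead of spinning on the same two rows.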


On Wed, 11 Oct 2006, Andreas J. Koenig wrote:

> >>>>> On Tue, 10 Oct 2006 12:03:40 +0200, andreas.koenig.gmwojprw at franz.ak.mind.de (Andreas J. Koenig) said:
>
> Following up to my own bug report as I've gained new insights.
>
> The problem only occurs if a mogilefs node is down for a while. As
> soon as it is brought up again, the jam clears up. The reason is this:
>
> A delete job always tries to delete LIMIT files from the beginning of
> the table file_to_delete. If it cannot delete the files because the
> node holding them is unreachable, it finishes the process_deletes
> routine without having done anything. It then signals the caller that
> it has more work to do, and the caller immediately calls process_deletes
> again. This loop is what caused the high load we observed, with nothing
> getting done.
>
> I don't know a solution for that. My first idea was to introduce a
> random offset for the LIMIT clause, so that we do not read from the
> beginning of the table. But this immediately put such a high load on
> the database that normal mogile operation suffered. I tried to remedy
> that by introducing more sleep periods but then the Watchdog started
> to kill the delete job and spawn a new one.
>
> In the end I saw no other quick solution than bringing up the missing
> node again and letting the normal algorithm resolve the congestion
> which it then did quite efficiently.
>
> What do you think?
> --
> andreas
>
>

