Suspicious behaviour of the delete process

Mon Nov 13 18:15:21 UTC 2006

This is now fixed, and LiveJournal is happily running the new code.  Our
file_to_delete table is quickly dropping from 2.2 million files, down to
zero.  (We had some devices down/readonly for a while, and file_to_delete
skyrocketed...)

On Wed, 11 Oct 2006, Andreas J. Koenig wrote:

> >>>>> On Tue, 10 Oct 2006 12:03:40 +0200, andreas.koenig.gmwojprw at franz.ak.mind.de (Andreas J. Koenig) said:
>
> Following up to my own bug report as I've gained new insights.
>
> The problem only occurs if a mogilefs node is down for a while. As
> soon as it is brought up again, the jam clears up. The reason is this:
>
> A delete job always tries to delete LIMIT files from the beginning of
> the table file_to_delete. If it cannot delete the files because the
> node holding them is unreachable, it finishes the process_deletes
> routine without having done anything. It then signals the caller that
> it has more work to do, so the caller immediately calls process_deletes
> again. The result is the high load we observed, without anything
> getting done.
>
> I don't know a solution for that. My first idea was to introduce a
> random offset for the LIMIT clause, so that we do not read from the
> beginning of the table. But this immediately put such a high load on
> the database that normal mogile operation suffered. I tried to remedy
> that by introducing more sleep periods but then the Watchdog started
> to kill the delete job and spawn a new one.
>
> In the end I saw no other quick solution than bringing up the missing
> node again and letting the normal algorithm resolve the congestion,
> which it then did quite efficiently.
>
> What do you think?
> --
> andreas
>
>
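
For illustration only, here is a minimal Python model of the stall
described in the quoted message. It is not MogileFS code (MogileFS is
written in Perl) and it is not the fix mentioned at the top of this
message, which this thread does not describe. The Row class, the
retry_after field, make_queue and process_deletes_deferring are all
hypothetical names; an in-memory list stands in for the file_to_delete
table. The naive function always reads the first LIMIT rows and reports
"more work" while the table is non-empty, so rows on an unreachable
device at the head of the table block everything behind them. The second
function sketches one possible way around that, under the assumption
that failed rows can be marked as not due until later.

```python
from dataclasses import dataclass

LIMIT = 3  # batch size, standing in for the SQL LIMIT clause


@dataclass
class Row:
    fid: int
    device: str
    retry_after: float = 0.0  # hypothetical column, not part of the real schema


def process_deletes_naive(queue, down):
    """Read the first LIMIT rows and delete those on reachable devices.

    Returns (deleted_count, more_work). more_work is True whenever rows
    remain, even if none of them could be deleted on this pass.
    """
    batch = queue[:LIMIT]
    deleted = [row for row in batch if row.device not in down]
    for row in deleted:
        queue.remove(row)
    return len(deleted), bool(queue)


def process_deletes_deferring(queue, down, now, delay=60.0):
    """Like the naive version, but skip rows that are not yet due.

    Rows on unreachable devices are pushed back by `delay` seconds, so the
    next pass reaches the rows behind them. more_work is True only while
    some row is still due, which lets the caller sleep otherwise.
    """
    batch = [row for row in queue if row.retry_after <= now][:LIMIT]
    deleted = 0
    for row in batch:
        if row.device in down:
            row.retry_after = now + delay  # defer instead of retrying at once
        else:
            queue.remove(row)
            deleted += 1
    more_work = any(row.retry_after <= now for row in queue)
    return deleted, more_work


def make_queue():
    # Three rows on the dead device at the head, two deletable rows behind.
    return [Row(1, "dev1"), Row(2, "dev1"), Row(3, "dev1"),
            Row(4, "dev2"), Row(5, "dev2")]
```

With dev1 down, every call to process_deletes_naive on this queue
deletes nothing and still asks to be called again. The deferring version
fails once on the three dev1 rows, deletes the two dev2 rows on the next
pass, and then reports no due work until the delay has elapsed.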