problem with concurrent changes to a file

Kostas Chatzikokolakis kostas at chatzi.org
Wed Nov 4 00:47:30 UTC 2009


> I don't know the code, but two ideas spring to mind:
> 
> 1) Switch to using some kind of CRC/Hash which can be computed in
> parallel. Always bad to assume otherwise, but tentatively I would
> suggest we don't need cryptographic quality hashes, the goal was to
> detect corruption...  This could include storing the hashes for each
> chunk, rather than the hash for the whole file..

Hm... good points. The full digest is used to ensure that there was no
corruption during encryption/transfer/decryption. It doesn't need to be
the same algorithm used for the inventory database keys (which have to
be absolutely collision-free).

I guess a good option is to concatenate the SHA1 digests of all chunks
and take the SHA1 of that string. This is quite strong and can be
verified during restore (or even later, if you still have the .brackup
file).
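
Roughly something like this (just a sketch, not brackup code; the chunk
size and output format are made up):

    use strict;
    use warnings;
    use Digest::SHA1 qw(sha1_hex);

    my ($path, $chunk_size) = @ARGV;
    $chunk_size ||= 1024 * 1024;        # made-up default, brackup does its own chunking

    open my $fh, '<', $path or die "open $path: $!";
    binmode $fh;

    my ($buf, @chunk_digests);
    while (read($fh, $buf, $chunk_size)) {
        # per-chunk digest, the same value that ends up in the .brackup file
        push @chunk_digests, sha1_hex($buf);
    }
    close $fh;

    # whole-file digest = sha1 of the concatenated chunk digests;
    # each chunk digest can come from a separate worker, only this
    # final step needs them in order
    my $composite = sha1_hex(join '', @chunk_digests);
    print "composite digest: $composite\n";

During restore you would recompute each chunk's sha1 as it is decrypted
and then redo the final sha1 over the list taken from the .brackup file.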

Using a CRC is also possible. It is weaker, but an advantage is that it
can be computed without the .brackup file (not sure whether that is
useful, though).
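
For reference, a sketch of the CRC variant, assuming a zlib (>= 1.2.3)
and Compress::Raw::Zlib recent enough to provide crc32_combine: chunk
CRCs computed independently can be merged into the whole-file CRC-32,
and that single value can later be checked against the restored file
with nothing but the file itself.

    use strict;
    use warnings;
    use Compress::Raw::Zlib;

    my ($path, $chunk_size) = @ARGV;
    open my $fh, '<', $path or die "open $path: $!";
    binmode $fh;

    my ($buf, $crc, $seen);
    while (my $len = read($fh, $buf, $chunk_size)) {
        # each chunk CRC could come from a parallel worker
        my $chunk_crc = Compress::Raw::Zlib::crc32($buf);
        $crc = $seen++
            ? Compress::Raw::Zlib::crc32_combine($crc, $chunk_crc, $len)
            : $chunk_crc;
    }
    $crc = 0 unless defined $crc;       # empty file
    printf "file crc32: %08x\n", $crc;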

> 2) Implement the hash calculation in the IO access functions which
> supply the source chunk data. In this way the serialisation of read
> order is implicit and enforced by your IO layer?  This also naturally
> deals with certain kinds of changing data correctly (ie tail append).

The problem is that with gpg enabled, five processes run in parallel on
different chunks: each one seeks to the beginning of its chunk and
starts reading from there, so it's not a serial read.
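
The read pattern is roughly this (a simplified sketch, not the actual
brackup code): each child opens the file itself, seeks to its own chunk
and reads only that range, so nothing forces the chunks to be read in
order or from a single consistent pass over the file.

    use strict;
    use warnings;
    use Digest::SHA1 qw(sha1_hex);

    my ($path, $chunk_size, $nchunks) = @ARGV;

    for my $i (0 .. $nchunks - 1) {
        defined(my $pid = fork()) or die "fork: $!";
        next if $pid;                   # parent keeps forking workers
        # child: read chunk $i only, independently of the others
        open my $fh, '<', $path or die "open $path: $!";
        binmode $fh;
        seek($fh, $i * $chunk_size, 0) or die "seek: $!";
        read($fh, my $buf, $chunk_size);
        printf "chunk %d: %s\n", $i, sha1_hex(defined $buf ? $buf : '');
        exit 0;
    }
    1 while wait() != -1;               # parent: reap all children

If the file changes between those reads, the chunks no longer describe
one consistent version of it.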

With the svn version that reads everything into memory, it could in theory
work as follows: read the first chunk into memory (so no changes after
that), feed it to Digest::SHA1, fork to start encrypting it, then read
the second chunk into memory, continue the SHA1, fork again, and so on.
But this changes the current logic a lot, and I see no real advantage
with respect to the options above.
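
In sketch form (hypothetical; encrypt_chunk() here is just a stand-in
for piping the chunk to the gpg child):

    use strict;
    use warnings;
    use Digest::SHA1;

    my ($path, $chunk_size) = @ARGV;
    open my $fh, '<', $path or die "open $path: $!";
    binmode $fh;

    my $sha = Digest::SHA1->new;
    my ($buf, @kids);
    while (read($fh, $buf, $chunk_size)) {
        $sha->add($buf);                # the digest covers exactly the bytes just read
        defined(my $pid = fork()) or die "fork: $!";
        if ($pid == 0) {
            encrypt_chunk($buf);        # child gets its own copy of the chunk
            exit 0;
        }
        push @kids, $pid;
        # parent immediately goes on to read the next chunk
    }
    waitpid $_, 0 for @kids;
    print "file digest: ", $sha->hexdigest, "\n";

    sub encrypt_chunk { my ($data) = @_; }   # placeholder for the gpg step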

Moreover, it becomes even harder if you don't want to read everything into memory.

> I
> think this is acceptable in the sense that snapshotting a file requires
> cooperation from the filesystem to be done right, eg LVM/XFS/ZFS
> snapshots or similar.  At least this way though the digest stored does
> actually match the digest of the data uploaded - this seems like the
> main criteria that the digest is supposed to be guaranteeing?

Totally agree.

Kostas
