problem with concurrent changes to a file

Ed W lists at wildgooses.com
Wed Nov 4 11:30:00 UTC 2009


Kostas Chatzikokolakis wrote:
>> 2) Implement the hash calculation in the IO access functions which
>> supply the source chunk data. In this way the serialisation of read
>> order is implicit and enforced by your IO layer?  This also naturally
>> deals with certain kinds of changing data correctly (ie tail append).
>>     
>
> The problem is that with gpg enabled, 5 processes run in parallel with
> different chunks, so they seek at the beginning of the corresponding
> chunk and start reading from there. It's not a serial read.
>   

Does GPG access the disk itself, or is it fed via a pipe?

You could arrange for a single "reader" to handle all the seeks/reads. 
Obviously, when something requests a random-access read, you need to 
read all the data in from the beginning of the file while computing the 
rolling hash function; the earlier data then either needs to be held in 
memory or spooled to temp storage for later use. Neither is ideal.
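A single-thread reader along those lines might look like the following sketch (Python; the class and method names are hypothetical, and it spools everything it has read to a temp file so earlier chunks can still be handed out):

```python
import hashlib
import tempfile
import threading

class SerialReader:
    """Hypothetical sketch: serve random-access chunk reads from one
    sequential pass over the source file.  Bytes are hashed in order as
    they stream past and spooled to temp storage, so chunks the pass
    has already gone by can still be handed out later."""

    def __init__(self, path):
        self.src = open(path, "rb")
        self.spool = tempfile.TemporaryFile()  # earlier data, kept for re-reads
        self.digest = hashlib.sha1()           # whole-file digest, fed in order
        self.pos = 0                           # how far the sequential pass has gone
        self.lock = threading.Lock()           # serialise the parallel consumers

    def _advance_to(self, offset):
        # Read sequentially up to `offset`, hashing and spooling on the way.
        self.spool.seek(self.pos)              # restore write position after reads
        while self.pos < offset:
            block = self.src.read(min(65536, offset - self.pos))
            if not block:
                break                          # hit end of file
            self.digest.update(block)
            self.spool.write(block)
            self.pos += len(block)

    def read_chunk(self, offset, length):
        # Random-access request from a consumer; the source file itself
        # is still only ever read front to back.
        with self.lock:
            self._advance_to(offset + length)
            self.spool.seek(offset)
            return self.spool.read(length)
```

Because the source is only read front to back, `digest` reflects one consistent pass even when consumers request chunks out of order; the obvious cost is that the spool can grow to the full size of the file.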


However, I don't really see the requirement for such an algorithm (at 
least from a position standing on the sidelines).  A digest of each 
chunk has to be assumed to be a valid way to verify the data of that 
chunk, so it should be possible to verify a file by breaking it back 
into the original chunks and checking that each chunk matches the 
source digest for that chunk.  As you point out, a digest of a bunch of 
digests should also be acceptable (although the cryptographers no doubt 
have problems with this), though of course it is a much less convenient 
number to work with, because computing it for an arbitrary file is a 
bit complicated and requires figuring out how the file was backed up 
(chunk sizes, etc.) in order to compute the digest.
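Verification by re-chunking might be sketched like this (Python; the fixed 1 MB chunk size is an assumption for illustration, the real chunk boundaries are whatever the backup used):

```python
import hashlib

CHUNK_SIZE = 1 << 20  # assumed fixed chunk size; the backup's policy may differ

def chunk_digests(path, chunk_size=CHUNK_SIZE):
    """Per-chunk SHA-1 digests, in file order."""
    out = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            out.append(hashlib.sha1(chunk).hexdigest())
    return out

def digest_of_digests(digests):
    """Combine the per-chunk digests into one file-level value.  Only
    comparable when the chunk boundaries match the original backup."""
    return hashlib.sha1("".join(digests).encode()).hexdigest()
```

To verify a file you would recompute `chunk_digests` and compare against the digests stored with the backup, chunk by chunk; the combined value is only meaningful if you also know the chunking parameters.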


I think some sort of digest algorithm which can be computed in parallel 
would be the ideal option. I'm not sure anything suitable for our 
purposes actually exists, though? A quick Google search didn't turn up 
anything better than plain old CRC32.
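Short of a true parallelisable digest, you can get most of the benefit by hashing fixed-size chunks in parallel and then hashing the concatenated per-chunk digests in file order. A sketch (Python; note the result is a chunked digest of my own construction, not the SHA-1 of the file, and hashlib releases the GIL on large buffers so threads do overlap):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def parallel_file_digest(path, chunk_size=1 << 20, workers=4):
    """Hash fixed-size chunks in parallel, then hash the per-chunk
    digests in file order.  Deterministic for a given chunk size,
    but NOT equal to a plain SHA-1 of the whole file."""
    with open(path, "rb") as f:
        # Reads happen serially in this thread; only the hashing is parallel.
        chunks = iter(lambda: f.read(chunk_size), b"")
        with ThreadPoolExecutor(max_workers=workers) as pool:
            digests = list(pool.map(lambda c: hashlib.sha1(c).digest(), chunks))
    return hashlib.sha1(b"".join(digests)).hexdigest()
```

This is just the digest-of-digests idea again, with the per-chunk work farmed out; anyone verifying the value needs to know the chunk size used.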

Failing that, I don't see how you can do any better than "snapshotting" 
the source file: either by implementing a single-threaded reader which 
perhaps spools temp data to disk, by simply copying the file somewhere 
before you operate on it, or by making use of filesystem features to 
snapshot the file (XFS, etc.)?
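The copy-before-read option is at least trivial to get right. A sketch (Python; the context-manager name is made up, and the obvious cost is one full extra copy of the file, which filesystem-level snapshots would avoid):

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def snapshotted(path):
    """Copy the source file aside before reading, so parallel readers
    all see one consistent version even if the original keeps changing.
    Yields the path of the temporary copy and deletes it afterwards."""
    fd, tmp = tempfile.mkstemp()
    os.close(fd)
    try:
        shutil.copy2(path, tmp)  # copy contents and metadata
        yield tmp
    finally:
        os.unlink(tmp)
```

Usage would be something like `with snapshotted(src) as snap:` and then pointing all the chunk readers (and GPG) at `snap` instead of the live file.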

Ed W




More information about the brackup mailing list