<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=UTF-8" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Kostas Chatzikokolakis wrote:

<blockquote cite="mid:4AF0CF22.6000608@chatzi.org" type="cite">

  <blockquote type="cite">

    <pre wrap="">2) Implement the hash calculation in the IO access functions which

supply the source chunk data. In this way the serialisation of read

order is implicit and enforced by your IO layer?  This also naturally

deals with certain kinds of changing data correctly (ie tail append).

    </pre>

  </blockquote>

  <pre wrap=""><!---->

The problem is that with gpg enabled, 5 processes run in parallel with

different chunks, so they seek at the beginning of the corresponding

chunk and start reading from there. It's not a serial read.

  </pre>

</blockquote>

<br>

Does GPG access the disk itself, or is it fed via a pipe?<br>

<br>

You could arrange for a single "reader" to handle all the seeks and
reads. The obvious cost is that when something requests a random-access
read, the reader has to read everything from the beginning of the file
up to that point while computing the rolling hash, and the earlier data
then has to be held in memory or spooled to temp storage for later use.
Neither is ideal.<br>
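<br>
A minimal sketch of such a reader, assuming the workers can be handed
chunk data instead of seeking in the file themselves. The class name
HashingReader, the choice of SHA-256 and the chunk size are all
hypothetical; the only point is that one serial pass can feed the chunks
and compute the whole-file digest at the same time.<br>
<pre wrap="">
```python
import hashlib
import io


class HashingReader:
    """Hypothetical single reader: one serial pass over the source."""

    def __init__(self, stream, chunk_size):
        self._stream = stream
        self._chunk_size = chunk_size
        self._hash = hashlib.sha256()  # assumption: any incremental digest works

    def chunks(self):
        """Yield (index, data) in file order, hashing each chunk as it is read."""
        index = 0
        while True:
            data = self._stream.read(self._chunk_size)
            if not data:
                return
            self._hash.update(data)
            yield index, data
            index += 1

    def hexdigest(self):
        """Whole-file digest; only complete once chunks() is exhausted."""
        return self._hash.hexdigest()


# Usage: the chunks would be queued to the parallel workers here.
reader = HashingReader(io.BytesIO(b"abcdefghij"), chunk_size=4)
pending = list(reader.chunks())
whole_file_digest = reader.hexdigest()
```
</pre>
Anything queued but not yet consumed is exactly the data that has to sit
in memory or be spooled.<br>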

<br>

<br>

However, I don't really see the requirement for such an algorithm (at
least from where I stand on the sidelines).  A digest of each chunk has
to be accepted as a valid way to verify that chunk's data.  It should
then be possible to verify a file by breaking it back into the original
chunks and checking that each chunk matches its source digest.  As you
point out, a digest of a bunch of digests should also be acceptable
(although the cryptographers no doubt have problems with this).  Of
course, it is a much less convenient number to work with, because
computing it for an arbitrary file is a bit complicated: you first have
to figure out how the file was backed up (chunk sizes, etc.).<br>
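<br>
To illustrate why the digest of digests is awkward to reproduce, here is
a small sketch. SHA-256, the fixed chunk size and both function names
are assumptions for illustration only, not what the backup tool actually
does.<br>
<pre wrap="">
```python
import hashlib


def chunk_digests(data, chunk_size):
    """Digest each fixed-size chunk independently (could run in parallel)."""
    return [hashlib.sha256(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)]


def digest_of_digests(digests):
    """Hash the per-chunk digests, concatenated in chunk order."""
    top = hashlib.sha256()
    for d in digests:
        top.update(d)
    return top.hexdigest()


data = b"0123456789abcdef" * 4
with_small_chunks = digest_of_digests(chunk_digests(data, 4))
with_large_chunks = digest_of_digests(chunk_digests(data, 8))
plain_digest = hashlib.sha256(data).hexdigest()
```
</pre>
The three values differ for the same bytes, so the top-level digest can
only be checked by someone who knows the chunking that was used.<br>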

<br>

<br>

I think some sort of digest algorithm which can be computed in parallel
is the ideal option. I'm not sure whether anything suitable for our
purposes actually exists, though. A quick Google search didn't turn up
anything better than plain old CRC32.<br>
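<br>
For what it is worth, CRC32 does have the property wanted here: the CRC
of a concatenation can be derived from the per-chunk CRCs and lengths
alone. The sketch below is a slow illustration that uses only Python's
zlib.crc32; the helper names are hypothetical. The C zlib library
provides crc32_combine(), which does the same job in logarithmic rather
than linear time. CRC32 is of course not a cryptographic digest.<br>
<pre wrap="">
```python
import zlib


def crc32_combine(crc_a, crc_b, len_b):
    """Return crc32(A + B) given crc32(A), crc32(B) and len(B)."""
    zeros = bytes(len_b)
    # Push crc_a through len_b zero bytes, then cancel the constant that
    # CRC32's initial and final XOR contribute to that step.
    shifted = zlib.crc32(zeros, crc_a) ^ zlib.crc32(zeros)
    return shifted ^ crc_b


def combined_crc(parts):
    """Fold (crc, length) pairs, given in file order, into one CRC32."""
    total = 0  # CRC32 of the empty byte string
    for crc, length in parts:
        total = crc32_combine(total, crc, length)
    return total


# Each (crc, length) pair could come from a separate worker process.
chunks = [b"hello ", b"parallel ", b"world"]
parts = [(zlib.crc32(c), len(c)) for c in chunks]
whole = combined_crc(parts)
```
</pre>
Here the workers never need to see each other's data, only to report
their CRC and chunk length in order.<br>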

<br>

Failing that, I don't see how you can do any better than "snapshotting"
the source file: either by implementing a single-threaded reader which
perhaps spools temp data to disk, by simply copying the file somewhere
before you operate on it, or optionally by making use of filesystem
features to snapshot the file (XFS, etc.).<br>

<br>

Ed W<br>

<br>

<br>

</body>

</html>