update on the low-memory patch

Mon Jan 4 11:23:24 UTC 2010

> I've started to look at your lowmem patches here:
> 
>   http://codereview.appspot.com/135046/show

Hi Gavin,

thanks for looking at it.

> It's a pretty big changeset, and it seems to me that there are at least 3 
> different sets of changes here:
> 
> - the lowmem changes, passing around filehandles instead of file content
> 
> - the concurrent changes patches, for better handling of changes to files 
>   as they're being backed up
> 
> - test changes, primarily the inclusion of check_inventory_db
> 
> Are these more interrelated than I'm seeing, or could they be split up for
> easier review?

You're right, I can split it in three parts. The second (concurrent
changes) will be dependent on the first, since it uses the new
filehandle interface, but you can still review them separately. I'm
pretty busy today and tomorrow, I'll try to send the splitted patches on
Wednesday.

> I did a little bit of testing today with this changeset, doing a medium
> sized backup (3GB) to a local sftp target. RAM usage was great - 25% better 
> than trunk - 

Good. Most importantly RAM should be independent from both chunk_size
and the backup size.

> but it was also a fair bit slower (~30%). Haven't tried other
> targets yet though, or looked at all into the performance issue.

I guess you're not using encryption, right? In this case the performance
hit is caused by the concurrent changes patch, which copies chunks to a
temp file before uploading to target. Let me explain.

The problem is that we want to make sure that the chunk's dig
corresponds to the real data send to the target. In the trunk version,
brackup copies the chunk in memory and computes the dig from there
(apart from the file dig, which is computed from disk and causes one of
the problems that the patch fixes). On the other hand, in the lowmem
patch without the "concurrent" part, if no encryption is used then the
filehandle of the source chunk is used directly to compute the dig and
later to upload the chunk to the target. However the chunk might be
modified in the meanwhile, causing problems. To avoid this, the
concurrent changes patch copies the source chunk to a temp file.

Of course, we could compute the dig on the fly while sending the chunk
to the target, to make sure we digest the real data. However this has a
problem: in the current implementation the target classes need to know
the digest when the upload starts since the dig is used as the filename!

Still, a 30% hit sounds bad for people not using encryption, so we
should fix it. I see 3 solutions:

1. the most "lowmem": Change the target interface so that the upload is
done with a temp filename, compute the dig on the fly and then rename
the target file after the upload finishes. It's optimal but complicated.
Moreover some targets might not support renaming at all (S3 didn't in
the beginning) or need extra permissions (ftp rename is often forbidden).

2. the simplest: keep loading the chunk in memory if encryption is not
used. The good news is that without encryption there's no forking so
we'll predictably consume exactly chunk_size RAM for this, which is not
that bad.

3. Same as 2, but add an option 'low_memory' that dictates to use the
lowest memory possible. In this case we'll copy to a temp file instead
of memory. Btw we could do exactly the same for composite chunks which
need to be constructed either in memory or in a temp file.

What do you think? Personally I find (3) appealing. It shouldn't be that
hard, I could do the changes together with the patch split.

By the way, using the lowmem patch with encryption should have no
performance hit, since a tempfile was used already in the trunk version.
In fact, there should be a gain (probably small) since the trunk version
copies the chunk twice (once to send to gpg and once in gpg's output),
while the patched version sends the data to gpg directly (and computes
the dig on the fly).

Cheers,
Kostas