Memory + GPG

robb robb at canfield.com
Thu Aug 30 13:24:33 UTC 2007


My comment only applies if you change the chunk_size parameter between
backups. In that case any file that was split into chunks on a prior
backup, or that will be split into chunks on the new backup, will be
backed up again (assuming timestamps changed or the Brackup digest file
is lost). The sketch after the examples below makes the arithmetic concrete.

Example:

 # first pass
 big-file.data = 1MB
 chunk_size = 512KB
 backed up as two chunks

 # Second pass
 chunk_size = 256KB
 will be re-backed up as 4 chunks

Example 2:

 # first pass
 big-file.data = 1MB
 chunk_size = 2MB
 backed up as one chunk

 # Second pass
 chunk_size = 1MB
 existing chunk is detected, no backup occurs because the file
 still fits within the new chunk size.
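
To make the arithmetic concrete, here is a rough Python sketch (my own
illustration, not Brackup's actual code or digest format) of splitting a
file at fixed offsets and digesting each piece. A different chunk_size
produces a different split, so none of the new digests match the old ones:

 import hashlib

 def chunk_digests(path, chunk_size):
     """Split a file at fixed offsets and digest each piece.
     Simplified illustration only; Brackup's real chunking and
     digest naming differ."""
     digests = []
     with open(path, 'rb') as f:
         while True:
             chunk = f.read(chunk_size)
             if not chunk:
                 break
             digests.append(hashlib.sha1(chunk).hexdigest())
     return digests

 # A 1MB file yields 2 chunks at 512KB but 4 chunks at 256KB, and none
 # of the 256KB digests match the 512KB ones, so every chunk looks new.
 # At 2MB and then 1MB the same file is a single identical chunk either
 # way, so nothing needs to be re-uploaded.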


Again, this only applies if the digest file is lost or the file stats have
changed enough for Brackup to back up the file again. If Brackup does
not detect a change in the file then no backup will occur at all.

It is probably a good time to mention that Brackup's chunk mechanism is
excellent at re-using chunks between files and not so good at reusing
chunks within a changed file. Chunks sit at fixed positions, so if a file
has a single byte prefixed to it then ALL of its chunks will differ.
This is unlike rsync, which uses a rolling checksum and is good at picking
up changes within a file. Adding rsync-like checksums to Brackup is
probably possible but is on the far-future list.
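
As a quick demonstration of the fixed-position behaviour, using the
hypothetical chunk_digests() sketch from above:

 # Prefix a single byte; every fixed-offset chunk now has different
 # contents, so no digests are shared with the original file.
 original = open('big-file.data', 'rb').read()
 with open('shifted.data', 'wb') as f:
     f.write(b'X' + original)

 a = chunk_digests('big-file.data', 512 * 1024)
 b = chunk_digests('shifted.data', 512 * 1024)
 print(len(set(a) & set(b)))   # almost certainly 0 chunks in common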

This chunking mechanism is why it is usually more efficient to back up
snapshots of databases instead of SQL dumps (which are the common/easy
way). The binary files tend to change in ways that do not reposition
all the data, while the SQL dumps often differ significantly in their
offsets. Of course, a very large block size wipes out any advantage
of chunking on a database-like file.
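
And the database-snapshot case: an in-place change leaves offsets alone,
so only the chunk containing the modified bytes gets a new digest. Again
this uses the illustrative chunk_digests() sketch, not Brackup itself:

 # Flip one byte in the middle of the file; offsets do not move, so
 # only one fixed-position chunk changes.
 data = bytearray(open('big-file.data', 'rb').read())
 data[len(data) // 2] ^= 0xFF
 with open('inplace.data', 'wb') as f:
     f.write(bytes(data))

 a = chunk_digests('big-file.data', 512 * 1024)
 c = chunk_digests('inplace.data', 512 * 1024)
 print(sum(x != y for x, y in zip(a, c)))   # 1 changed chunk out of 2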

I think I have this right, but since I did not write Brackup I may have
some of these details off.

Richard Edward Horner wrote:
> Robb,
> 
> Can you explain what you mean by "Of course this will cause any file
> larger than this chunk size to be backed up again!"
> 
> Are you saying that every time you back up a given target, a file whose
> size is greater than the chunk size will be backed up even if it hasn't
> changed? Or something else?
> 
> Thanks, Rich(ard)
> 
> On 8/29/07, robb <robb at canfield.com> wrote:
>> I just finished some tests to verify my assumptions on Brackup memory
>> consumption. The following are added together with the total RAM
>> consumption varying based on the current chunk size being worked with:
>>
>> * Array of files to backup (plus file stats)
>> * The smaller of chunk_size and the file size
>> * Size of GPG chunk data (often much smaller than the file size)
>> * Overhead of a digest per file (used to be ALL files)
>>
>> So for a chunk size of 64MB (the default) and a file that matches it, RAM
>> consumption will zoom to 64MB + the GPG size; assuming the worst case of a
>> random file, the total would be 128MB for a very short while. Then the
>> size is reduced to the size of the GPG chunk (or, if GPG is not being
>> used, the size of the file chunk) during the time it takes to write the
>> file to the target device.
>>
>> If you are running out of RAM the easiest thing to do is adjust your
>> chunk_size to something smaller. Of course this will cause any file
>> larger than this chunk size to be backed up again!
>>
>> I have some experimental code that removes this per-chunk and might even
>> speed things up a tad. But it needs integration into existing Target
>> types and some more testing.
>>
>> Robb
>>
>>
>>
>> Richard Edward Horner wrote:
>>> Well, yeah, a bigger problem though is in VPS implementations. Most
>>> seem to be designed to be sold, not to work well. They all seem to
>>> dynamically scale CPU but not RAM. It's like you're stuck with
>>> whatever amount of RAM. I know that this is in part because of the
>>> kernel design but I recall some patches being submitted recently that
>>> allow for dynamic scaling of RAM in the kernel.
>>>
>>> Later, Rich(ard)
>>>
>>> On 8/29/07, robb <robb at canfield.com> wrote:
>>>> Agreed and done, dry-run effectively disables GPG
>>>>
>>>> I am not sure how much memory leaking was occurring with GPG for
>>>> multi-gig runs. I may try to test that. But otherwise I found leaking to
>>>> be in the 20-30 MB range and that's not enough to explain David's
>>>> issue. But I have not yet torn into S3 processing.
>>>>
>>>> One thing that could be a problem is that Brackup retains the chunk in
>>>> RAM! For the default of 64MB that adds up to a whole lot of RAM usage
>>>> VERY quickly. From what I can tell the total would never exceed 2x (so
>>>> 128MB) since encrypted chunks are not read from disk until needed. But
>>>> still, 128MB is a LOT of RAM. The easiest way to handle this is setting
>>>> the chunk size to 5MB or so. Changing the way chunks are handled is
>>>> tricky but I will probably look at it when I try to add alternate
>>>> encryption/compression filters later this week (as time allows).
>>>>
>>>> Another RAM issue is that the file list is built THEN processed. While
>>>> it's nice to know a completion estimate it does chew up a lot of RAM to
>>>> pre-build the file list. Perhaps an option to estimate versus a
>>>> scan/backup combination is in order. That would save another 20-40 MB
>>>> for large backup sets. In addition the pre-scan has some issues with
>>>> files missing, permissions/ownership changes between scan and backup,
>>>> etc. None of these are huge for small sets but for 100 GB and 30,000+
>>>> files it becomes a bit of a problem.
>>>>
>>>> My main issue with RAM is that Brackup is destined for a number of VPS
>>>> systems I have (local and remote). These systems have optimized RAM
>>>> usage and I need something that is as small as reasonable (as my time
>>>> allows), measurable, and predictable.
>>>>
>>>> But with all the changes I have done I will need to remeasure RAM
>>>> performance from scratch some day. I know I am using less of it than
>>>> prior versions of my code but have not done an end-to-end comparison yet.
>>>>
>>>> Richard Edward Horner wrote:
>>>>> Robb,
>>>>>
>>>>> Awesome work.
>>>>>
>>>>> I had actually suspected there might have been an issue with this when
>>>>> David posted his problem, hence my asking if he was using GPG.
>>>>>
>>>>> On the --dry-run issue, I think ppl expect it to tell you what would
>>>>> be done but not do things. If you read the man page for many ppl's
>>>>> favorite package manager, apt-get, which I think would be a good point
>>>>> for establishing expected behavior, it says:
>>>>>
>>>>> --dry-run
>>>>> No action; perform a simulation of events that would occur but do not
>>>>> actually change the system.
>>>>>
>>>>> "No action" is pretty clear but "perform a simulation" isn't exactly
>>>>> the same as "no action". Sorry, I come from a family of lawyers.
>>>>>
>>>>> Usually when you do a dry run on something, you just want to quickly
>>>>> see what it would do, so not invoking GPG would be beneficial cuz it
>>>>> would be faster and also if you're invoking --dry-run cuz you're
>>>>> trying to back up a failing disk and you're not sure how many more
>>>>> read/writes you're gonna get, having it do anything is not good. I
>>>>> would be inclined to say have it really do nothing more than print
>>>>> messages to the console. If you want further action, there can be
>>>>> another flag.
>>>>>
>>>>> Thanks, Rich(ard)
>>>>>
>>>>> On 8/29/07, robb <robb at canfield.com> wrote:
>>>>>> I had to tear into the GPG processing to locate some temp file
>>>>>> anomalies. I found that temporary files are not always cleaned when GPG
>>>>>> is active, and can accumulate at an alarming rate on large backups.
>>>>>> That's fixed, along with some minor memory-leak problems.
>>>>>>
>>>>>> Added improved recovery code for backups so that an error backing up a
>>>>>> file/chunk no longer aborts the process. This is set via the option
>>>>>> --onerror (the default is to halt).
>>>>>>
>>>>>> An error log is now maintained for the run so that if
>>>>>> '--onerror=continue' is given there is still a place to examine for
>>>>>> errors. The name is based on the brackup metafile name with '.err'
>>>>>> suffixed. Logs are maintained for dry-runs as well and they are suffixed
>>>>>> with '-dry.err'.
>>>>>>
>>>>>> I also found that --dry-run does NOT disable GPG. So the GPG process
>>>>>> manager will happily burn CPU and disk creating files that are then
>>>>>> deleted (via the new clean up code). Should GPG be disabled during dry
>>>>>> runs? I suppose testing GPG on every file to see if it works or not
>>>>>> might be useful, but that seems excessive.
>>>>>>
>>>>>>
>>>>>>
>>>
>>
> 
> 