« Why I hate every program made by ISC | Main | A time for Quebec to clean up its mess »

février 16, 2007

The rsync algorithm applied to S3

Many people would like to see rsync work with S3, unfortunately S3 is a "dumb" file storage that can't run rsync. It's possible to run rsync on a EC2 that accesses your files on S3, but it's complex, (more) expensive and not elegant.

I spent some time reading the rsync algorithm and seeing how it could be adapted to S3 (or any other dumb file storage like a FTP server).

Suppose we have one general purpose computer α and a dumb file storage service S3. We want to minimize the amount of data transfered between α and S3.

Upload

Computer α wants to place file A on S3.

The modified rsync algorithm consists of the following steps:


  1. α splits the file A into a series of non-overlapping fixed-sized blocks of size S bytes. The last block may be shorter than S bytes.

  2. For each of these blocks α calculates two checksums: a weak "rolling" 32-bit checksum (described in the rsync algorithm) and a strong 128-bit MD4 checksum.

  3. α creates a XML file A' on S3, that contains the checksums calculated in step 2, a whole-file MD5 checksum and the metadata for file A

  4. α compresses and encrypt each block of file A, then stores each of them in a different object on S3. Each of these block-object has its MD4 checksum as its name and is stored in a bucket P. If the block-object already exists in P, there's no need to resend the file.

Download

Computer α wants to get file B from S3 and already has file A where file A and B are "similar."

The modified rsync algorithm consists of the following steps:


  1. α splits the file A into a series of non-overlapping fixed-sized blocks of size S bytes. The last block may be shorter than S bytes.

  2. For each of these blocks α calculates two checksums: a weak "rolling" 32-bit checksum (described in the rsync algorithm) and a strong 128-bit MD4 checksum.

  3. α gets XML file B' from S3.

  4. α searches through A to find all blocks of length S bytes (at any offset, not just multiples of S) that have the same weak and strong checksum as one of the blocks of A. This can be done in a single pass very quickly using a special property of the rolling checksum described in the rsync algorithm.
  5. α gets each block-object listed in B' from P that is does not already has. It then decrypts and decompresses them.
  6. α reconstitutes file B using blocks from file A and blocks downloaded from P.

  7. α validates the integrity of the file by verifying it's checksum listed in B'.

  8. Α restores the file metadata as listed in B'.

Similarity with existing products

With hindsight, this looks a lot like what S3InfiDisk is doing, with the exception that S3InfiDisk uses 1 MB blocks while rsync uses 1 KB blocks.

Posted by gfk at février 16, 2007 8:31 AM

Comments

Hi, There is full implantation, that a look at: http://www.s3rsync.com

Posted by: Addady [TypeKey Profile Page] at mai 15, 2008 4:56 AM

Post a comment

Thanks for signing in, . Now you can comment. (sign out)

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)


Remember me?