About .bldd files.

------ The Instant Guide To Expanding a BLDD

make
./BLDDexpand  <infile> <outfile>

In practical usage, a .bldd file should be stored inside some form of compression - gzip, bzip2, rar or 7zip, perhaps - along with this file and (optionally) the BLDDexpand utility.

------ About

BLDD is not, strictly speaking, a form of compression. It is a closely related thing: block-level deduplication. Its practical effect is the same: data goes in and, if possible, less data comes out, reversibly. Like compression, it works much better on some files than others. It works especially well on disk/media images, and on tar files whose contents contain duplicated, aligned, fixed-size blocks - especially tar files containing multiple disk/media images. Think of things like backups, or archives of thousands of identical or near-identical files such as user areas.

Because BLDD is not truly compression, it stacks with compression. In practice, the smallest way to compact data will usually be to place the files in a .tar file (if there is more than one file to be compressed), BLDD-encode the .tar file, and then 7zip the result. BLDD always goes before the compression, as compressing destroys the repeating block structure that BLDD depends upon. Used alone, BLDD is almost always inferior to 7z - but the two programs in conjunction can achieve compression significantly greater than either one alone.

As an example, consider my test tar file of assorted JPEGs, text files, zips and similar real-world data:
85G testhuge.tar
79G teststreamencoder.tar.bldd
76G testhuge.tar.bldd
68G testhuge.tar.7z
65G testnonstream.tar.bldd.7z

As can be seen, 7z (-md=256m) outperforms BLDD alone, but the two used in conjunction are substantially better than either by itself - a saving of 4.5% over 7z on this real-world data set. This is how BLDD is intended to be used: as a supplement to existing compression, not a replacement.


------ The two compactors.

I have written two compactors for making .bldd files: BLDDcompact and BLDDcompactstream. They may be regarded as the 'extreme' encoder and the 'pathetic' encoder, respectively. Only the latter can operate on a stream, but this ability comes at the cost of a reduced ability to find duplicate blocks. BLDDcompactstream is more a proof-of-concept: I have largely abandoned it to focus on the main BLDDcompact encoder, though I recognise there is room for performance improvements of many orders of magnitude.

BLDDcompact produces the smallest, closest-to-optimal files - but it cannot read a stream from stdin; it must have a file to read. It also has the disadvantage of placing a number of very large files in /tmp/ and performing enough disk activity to slow the computer down. A lot. Note that this program needs a great deal of space in /tmp/ to run - if you haven't enough, it could easily fill a small drive.

Its use could not be simpler:
  BLDDcompact <infile> <outfile.bldd> [-bs blocksize]
The block size defaults to 512B, but you can set it higher to speed up deduplication if you know the data is aligned on larger blocks. Handy if you're backing up a filesystem and know the allocation unit size.

BLDDcompactstream can compact a stream, without touching the disk. This is intended for a different niche - backups. You can use it to back up a hard drive, or even, in theory, to compact tape backups as they run. In practice, performance is pathetic, so it's really more of a curiosity: encoding is unusably slow, and the space savings are minimal.

A typical use would be:
  tar -c <directory> | BLDDcompactstream | gzip -9 > file.tar.bldd.gz

------

The reference implementation of the expansion side is called BLDDexpand. It is sufficiently simple that anyone with even minimal knowledge of C should be able to inspect the code and verify that it is free of malicious content. It is also written for ultra-portability, using no libraries beyond libc and stdio. It compiles under gcc, mingw or the cl compiler from Visual Studio 2010 for Windows - though, for some odd reason, the latter only if the file has the .cpp extension, even though it should be perfectly valid C.

The lack of a file size field means that all expanded BLDD files will be padded to the nearest multiple of the block size - by default 512 bytes. Deal with it. This is not a general-purpose compression program - it's made for tar files and disk images, both of which are already guaranteed to be a multiple of the block size in length (512 bytes for .tar).


-----

At this point, I would *NOT* depend upon BLDD for any Serious Business without test-extracting and comparing every file. It has not been thoroughly tested. It appears to work, but this is a one-man project by a hobbyist programmer, not a professional developer. Unless you expand and md5-compare to make quite sure the program is working correctly, there is a risk that an undiscovered bug may corrupt your data. I've run it through numerous test files large and small, but all I can be sure of is that it works correctly for those files, in my environment.

In particular, the Windows version has had practically no testing at all. The extractor should work fine, but the deduper itself is touch-and-go at best.
