A .NET archival tool using Serialization and Compression/Encryption IO Streams

Recently I was looking at writing a file archival, encryption and compression utility in managed code. Since I didn't want to re-invent anything, I wrote a simple pipeline framework in which each stage wraps the stream of the stage before it. The head stream is then used for any read or write IO. Each stage is configured to encrypt or decrypt, compress or expand, depending on the direction of the IO (read or write).

FileStream <-- CompressionStream <--- EncryptionStream (Head, writing)

FileStream --> DecompressionStream ---> DecryptionStream (Head, reading)

 

This is all very simple, and works like a command shell IO pipe. I can write data to the top of the stream and it filters down to the bottom layers, where it eventually ends up in the file system (or in memory). When I was done, I thought it might be interesting to share how .NET serialization can be combined with the IO streams to make some cool technology with little work.
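To make the stacking concrete, here is a minimal sketch of the two directions using the framework's own GZipStream and CryptoStream, with a MemoryStream standing in for the file. The class and method names (PipelineSketch, Pack, Unpack) are made up for illustration and are not the stage classes from my framework; the key and IV are simply passed in by the caller.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

// Hypothetical names; a sketch of the stacking order only.
public static class PipelineSketch
{
    // Writing: caller -> CryptoStream (head) -> GZipStream -> sink
    public static byte[] Pack(byte[] data, byte[] key, byte[] iv)
    {
        using (var sink = new MemoryStream())
        {
            using (var aes = Aes.Create())
            using (var encryptor = aes.CreateEncryptor(key, iv))
            using (var gzip = new GZipStream(sink, CompressionMode.Compress, true))
            using (var head = new CryptoStream(gzip, encryptor, CryptoStreamMode.Write))
            {
                head.Write(data, 0, data.Length);
            } // disposing the head writes the final cipher block and closes the gzip stage
            return sink.ToArray();
        }
    }

    // Reading: source -> GZipStream -> CryptoStream (head) -> caller
    public static byte[] Unpack(byte[] packed, byte[] key, byte[] iv)
    {
        using (var aes = Aes.Create())
        using (var decryptor = aes.CreateDecryptor(key, iv))
        using (var source = new MemoryStream(packed))
        using (var gzip = new GZipStream(source, CompressionMode.Decompress))
        using (var head = new CryptoStream(gzip, decryptor, CryptoStreamMode.Read))
        using (var result = new MemoryStream())
        {
            head.CopyTo(result);
            return result.ToArray();
        }
    }
}
```

Calling Unpack(Pack(data, key, iv), key, iv) with a 32-byte key and a 16-byte IV should hand back the original bytes. Swapping the MemoryStream for a FileStream is the only change needed to land the output on disk.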

 

The component I specifically needed was an archiver, like TAR, plus the other pipeline stages that I use on the command line on my Linux box (gasp!):

 

tar cf - * | gzip -c > tarfile.tar.gz

 

Since the .NET Framework already offers serialization, I decided to use it to serialize objects that represent the files. Let's not exaggerate; TAR is much cooler than anything I wrote (and is much harder to use). But since Microsoft doesn’t supply a nice managed/unmanaged archiving tool, and WinZip is not on my test machines and doesn't offer encryption, I figured it would be faster just to write one, with the added bonus that I could make it do custom actions by plugging in different stages.

 

By creating a class that encapsulates the file's full path, file name, and data (in this case only part of a file's data, not including alternate streams or ACLs), we have a simple archiving mechanism. You need to write some code that is triggered by the OnSerializing and OnDeserialized attributes: the first reads the file data (into the fileData member) from the correct file, and the second writes it out to the newly created or overwritten file upon deserializing (unpacking the archive).

 

[Serializable]

public class FileArchiver

{

            private String fileName;

            private byte[] fileData;

 

            ...

}
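As a rough sketch of how those two callbacks might be filled in (the method names LoadFileData and WriteFileData, the constructor, and the class name FileArchiverSketch are my own placeholders, not the code elided above): the attributes come from System.Runtime.Serialization. The small demo uses DataContractSerializer rather than the binary serializer, on the assumption that it honours [Serializable] fields and the same callback attributes, because the binary formatter is not usable on recent .NET versions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;

[Serializable]
public class FileArchiverSketch
{
    private string fileName;
    private byte[] fileData;

    public FileArchiverSketch(string fileName) { this.fileName = fileName; }

    // Runs just before the object is serialized: pull the file into memory.
    [OnSerializing]
    private void LoadFileData(StreamingContext context)
    {
        fileData = File.ReadAllBytes(fileName);
    }

    // Runs after the object is deserialized: recreate (or overwrite) the file.
    [OnDeserialized]
    private void WriteFileData(StreamingContext context)
    {
        File.WriteAllBytes(fileName, fileData);
    }
}

public static class ArchiveDemo
{
    // Archives one temp file, deletes it, unpacks it, and reports whether it came back.
    public static bool RoundTrip()
    {
        string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        File.WriteAllText(path, "archived content");

        var files = new List<FileArchiverSketch> { new FileArchiverSketch(path) };
        var serializer = new DataContractSerializer(typeof(List<FileArchiverSketch>));

        using (var archive = new MemoryStream())
        {
            serializer.WriteObject(archive, files); // triggers LoadFileData
            File.Delete(path);
            archive.Position = 0;
            serializer.ReadObject(archive);         // triggers WriteFileData
        }

        bool restored = File.ReadAllText(path) == "archived content";
        File.Delete(path);
        return restored;
    }
}
```

The nice part of this design is that the list itself knows nothing about file IO; each element loads and restores its own file as a side effect of being serialized.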

 

The archive is simply a serialization of a List<FileArchiver> instance, which contains all of the data stored in the respective files.

 

List<FileArchiver> filesToAddToArchive;

serializer.Serialize( stream, filesToAddToArchive);

 

Make sure you use the BinaryFormatter, for smaller file sizes and faster archival run times! XML is cool, as cool as UTF-8 can be, but far less efficient than binary formats. I once timed the difference between a binary protocol and an XML-based protocol, and I think there was a 1000% difference in their transaction times. XML processing is slow, even if writing the wordy output is not necessarily much slower.

 

Now, what about encryption? Asymmetric encryption was important for the application, since using a shared key is less secure, and I wanted to use simple RSA keys rather than certificates. These keys are usually expressed as two parts, a public and a private key, and stored in a file or in a Windows Key Container (more secure). To do the actual encryption, the encryption method generates a temporary session key, which is then used to encrypt the data. This session key is itself encrypted with the public part of the RSA key, and either embedded into the file or sent separately.

 

In the application we want to encrypt in-line, meaning the file is processed only once. Plus this is a pipeline, so it must read and write in one go, and we are already using serialization to write the file archive. So logically we need to serialize the encrypted key as well.

 

When archiving, the encryption stage reads in the RSA public key during initialization and writes out the temporary session key generated by the symmetric algorithm, encrypted with that public key. The same session key is used later for the symmetric encryption of the actual data, and you can use any symmetric algorithm you like here. The standard for symmetric encryption is AES, which is required for all government and hospital IT systems, at least in the US. I believe that this will become part of a larger standard as well. RijndaelManaged is the .NET implementation and it is definitely the symmetric algorithm to use. If you are doing any serious encryption, use a 128- or 192-bit key as well.

 

So, if we encrypt the session key and Initialization Vector (IV) with the RSA public key, we can serialize this data into the file stream as well, using the same mechanism. Here is our simple property-bucket class to hold these two byte arrays:

 

[Serializable]

class EncryptedSessionKey

{

            public byte[] encryptedKey;

            public byte[] encryptedIV;

           

            public EncryptedSessionKey() {}

 

}
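Filling that bucket could look roughly like the sketch below. Wrap, Unwrap, RoundTrip, and the *Sketch class names are hypothetical helpers of mine; RSA.Encrypt and RSA.Decrypt with RSAEncryptionPadding.OaepSHA1 are framework calls. I assume a freshly generated RSA key pair here rather than one loaded from a file or key container.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;

// Local stand-in for the property-bucket class above, so the sketch is self-contained.
[Serializable]
public class EncryptedSessionKeySketch
{
    public byte[] encryptedKey;
    public byte[] encryptedIV;
}

public static class SessionKeySketch
{
    // Archiving side: only the public part of the RSA key is needed.
    public static EncryptedSessionKeySketch Wrap(RSA publicKey, byte[] key, byte[] iv)
    {
        return new EncryptedSessionKeySketch
        {
            encryptedKey = publicKey.Encrypt(key, RSAEncryptionPadding.OaepSHA1),
            encryptedIV = publicKey.Encrypt(iv, RSAEncryptionPadding.OaepSHA1)
        };
    }

    // Unpacking side: the private key recovers the session key and IV.
    public static void Unwrap(RSA privateKey, EncryptedSessionKeySketch header,
                              out byte[] key, out byte[] iv)
    {
        key = privateKey.Decrypt(header.encryptedKey, RSAEncryptionPadding.OaepSHA1);
        iv = privateKey.Decrypt(header.encryptedIV, RSAEncryptionPadding.OaepSHA1);
    }

    // Generates a throwaway key pair and session key, then checks the round trip.
    public static bool RoundTrip()
    {
        using (var rsa = RSA.Create())
        using (var aes = Aes.Create())
        {
            var header = Wrap(rsa, aes.Key, aes.IV);
            byte[] key, iv;
            Unwrap(rsa, header, out key, out iv);
            return key.SequenceEqual(aes.Key) && iv.SequenceEqual(aes.IV);
        }
    }
}
```

The header object is then serialized to the stream before any file data, so the reader can recover the session key before it has to decrypt anything.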

 

 

Then we can start writing the rest of the data to the archival data stream. Now that the encryption module has written out the encrypted key header, it will encrypt all further data written through it, and thus the output file looks like this, with two sections, each a nesting of the steps applied to it. This includes the compressor pipeline stage, which is more obvious:

 

Compressed[Serialized EncryptedSessionKey]

Compressed[Encrypted{Serialized List<FileArchiver>}]

 

Cool serialization trick, certainly not novel, but still interesting. Make sure you configure the symmetric encryption to use padding for writes that are not a multiple of the block size, otherwise you may not be able to write the encrypted data. By default, I believe a padding mode (PKCS7) is enabled.
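A quick way to see the padding at work is to encrypt a payload that is not block-aligned and look at the output length. This is only a sketch: PaddingSketch and EncryptedLength are hypothetical names, and the key and IV are whatever Aes.Create() generates.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class PaddingSketch
{
    // Encrypts plainLength zero bytes and returns the size of the ciphertext.
    public static int EncryptedLength(int plainLength)
    {
        using (var aes = Aes.Create())
        {
            aes.Padding = PaddingMode.PKCS7; // the default, made explicit here
            using (var sink = new MemoryStream())
            {
                using (var encryptor = aes.CreateEncryptor())
                using (var crypto = new CryptoStream(sink, encryptor, CryptoStreamMode.Write))
                {
                    crypto.Write(new byte[plainLength], 0, plainLength);
                } // disposing the CryptoStream pads and writes the final block
                return sink.ToArray().Length;
            }
        }
    }
}
```

With the 16-byte AES block size, a 5-byte write comes out as 16 bytes, and a 16-byte write comes out as 32, because PKCS7 always appends at least one byte of padding.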

 

The reverse of this process has exactly the same steps; the only difference is that the objects are deserialized and each pipeline stage is configured to do the reverse of what it was doing before.
