System.IO.Compression Capabilities [Kim Hamilton]

We often get asked about the capabilities of the .NET compression classes in System.IO.Compression. I'd like to clarify what they currently support and mention some partial workarounds for formats that aren't supported.

At their core, the .NET compression libraries support only one compression format: Deflate. The Deflate format is specified in RFC 1951, and our DeflateStream class is a straightforward implementation of it.

Other compression formats, such as zlib, gzip, and zip, offer deflate as one possible compression method, but may also allow others. When they do use deflate, you can think of these formats as a wrapper around deflate: they take the bytes generated by deflate compression and tack on header info and checksums.

Our GZipStream class does exactly that – it uses DeflateStream and then adds header info and checksums specific to the gzip format. The gzip format is specified in RFC 1952.
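To make the "wrapper" idea concrete, here is a minimal sketch of the gzip layout from RFC 1952. It is written in Python rather than against the .NET classes, and it is not how GZipStream is implemented. The helper names `raw_deflate` and `wrap_as_gzip` are made up for this illustration, and the header uses no optional fields.

```python
import struct
import zlib


def raw_deflate(data: bytes) -> bytes:
    """Compress to a bare RFC 1951 deflate stream (no header, no checksum)."""
    compressor = zlib.compressobj(wbits=-15)  # negative wbits selects raw deflate
    return compressor.compress(data) + compressor.flush()


def wrap_as_gzip(deflate_bytes: bytes, original: bytes) -> bytes:
    """Add a minimal 10-byte gzip header and the 8-byte trailer (RFC 1952)."""
    header = bytes([
        0x1F, 0x8B,  # ID1, ID2: gzip magic number
        0x08,        # CM: 8 means deflate
        0x00,        # FLG: no optional fields, so no file name is stored
        0, 0, 0, 0,  # MTIME: 0 means no timestamp available
        0x00,        # XFL: no extra flags
        0xFF,        # OS: 255 means unknown
    ])
    # Trailer: CRC-32 of the uncompressed data, then its length modulo 2**32,
    # both stored little-endian.
    trailer = struct.pack(
        "<II", zlib.crc32(original) & 0xFFFFFFFF, len(original) & 0xFFFFFFFF
    )
    return header + deflate_bytes + trailer
```

A standard gzip reader should accept the result, for example `gzip.decompress(wrap_as_gzip(raw_deflate(payload), payload))` in Python.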

So, out of the box, we support deflate and gzip formats.

Until we provide support for the other formats, which we plan to do soon, there are partial workarounds that may help you out in some situations, but they're definitely not a complete solution.

Working with zlib

The zlib format is specified by RFC 1950. Zlib also uses deflate, plus 2 or 6 header bytes and a 4-byte Adler-32 checksum at the end. The first 2 bytes (CMF and FLG) indicate the compression method and flags. If the dictionary flag (FDICT) is set, 4 additional bytes (DICTID) follow, which explains why the header is either 2 or 6 bytes. Note that in the wild, preset dictionaries aren't very common (and our classes don't support them).

This diagram from RFC 1950 shows the zlib structure:

            0   1
         +---+---+
         |CMF|FLG|   (more-->)
         +---+---+


      (if FLG.FDICT set)

           0   1   2   3
         +---+---+---+---+
         |     DICTID    |   (more-->)
         +---+---+---+---+

         +=====================+---+---+---+---+
         |...compressed data...|    ADLER32    |
         +=====================+---+---+---+---+

This means that to read a zlib file using only the .NET libraries, you can often just chop off the first 2 bytes and the last 4 bytes and use DeflateStream on the rest of the stream as normal. (It is better to check the dictionary bit in the FLG byte first and, if it is set, not attempt to read the stream at all, since our classes don't support preset dictionaries.)
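The chopping technique can be sketched as follows. This is a Python illustration of the byte layout, not .NET code; a raw inflater (`wbits=-15`) stands in for DeflateStream, and the function name `inflate_zlib_by_hand` is hypothetical. The Adler-32 verification at the end is optional, but it is cheap once the trailer has been sliced off.

```python
import struct
import zlib


def inflate_zlib_by_hand(stream: bytes) -> bytes:
    """Strip the zlib wrapper (RFC 1950) and inflate the raw deflate payload."""
    if len(stream) < 6:
        raise ValueError("too short to be a zlib stream")
    cmf, flg = stream[0], stream[1]
    if cmf & 0x0F != 8:
        raise ValueError("compression method is not deflate")
    if (cmf * 256 + flg) % 31 != 0:
        raise ValueError("corrupt zlib header (FCHECK failed)")
    if flg & 0x20:  # FDICT bit: a 4-byte DICTID follows the 2 header bytes
        raise ValueError("preset dictionaries are not supported")

    payload = stream[2:-4]  # drop CMF/FLG and the trailing ADLER32
    inflater = zlib.decompressobj(wbits=-15)  # raw deflate, no wrapper expected
    data = inflater.decompress(payload) + inflater.flush()

    (expected,) = struct.unpack(">I", stream[-4:])  # checksum is big-endian
    if zlib.adler32(data) & 0xFFFFFFFF != expected:
        raise ValueError("Adler-32 mismatch")
    return data
```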

Going in the opposite direction isn't as trivial, so I'm not really suggesting that you generate zlib files this way. However, a couple of people have asked in the past, so I'll sketch an overview of that.

To start, you need to know which bytes to add at the beginning. With our deflate implementation, those bytes are 0x58 and 0x85. If you're curious about how this is derived from RFC 1950, see section 2.2 "Data format" and note that we use a window size of 8K and the value of FLEVEL should be 2 (default algorithm).
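For the curious, the derivation of those two bytes can be worked through in a few lines. This is a sketch of the RFC 1950 section 2.2 arithmetic in Python; the function name `zlib_header` is made up for the illustration, and FDICT is assumed to be clear.

```python
def zlib_header(window_size: int, flevel: int) -> bytes:
    """Build the 2-byte zlib header (CMF, FLG) without a preset dictionary."""
    cm = 8                                     # CM: 8 means deflate
    cinfo = window_size.bit_length() - 1 - 8   # CINFO = log2(window size) - 8
    cmf = (cinfo << 4) | cm
    flg = flevel << 6                          # FLEVEL in bits 6-7, FDICT (bit 5) clear
    # FCHECK (bits 0-4) makes CMF*256 + FLG a multiple of 31.
    fcheck = (31 - (cmf * 256 + flg) % 31) % 31
    return bytes([cmf, flg | fcheck])


print(zlib_header(8 * 1024, 2).hex())   # 5885: 8K window, FLEVEL 2
print(zlib_header(32 * 1024, 2).hex())  # 789c: 32K window, FLEVEL 2
```

With an 8K window, CINFO is 5, so CMF is 0x58; FLEVEL 2 gives 0x80, and FCHECK must be 5 for 0x5885 to be divisible by 31.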

After that, you need to add the Adler-32 checksum at the end. The checksum depends on the payload that you're compressing, so you need to calculate it programmatically. Because of this, the easiest way to generate the checksum is to subclass DeflateStream and override the Write/BeginWrite methods to update the checksum as data passes through. Stephen Toub's NamedGZipStream article (mentioned at the end) shows an example of creating such a subclass for generating named gzip files.
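The same idea, updating a running checksum on every write, can be sketched outside .NET. The Python below is only an illustration: `ZlibWriter` and `adler32_update` are hypothetical names, and the raw deflater is configured with an 8K window so that it matches the 0x58 0x85 header bytes discussed above. It does not reproduce the output of DeflateStream.

```python
import struct
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32_update(checksum: int, chunk: bytes) -> int:
    """Fold one more chunk into a running Adler-32 value (initial value is 1)."""
    a = checksum & 0xFFFF
    b = (checksum >> 16) & 0xFFFF
    for byte in chunk:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a


class ZlibWriter:
    """Collects a zlib stream: header, raw deflate bytes, Adler-32 trailer."""

    def __init__(self) -> None:
        # wbits=-13: raw deflate with an 8K window, matching the header below.
        self._deflater = zlib.compressobj(wbits=-13)
        self._checksum = 1
        self._chunks = [bytes([0x58, 0x85])]

    def write(self, chunk: bytes) -> None:
        # Checksum covers the uncompressed bytes, so update it before compressing.
        self._checksum = adler32_update(self._checksum, chunk)
        self._chunks.append(self._deflater.compress(chunk))

    def finish(self) -> bytes:
        self._chunks.append(self._deflater.flush())
        self._chunks.append(struct.pack(">I", self._checksum))  # big-endian
        return b"".join(self._chunks)
```

Calling `write` once or many times yields a stream that a standard zlib reader should accept, because the checksum is carried across calls rather than recomputed.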

Working with other compression formats

The big format you're probably thinking about is zip. Currently, the .NET libraries don't support zip, but the J# class libraries do. The following article describes using those libraries from a C# app.

https://msdn.microsoft.com/msdnmag/issues/03/06/ZipCompression/default.aspx

But if you don't want to rely on the J# class libraries, we'll need to provide a better solution.

Now that you're familiar with some compression specifications, let's focus on zip a little more. A zip specification is here:

https://www.pkware.com/documents/casestudies/APPNOTE.TXT

Notice that zip also allows deflate. Again the same principle applies – there are deflate bytes packaged in a header and footer. This may tempt you into writing a zip reader/writer based on DeflateStream (as described above for zlib), but there are two key differences that make zip more complicated.

First, the zip header contains a lot more information than the zlib header. To read a zip file, you'd definitely have to parse the header to figure out how many bytes to skip over because the header contains variable length items such as a file name.

Second, zip tools actively use different compression methods. For example, use the Windows compression tool on a very small text file (with just a few words in it) and then on a bigger file, say around 20 KB. Chances are it used no compression (yes, that's an option) for the small file and deflate for the 20 KB file.

Because different compression methods are used, an extension of the zlib technique described above may not help you much if you want to use the .NET libraries to read zip files. You'd definitely have to read the compression method to determine how to proceed. If it's deflate, then chop off the header and proceed as above. If it's no compression, chop off the header and read the bytes as a normal stream of bytes. If it's something else, then the .NET libraries have no built-in support for it.
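The decision procedure above can be sketched for the first entry of an archive. This Python illustration follows the local file header layout in the zip specification. The function name `read_first_entry` is hypothetical, and it assumes an unencrypted, non-zip64 entry whose sizes are stored in the local header (no trailing data descriptor). A real reader would normally walk the central directory instead.

```python
import struct
import zlib

# Fixed 30-byte part of a local file header; all fields are little-endian.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_first_entry(archive: bytes):
    """Return (name, data) for the first entry of a zip archive."""
    (signature, _version, flags, method, _time, _date, crc,
     compressed_size, _uncompressed_size,
     name_length, extra_length) = LOCAL_HEADER.unpack_from(archive, 0)
    if signature != b"PK\x03\x04":
        raise ValueError("not a local file header")
    if flags & 0x0008:
        raise ValueError("sizes are in a data descriptor; not handled here")

    # The file name and extra field are variable length, so the payload
    # offset has to be computed from the header.
    name_start = LOCAL_HEADER.size
    data_start = name_start + name_length + extra_length
    encoding = "utf-8" if flags & 0x0800 else "cp437"
    name = archive[name_start:name_start + name_length].decode(encoding)
    payload = archive[data_start:data_start + compressed_size]

    if method == 0:    # stored: no compression
        data = payload
    elif method == 8:  # deflate: raw RFC 1951 stream
        inflater = zlib.decompressobj(wbits=-15)
        data = inflater.decompress(payload) + inflater.flush()
    else:
        raise ValueError(f"unsupported compression method {method}")

    if zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError("CRC-32 mismatch")
    return name, data
```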

Additional Note: Using WinZip with our GZipStream

Stephen Toub observed in an MSDN article that WinZip can't handle the output of our GZipStream because WinZip requires filename info in the gzip header. He created a NamedGZipStream implementation that generates files readable by WinZip:

https://msdn.microsoft.com/msdnmag/issues/05/10/NETMatters/

Our Future Compression Plans

We'd like to address the shortcomings of our compression library in future releases. The following items are our highest priority compression requests:

  • Support for more formats, such as the ones described above
  • Better compression ratio
  • Better compression speed

Are there any others you'd like us to address?

Comments

  • Anonymous
    May 16, 2007
    Support the new format ZIP files that allow >4GB (Both the new WinZip & PKWare formats) and AES Encryption. Support GZIP files >4GB (This would be a simple bug fix).  There should be no limit on how big a gzip file can be.

  • Anonymous
    May 16, 2007
    Other formats : Bzip2 format - patent free, better compression than zip & gzip. RAR Format

  • Anonymous
    May 16, 2007
    Please support LZMA which is the algorithm used in the 7z or 7-Zip format.  Its faster than zip with higher compression ratios.

  • Anonymous
    May 16, 2007
    I second the 4GB limit problem.  I bugged it at https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=94784 over a year ago... Stronger GZip compression would be nice, too... If you can do both of these, I won't have to use the open-source SharpZipLib anymore :) BZip2 support would be welcome.

  • Anonymous
    May 16, 2007
    Can't believe you didn't mention the horrible 4GB limitation!  This really needs fixing because at present solutions are getting developed in .NET that explode without warning in production when files get too big (I know from experience...)

  • Anonymous
    May 16, 2007
    I would love to see the RAR format implemented.

  • Anonymous
    May 16, 2007
    Fixing the limit and RAR support would be brilliant.

  • Anonymous
    May 17, 2007
    What about the self-extracting feature?

  • Anonymous
    May 17, 2007
    Why point people to Java when they can use #ZipLib?

  • Anonymous
    May 21, 2007
    Fastest compression on most types of data:  LZO/NRV (oberhumer.com).  QuickLZ is maybe even faster, but it's not as "proven" as LZO/NRV. These are open source, but there are commercial licenses available, and I'm sure the authors wouldn't mind their compression algorithms being included in the .NET BCL. :)

  • Anonymous
    May 22, 2007
    The 4GB issue is definitely top priority.

  • Anonymous
    May 26, 2007
    I'd like the support for using Stream to read/seek over ISO9660/UDF (cd images), RAR. Writing wouldn't matter so much because: Currently you need to install all kinds of drivers and stuff to deal with filesystem images and as seen from Month of Apple Bugs that's one area with a lot of potential for escalation exploits. With ability to easily deal with the images through .net and powershell in fully managed way you would have both security and ability to easily do processing over images in remote servers.

  • Anonymous
    May 26, 2007
    This was left out: The basis for the need is that with increase in HDD sizes, home storage servers and such becoming common, you'd want to get your content off from legacy medium but without potentially losing data in the process. Using filesystem images is ideal because they are fastest to create, preserve all data 1:1 for archival purposes and by utilizing the .NET Stream you can layer the streams such that you can read any format from the images/archives without copying it in the process. Ideally you don't have winzip or winrar or whatever, you just install support for format similar to a video Codec and it just works whether it's over the network or so on. This all needs to be managed and fully supported in shell!