SYSK 81: How compression works
In essence, file-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.
Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data.
For the following sentence:
"Ask not what your country can do for you -- ask what you can do for your country.",
the dictionary will be:
1 ask
2 what
3 your
4 country
5 can
6 do
7 for
8 you
and the compressed sentence will be:
"1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"
While the example above conveys the raw idea of how compression is done, in reality, the compression programs are not splitting source data into words, but rather look for patterns. To achieve good compression ration, not all patterns will become dictionary items. Instead of the dictionary above, it's could be:
1 ask__
2 what__
3 you
4 r__country
5 __can__do__for__you
and the compressed sentence spelled as:
"1not__2345__--__12354"
Now you can see why text files and computer program's source code compresses well (high rate of redundancy), and graphics and MP3 files do not (include a lot of unique information).
Some programs are particularly suited to picking up patterns in certain types of files. For example, XML compresses very well with gzip.
If the compression/decompression algorithm lets you recreate the original file exactly, it's called lossless compression. Lossy compression eliminates "unnecessary" bits of information. This type of compression is used a lot for reducing the file size of bitmap pictures. While large parts of the picture may look the same -- the whole sky is blue, for example -- most of the individual pixels are a little bit different.
Source: http://computer.howstuffworks.com/file-compression1.htm
Comments
- Anonymous
March 13, 2006
Note too, that many A/V and graphics formats - and MP3 as an audio format in particular (as with most MPEG derived formats) don't compress well because they have already been compressed as a part of their file format definition. The same reason why a CAB or ZIP file doesn't compress well. You can rarely squeeze more out of a compressed file by compressing it again - even with a second algorithm. - Anonymous
March 23, 2006
Your blog is really useful because as an application developer, I come across so many concepts and technologies that its hard to know everything.. Your blog is very resourceful and your approach to explaining tech using simple definitions and examples is neat.. Thank you for doing this and keep up the great work.
I had a question about the compression.. so is it possible that one might end up with larger sized file if the text file contains unique words all along...
Thanks
- Ravi - Anonymous
March 23, 2006
Thanks for the kind words!
Yes, you're absolutely right! If a text file contains all or large amount of unique words, the compressed file will likely be larger than the original.