Huffman Coding with F#
I recall my sense of awe when I first wrote a simple compression application using Huffman Coding a few years ago for a school assignment. Compression is one of those things that just kind of feels like magic - you get to take something, and make it smaller without losing any information! It's done so often that it now seems commonplace - but, like so many other programming projects, there is something rewarding about writing your own implementation and seeing it actually work.
When I was in Cambridge a couple weeks ago visiting the Microsft Research-based part of the F# team, I stopped by one of the nice bookstores in Cambridge and ended up picking up a copy of Information Theory, Inference and Learning Algorithms by David MacKay - which has been a truly great read so far. When I got to the section on Huffman coding, it reminded me of how much I enjoyed implementing the algorithm a a few years back, that time in C. So I decided I should try again, this time in F#!
[ Note: If you are more interested in F# than in Huffman Coding, feel free to skip to the comments after the code below - where I'll talk about some of the reasons why F# works so well for this kind of application. ]
Compression with Huffman Coding
Wikipedia has a great description of Huffman Coding - so instead of trying to describe it myself, let me just quote from the Wikipedia topic:
Huffman Compression
"In computer science and information theory , Huffman coding is an entropy encodingalgorithm used for lossless data compression . The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol."
Basic Technique
"The technique works by creating a binary tree of nodes. These can be stored in a regular array , the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node . Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has n leaf nodes and n − 1 internal nodes."
A Huffman Coder in F#
open System
/// Huffman coding uses a binary tree whose leaves are the input symbols
/// and whose internal nodes are the combined expected frequency of all the
/// symbols beneath them.
type HuffmanTree =
| Leaf of char * float
| Node of float * HuffmanTree * HuffmanTree
/// Provides encoding and decoding for strings containing the given symbols and expected frequencies
type HuffmanCoder(symbols: seq<char>, frequencies : seq<float>) =
/// Builds a list of leafs for a huffman encoding tree from the input frequencies
let huffmanTreeLeafs =
Seq.zip symbols frequencies
|> Seq.toList
|> List.map Leaf
/// Utility function to get the frequency from a huffman encoding tree node
let frequency node =
match node with
| Leaf(_,p) -> p
| Node(p,_,_) -> p
/// Builds a huffman encoding tree from a list of root nodes, iterating until a unique root node
let rec buildCodeTree roots =
match roots |> List.sortBy frequency with
| [] -> failwith "Cannot build a Huffman Tree for no inputs"
| [node] -> node
| least::nextLeast::rest ->
let combinedFrequency = frequency least + frequency nextLeast
let newNode = Node(combinedFrequency, least, nextLeast)
buildCodeTree (newNode::rest)
let tree = buildCodeTree huffmanTreeLeafs
/// Builds a table of huffman codes for all the leafs in a huffman encoding tree
let huffmanCodeTable =
let rec huffmanCodes tree =
match tree with
| Leaf (c,_) -> [(c, [])]
| Node (_, left, right) ->
let leftCodes = huffmanCodes left |> List.map (fun (c, code) -> (c, true::code))
let rightCodes = huffmanCodes right |> List.map (fun (c, code) -> (c, false::code))
List.append leftCodes rightCodes
huffmanCodes tree
|> List.map (fun (c,code) -> (c,List.toArray code))
|> Map.ofList
/// Encodes a string using the huffman encoding table
let encode (str:string) =
let encodeChar c =
match huffmanCodeTable |> Map.tryFind c with
| Some bits -> bits
| None -> failwith "No frequency information provided for character '%A'" c
str.ToCharArray()
|> Array.map encodeChar
|> Array.concat
/// Decodes an array of bits into a string using the huffman encoding tree
let decode bits =
let rec decodeInner bitsLeft treeNode result =
match bitsLeft, treeNode with
| [] , Node (_,_,_) -> failwith "Bits provided did not form a complete word"
| [] , Leaf (c,_) -> (c:: result) |> List.rev |> List.toArray
| _ , Leaf (c,_) -> decodeInner bitsLeft tree (c::result)
| b::rest , Node (_,l,r) -> if b
then decodeInner rest l result
else decodeInner rest r result
let bitsList = Array.toList bits
new String (decodeInner bitsList tree [])
member coder.Encode source = encode source
member coder.Decode source = decode source
Pattern Matching
Pattern matching is one of the basic but very powerful features of F#. Using pattern matching allows code to be succinct while at the same time very clear about it's behaviour. Just about every function in the code above uses pattern matching - which is typical of a lot of F# code.
In simple cases, such as in "huffmanCodes" above, pattern matching makes it easy to switch on the possible cases of a union datastructure.
match tree with | Leaf (c,_) -> //... | Node (_, left, right) -> //...
In more complex cases like "decodeInner" above, pattern matching helps to guide your code. You list each shape of input that you know how to handle, and the leaf nodes of the pattern you matched indicate the data needed to define the behaviour of that case. Then the compiler kindly tells you which cases you haven't covered. When I wrote this function originally, I was missing the first case, and got this helpful compiler warning:
Warning: Incomplete pattern matches on this expression. The value '([],Node (_, _, _))' will not be matched
Which was exactly right - this particular input indicates that the user provided illegal inputs!
Pipelining
Pipelining is a nice way to declaratively describe a series of operations to perform on an input. It's philosophically just like pipelining in a command shell - take the input data from the left and pass it as input to the function on the right.
Because the F# libraries provide a good collection of primitive operations to perform on your data, it's easy to describe a transformation as a pipeline of data through a sequence of these operations. For example, you can filter, project, collapse, zip, and re-package data easily and declaratively.
Collections
The code uses the 4 common F#/.NET collections:
- F# "List" : Immutable linked list used for recursive algorithms which walk across a list
- F# "Map": Immutable dictionary used to store the codes for each symbols
- F# "Seq" = .NET "IEnumerable" : Primitive collection interface, used for the inputs
- .NET "Array" : Basic typed arrays used as the output of the encoding
Note that it is easy and common to switch between these collections as needed, using functions like "List.toArray" and "Map.ofList".
"It's a .NET component!" a.k.a. "The Slickness"
When I wrote this code, I started out in "experimentation" mode. I just wrote a bunch of small functions at the top level and used F# Interactive to execute them as I went. But once I had something that seemed to work, I wanted to wrap it up in a class that could encapsulate all the functionality, and provide a nice .NET interface. Here's what I did:
Tab everything in one level
Wrap the code in:
type HuffmanCoder(symbols : seq<char>, frequencies : seq<float>) = // All the code goes here... member coder.Encode source = encode source member coder.Decode source = decode source
I can't really understate how amazing this is. In F#, it is super-simple to smoothly transition from experimentation coding to component design coding. This is something that F# can manage because it is a hybrid functional/OO language intergated carefully into .NET. And because of this, it is really easy to build components of a larger .NET system in F#. I heard this referred to once as "the slickness" of functional OO coding in F#, and I like that term, because whenever I do this it just feels so slick. :-)
In case you are curious, here's what it looks like from C#:
public class HuffmanCoder
{
public HuffmanCoder(IEnumerable<char> symbols, IEnumerable<double> frequencies);
public string Decode(bool[] source);
public bool[] Encode(string source);
}
Summary
I found this code to be a good example of some of the things I really like about F#. The algorithmic nature of the code lends itself well to F#'s syntax and programming style. You can start off just writing a bunch of simple, composable functions, each just a half dozen lines of pattern matching and pipelining. As you are doing this, you can use the F# Interactive to experiment with the code and the algorithms as you evolve them into the right functionality. Then when they are working, you can smoothly wrap them up into a .NET component and expose it off to the rest of your .NET consumers.
Comments
Anonymous
May 05, 2008
HI, We are introducing our technical partner company to the world JSR Solutions. It's a Java based software development company. JSR(Java Specification Request) "Write Content Once, Deliver It To Any Device!". They know how to stand own their words. So of you looking for any company who works in JAVA then don't hesitate in contacting them.Anonymous
May 06, 2008
Luke Hoban has a very nice blog entry that shows an implementation of Huffman Coding with F# . This notAnonymous
May 07, 2008
Your story was featured in Drigg! Here is the link to vote it up and promote it: http://defun.ru/vsjakoedrugoe/Huffman_Coding_with_FAnonymous
May 08, 2008
I wonder how fast this implementation is compared to more standard procedural one in C#. I've heard that F#/ocaml compiles to very fast code, so it would be nice to see how this works out.Anonymous
May 29, 2008
please i need your help now .. i need the implementaion of huffman coding using c language...now>> with best regars!!!Anonymous
July 09, 2008
This years' installment of the ICFP Programming Contest is coming up this weekend. For those who haven'tAnonymous
September 07, 2008
system java program using linked listAnonymous
September 15, 2008
Microsoft F# September 2008 CTPAnonymous
December 23, 2008
Don Syme has announced that F# would ship as part of Visual Studio 2010 in his blog entry, F# to shipAnonymous
June 13, 2009
The comment has been removedAnonymous
June 15, 2009
Luis - Thanks for pointing that out, the code needs a slight change to compile with the most recent F# release. I'll update the code sample. The general change to align the various F# Library "sort" functions is also described in the release notes for the May F# CTP if you are interested: http://blogs.msdn.com/dsyme/archive/2009/05/20/detailed-release-notes-for-the-f-may-2009-ctp-update-and-visual-studio-2010-beta1-releases.aspx Luke