Encoding correctness in types
This article will discuss effective use of types to catch some common problems at compile time not normally found by the typechecker.
The typechecker is the most important tool in the programmer's arsenal. Types enable you to structure code in a more maintainable way and prevent entire classes of errors from ever occurring at runtime, many of which can be insidiously difficult to debug. If you've written anything extensive in assembly language, you can understand the frustration of dealing with misinterpretations of data leading to corruption of program state. Researchers like to say "well-typed programs never go wrong".
But what does "go wrong" mean? One answer is: "more than you think". By making effective use of the type system to encode invariants about your program, you can use the compiler to catch more errors at compile time that you currently catch at run time.
A simple example is SQL injection, where insertion of user-supplied data into a SQL statement string can enable attacks where the user modifies the nature of the statement with carefully chosen data. At first the solution seems simple: double your apostrophes to escape them, and then the value cannot be misinterpreted. This can be trickier than sounds, though, since in a complex program you might not be able to identify which values come from an untrusted source, directly or indirectly, which are safely produced from trusted sources, and which are untrusted data that has already been properly escaped. These three types of data travel from method to method, following a winding path that obscures their source. The compiler is of no help, since it sees them all as just strings.
The solution, called "tainting" in languages like Perl and Ruby, is to make them not "just strings", but to create two types of data: tainted data, which is unsanitized data derived from an untrusted source, and untainted data, which is either derived from a trusted source or already sanitized. For our example, suppose we're performing our queries with SqlDataReader. We create a wrapper class for it which has an identical interface except that wherever SqlDataReader takes a string, our wrapper takes a SafeSqlString. This new type is simply a wrapper for string that can only be generated in a few strictly limited ways:
- static SafeSqlString AssumeSafe(string): Used for trusted data, just wraps it
- static SafeSqlString Escape(string): Escapes apostrophes; used for untrusted data.
- SafeSqlString operator+(SafeSqlString): Concatenates two SafeSqlStrings
- static SafeSqlString Format(SafeSqlString format, params object[] args): Takes a safe format string and a list of arguments of safe types such as SafeSqlString, int, and double, and produces a new SafeSqlString.
What does this buy us? Now, as long as we use our new SqlDataReader wrapper for all database access, we are essentially immune to SQL injection: the only way it could sneak in is through AssumeSafe(), which is obviously dangerous and makes the code easy to review. As an added bonus, double escaping can never happen: the data only gets escaped once as it changes from string to SafeSqlString, and Escape() cannot be applied to SafeSqlString.
Because these types flow through the program, we can specify that a particular method always returns a SafeSqlString, or assumes that its arguments are SafeSqlStrings, or assumes that some private variable is always safe, and enforce this statically instead of by convention. This is important, because it allows us to use AssumeSafe() at a point where it's obvious that the string is safe, as opposed to five calls up the call stack where we really have no idea who else might be calling us.
What else can we encode in types? All sorts of things! Here are some more examples:
- We could have a type PositiveInt for positive integers, which will throw a runtime error if constructed with a nonpositive integer. Now, if a method receives a PositiveInt, it can safely pass it on to another method expecting a positive integer without any additional runtime checks. This trick works with any object where you want to enforce a complex runtime property on an object but only want to check it once (or only when it might change). In unmanaged code, it's particularly useful for checking buffer/string lengths.
- Some binary tree algorithms require that every node has zero or two children, never one. We can represent this using a subclass hierarchy, where the abstract base class is TreeNode and the subclasses are Leaf and InternalNode. Leaf contains no data at all, while InternalNode contains a piece of data and two (non-null) references to TreeNodes.
- There are many other properties of data structures that can be encoded in types. For example, in a red-black tree, red nodes cannot have red children. We can enforce this statically by creating a subclass hierarchy where BlackNode objects contain references to the abstract base class Node, while RedNode objects contain only references to BlackNode objects. Similarly, the root must be black, so we just use a BlackNode reference for the root.
- If you're writing a sorting algorithm, it might not be clear that the array you get back contains a rearrangement of the same elements as the array you passed in. There's no quick way to check this, even at runtime. On the other hand, if instead of an array you pass in an object whose only valid non-read-only operation is to exchange two elements, then this becomes statically obvious.
In short, wherever you find yourself wanting to keep track of a static property as it flows through a program, types are a great way of doing so - use them to catch errors sooner.
In programming languages with more sophisticated type systems, you can encode even more information in types. Taking this to the extreme are dependent types, a relatively new system of types based on values that are extremely difficult to typecheck but so powerful that they can express many correctness properties of real programs. A number of research languages such as Dependent ML and Epigram already implement dependent types. With the power of systems like these, maybe in a few years the primitive C-based type systems of C# and C++ will look a lot like assembly language does now. But before we can take advantage of new technology, let's start making better use of what we have - encode your program's properties using types.