SQL Server: Pragmatic Denormalisation
Many courses cover normalisation of database design very well – and quite rightly so. Rather less coverage is given to Denormalisation.
This article is primarily aimed at the developer who is less experienced with practical database design. A secondary audience is those database developers who have simply never encountered a situation where a normalised design significantly impacted performance.
An assumption here is that the reader understands what Database Normalisation is – you can read this article for an overview.
Introduction
What is Denormalisation?
The process of introducing redundancy back into your perfectly normalised design is called Denormalisation. The designer adds a field or table holding summary or default data to be used in some circumstance.
You might be thinking:
"Wait a minute, I just spent a lot of effort working out how to get rid of redundancy. My design is perfect. Why on earth do I want to put any redundancy back in?"
The short answer is to improve performance. Being the short answer, it ignores some other reasons, though.
Let's consider some approaches and use cases.
Default Value
Imagine a CRM with a Customers table that links 1:M to a ContactsTelephone table via CustomerId, a foreign key back to Customers. ContactsTelephone is a list of names and phone numbers, together with country and business hours, plus a ContactType field indicating the primary contact. The full logic for working out which number to call is to consider the country the contact is in and the time you intend to phone, with a preference for the primary contact. Most of the time, though, you are in the same timezone, and a printed report will be viewed at some other time anyway. Adding a TelephoneNumber column to the Customers table makes it easy to decide which number to use. The database designer can add a trigger to ContactsTelephone which updates TelephoneNumber should the primary contact change.
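As a hedged sketch of how that might look, assuming the simplified table definitions below (any column names beyond those mentioned above are illustrative):

-- Simplified, illustrative schema
CREATE TABLE dbo.Customers
(
    CustomerId      INT IDENTITY PRIMARY KEY,
    CustomerName    NVARCHAR(100) NOT NULL,
    TelephoneNumber NVARCHAR(30)  NULL      -- denormalised default number
);

CREATE TABLE dbo.ContactsTelephone
(
    ContactId       INT IDENTITY PRIMARY KEY,
    CustomerId      INT NOT NULL REFERENCES dbo.Customers (CustomerId),
    ContactName     NVARCHAR(100) NOT NULL,
    TelephoneNumber NVARCHAR(30)  NOT NULL,
    Country         NVARCHAR(50)  NULL,
    BusinessHours   NVARCHAR(50)  NULL,
    ContactType     NVARCHAR(20)  NOT NULL  -- e.g. 'Primary'
);
GO

-- Keep Customers.TelephoneNumber in step with the primary contact
CREATE TRIGGER dbo.trg_ContactsTelephone_SyncPrimary
ON dbo.ContactsTelephone
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE c
    SET    c.TelephoneNumber = ct.TelephoneNumber
    FROM   dbo.Customers AS c
    JOIN   dbo.ContactsTelephone AS ct
           ON  ct.CustomerId  = c.CustomerId
           AND ct.ContactType = 'Primary'
    WHERE  c.CustomerId IN (SELECT CustomerId FROM inserted
                            UNION
                            SELECT CustomerId FROM deleted);
END;
GO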
Adding Logic
In this situation the extra data is not actually redundant because there is some business rule which specifies which piece of data is best.
Imagine that same CRM is used in a sales organisation. There is an Account Manager who has responsibility for a given Customer. He/She builds relationships with key staff in Customer organisations, and some of the primary contact numbers are private mobile phones. Their managers wish to see who they are using as a primary contact. If an Account Manager leaves, the organisation wants that data recorded. If they have to delegate authority temporarily for some reason, then that person may need access to the data. Everyone else should use the Customer switchboard number, so they avoid upsetting an important contact by disturbing their dinner.
In this situation, security can be set so that everyone else just gets to see the Customers table, whilst the Account Manager and his superiors can see ContactsTelephone. Instead of having a trigger to update the TelephoneNumber field, the Account Manager decides which number to enter manually on a case-by-case basis. The business logic is in his head and based on his personal knowledge of any given Customer.
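A rough sketch of how that security split might be set up, assuming two hypothetical database roles, GeneralUsers and AccountManagers:

-- Hypothetical roles; names are illustrative
CREATE ROLE GeneralUsers;
CREATE ROLE AccountManagers;

-- Everyone can read the denormalised number on Customers
GRANT SELECT ON dbo.Customers TO GeneralUsers;
GRANT SELECT ON dbo.Customers TO AccountManagers;

-- Only Account Managers (and their superiors) see the detailed contact numbers
GRANT SELECT, UPDATE ON dbo.ContactsTelephone TO AccountManagers;
DENY  SELECT ON dbo.ContactsTelephone TO GeneralUsers;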
Performance
Minimising Joins
Our Customer-to-telephone-numbers example breaks down somewhat, but we can see there are some complications involved in working out which number to use. The more logic there is and the more joins in a query, the slower it will run.
Maybe we have a database with many records and this contact lookup is just one of a set of complicated joins. Our database is used by an online transactional system and there can be many queries running at once. Maybe this is multi-tenancy and 10,000 concurrent users are looking up numbers or something. OK, this is possibly pushing the example, but we can imagine some similar situation where there is a complex lookup and many joins and we want all the performance we can get. Minimising joins will improve performance, and obviating a join which has some time comparison will improve performance again.
In this situation some redundancy can solve your problems.
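To make the contrast concrete, here is a hedged comparison using the illustrative tables from earlier: the first query joins and applies contact-selection logic on every lookup, while the second is a single-table read of the denormalised column.

DECLARE @CustomerId INT = 42;   -- illustrative value

-- Normalised: join plus selection logic on every lookup
SELECT c.CustomerName, ct.TelephoneNumber
FROM   dbo.Customers AS c
JOIN   dbo.ContactsTelephone AS ct
       ON  ct.CustomerId  = c.CustomerId
       AND ct.ContactType = 'Primary'
WHERE  c.CustomerId = @CustomerId;

-- Denormalised: a single-table read
SELECT CustomerName, TelephoneNumber
FROM   dbo.Customers
WHERE  CustomerId = @CustomerId;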
Avoiding Correlated Subqueries
For an OLTP system, a complicated correlated subquery can be performance death over large sets of data. That query to find the latest record with the lowest something-or-other between variable dates, which ran fine for you with 30 test records, can dim the lights on a server with 300,000 records. Find a way to add a key or value instead, and run the expensive query separately, overnight if possible.
In this instance, a useful technique is to add a foreign key which gives a direct join to the ultimate record a Correlated Subquery would return.
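As a sketch of the idea, assuming a hypothetical Orders table and a LatestOrderId column added to Customers (maintained by a trigger or an overnight job):

-- Correlated subquery: re-evaluated for every outer row
SELECT c.CustomerId, o.OrderId, o.OrderDate, o.TotalValue
FROM   dbo.Customers AS c
JOIN   dbo.Orders    AS o
       ON o.CustomerId = c.CustomerId
WHERE  o.OrderDate = (SELECT MAX(OrderDate)
                      FROM   dbo.Orders
                      WHERE  CustomerId = c.CustomerId);

-- Denormalised: LatestOrderId gives a direct join to the same record
SELECT c.CustomerId, o.OrderId, o.OrderDate, o.TotalValue
FROM   dbo.Customers AS c
JOIN   dbo.Orders    AS o
       ON o.OrderId = c.LatestOrderId;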
Pre Aggregation
It is a very common requirement to see the last month’s numbers for something or other. Once the accounts are done, those sales figures aren’t going to change. Another fairly common situation is where slightly stale data is “good enough”. Maybe stock levels are reviewed across multiple sites and each site’s level is somewhat approximate due to wastage/loss. Demand is such that you replenish occasionally, with lead times measured in weeks. You get the sales figures as shops sell product, but warehouses only report shipments overnight.
It might be pointless re-calculating something for every query because those numbers aren’t going to change enough to be significant.
Stock levels as they were yesterday might not be exactly correct but they are “good enough” for most purposes and the data can be rolled up by a batch job overnight into the one (redundant) StockSummary table.
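A minimal sketch of that overnight roll-up, assuming a hypothetical StockMovements table feeding the StockSummary table (column names are illustrative):

-- Overnight batch job: roll movements up into the redundant summary table
TRUNCATE TABLE dbo.StockSummary;

INSERT INTO dbo.StockSummary (SiteId, ProductId, StockLevel, SummaryDate)
SELECT SiteId,
       ProductId,
       SUM(Quantity),
       CAST(GETDATE() AS date)
FROM   dbo.StockMovements
GROUP BY SiteId, ProductId;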
Reducing Skill Threshold
Storing aggregated results in a table can make complicated data manipulation far easier. Writing aggregating queries can be difficult and error prone. Not all organisations can afford a database specialist with advanced skills.
SQL is set-based, so if a query goes wrong there is no loop in which the developer can place a breakpoint to investigate the issue.
A series of simple queries can often be constructed which handles an aggregation and writes to a table. This can be an in-memory or temporary table, or it could be a regular table designed to hold the results of an aggregation. This approach requires fewer advanced SQL skills, and the result from each step can be examined separately, thus narrowing down where any problem lies.
This approach is particularly useful to the beginner or journeyman developer where a batch-based solution will work.
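A hedged illustration of that stepwise style, reusing the hypothetical Orders table from earlier and writing the final figures to an assumed CustomerSalesSummary table; each intermediate result can be inspected before the next step runs.

-- Step 1: pull the rows of interest into a temporary table and inspect them
SELECT OrderId, CustomerId, OrderDate, TotalValue
INTO   #RecentOrders
FROM   dbo.Orders
WHERE  OrderDate >= DATEADD(MONTH, -1, GETDATE());

SELECT TOP (10) * FROM #RecentOrders;   -- sanity check

-- Step 2: aggregate the temporary table and inspect again
SELECT CustomerId, SUM(TotalValue) AS MonthlySales
INTO   #MonthlySales
FROM   #RecentOrders
GROUP BY CustomerId;

SELECT TOP (10) * FROM #MonthlySales;   -- sanity check

-- Step 3: write the final result to a regular summary table
INSERT INTO dbo.CustomerSalesSummary (CustomerId, MonthlySales, SummaryDate)
SELECT CustomerId, MonthlySales, CAST(GETDATE() AS date)
FROM   #MonthlySales;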
Perhaps an alternative is worth mentioning, although it is somewhat off topic: if batch-based processing is not an option, then nesting queries might be useful.
The technique is not something the expert will totally approve of, but the developer can sometimes start with a simple select manipulating a set of data. This can be tested and proven. That select can then be nested within another and tested… and so on.
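As a hedged sketch of the nesting approach, again using the hypothetical Orders table: the inner select is written and proven first, then wrapped in an outer query that aggregates it.

-- Inner select, proven on its own first, then nested as a derived table
SELECT CustomerId, SUM(TotalValue) AS MonthlySales
FROM (
    SELECT CustomerId, TotalValue
    FROM   dbo.Orders
    WHERE  OrderDate >= DATEADD(MONTH, -1, GETDATE())
) AS LastMonth
GROUP BY CustomerId;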
Data Redundancy vs. Denormalisation
There are several techniques which aren’t technically Denormalisation but deserve mention as approaches the Architect / developer should consider.
Data Warehouse/Mart
This is another database, which might be on another machine, where data can be rolled up overnight or even log-shipped in near real time. It is often preferable to offload work from the OLTP database server to another. Even if this is “just” SQL Server Express, the server can be linked to and queries can span both databases. Linked servers are obviously going to produce slower queries and increase load, so the effects of this should be carefully considered and tested under load before going live. This approach is often more suited to overnight batch processing than OLTP.
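For illustration, a query spanning both servers might look something like this, assuming a linked server named REPORTSRV and a reporting database named ReportingDB (both names are hypothetical):

-- Four-part naming lets one query span the OLTP database and the linked reporting server
SELECT c.CustomerId,
       c.CustomerName,
       s.MonthlySales
FROM   dbo.Customers AS c
JOIN   REPORTSRV.ReportingDB.dbo.CustomerSalesSummary AS s
       ON s.CustomerId = c.CustomerId;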
Caching
Business Server Cache
Database server capacity is a key resource. For some systems, the latency involved in database reads and/or writes can be a problem.
Business servers are relatively cheap and can hold data in an in-memory cache, perhaps as an In-Memory Database (IMDB), acting as another tier between the database and data requests. For some systems, caching data in this way is necessary for performance; the likes of Facebook and mobile phone middle-tier systems routinely use this technique.
Web systems whose requirements aren’t quite so demanding can benefit from caching data on an intermediate server and this is particularly useful for static or relatively static data.
Local Cache
Many business systems have lists of data which rarely changes. That list of company regions might never change at all. Countries and Cities are fairly static and nobody invents new colours.
Windows business applications can cache copies of data in a local database or serialise to disk.
A web application can use cookies or the HTML5 offline cache to store data.
Mobile applications can cache in local files.
Local caches can dramatically improve the performance of a system by obviating network latency. Even slowly changing data can be cached. A list of change sets with timestamps can be stored in a table, written by triggers. An application can then call a service to query this and find which data is stale, thus only reading from the server when a change makes it necessary.
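A rough sketch of that change-set table and trigger, assuming a hypothetical ChangeLog table and an assumed Regions table as the slowly changing data:

-- Hypothetical change-tracking table
CREATE TABLE dbo.ChangeLog
(
    TableName SYSNAME   NOT NULL,
    ChangedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Record a change set whenever the cached list is modified
CREATE TRIGGER dbo.trg_Regions_LogChange
ON dbo.Regions
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.ChangeLog (TableName) VALUES (N'Regions');
END;
GO

-- The service the application calls can simply compare timestamps
SELECT TableName, MAX(ChangedAt) AS LastChanged
FROM   dbo.ChangeLog
GROUP BY TableName;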
A more sophisticated pattern would be for the server to notify clients as changes are made, using SignalR or some other push notification mechanism.