Uniqueidentifier and Clustered Indexes
[This article was contributed by the SQL Azure team.]
I love GUIDs -- the uniqueidentifier data type in SQL Server. I use them for everything, domain names, unique error message, and for primary keys in my database tables, etc… -- don’t get me started. About two years ago, I started using uniqueidentifier for primary keys because of their unique ability to merge.
Merging
When you create a GUID in SQL Server using the NewId() command you are guaranteed that it will be unique across the whole universe. Which means if you have two databases (with the same schema) completely disconnected adding rows to the same table, using a primary key of uniqueidentifier will ensure that they primary keys don’t conflict. In comparison, if the two databases had an identity integer column as the primary key for their tables, they would be very likely to insert the same primary key in both tables.
What does this have to do with merging? It is ridiculously easy to merge the tables of our disconnected databases with each other if they are using uniqueidentifier as their primary keys. On the other hand, it is much harder to resolve conflicts when merging between the identity integer primary keys and updating the foreign key references to those primary keys.
Clustering and uniqueidentifier
It isn’t a good idea to create a clustered index on a uniqueidentifier column and generate your GUIDs with NEWID(). The reason for this is that NEWID() generates GUIDs in non-sequential order and SQL Server orders a clustered index sequentially. It will work – SQL Server will let you build a clustered index around a uniqueidentifier column, however it will cause the SQL Server to do unnecessary work and cause performance slowdowns. The reason for this is that to insert data into the middle of a clustered index (out of sequential order) causes SQL Server to make room for the data by rearranging the cluster.
So if it isn’t a good idea then why do people do it? Well, in SQL Server, if I assign a column as the primary key in SQL Server Management Studio it automatically generates a clustered index for you regardless of the data type of that column. Therefore, if you want a table with a uniqueidentifier data type as a primary key you need to change that index to a non-clustered index.
Non-clustered indexes don’t reorder the data as rows are inserted to the table, so they don’t have the performance impact of a clustered index on inserts of non-sequential data.
Along comes SQL Azure
If you love uniqueidentifier data types for primary keys like I love them and you are creating non-clustered indexes on your primary keys then you need to pick a clustered index for the table also. The reason is that SQL Azure requires one (and only one) clustered index on all tables. However, just because SQL Azure requires a clustered index doesn’t mean it should be the primary key column.
Picking a Clustered Index
There are a couple of strategies for picking your clustered indexes; one of the easiest and best is to add another column of data type datetime and use that column for your clustered index. Here is what you need to do:
1. Add the column as data type datetime
2. I usually call it Date
3. Set the Default Value to GetDate().
4. Make it non-null.
5. Create your clustered index on it before you insert data into you.
Adding a default value of GetDate() to the column writes the date and time that the row was inserted into the column automatically. This insures that the data for the row is inserted at the end of the table data – there is no rearranging of the cluster. Adding the data to the end ensures the best performance for inserts.
Another good choice for a clustered index is a column that reflects the ordering of the table in the majority of the select statements. For example, if you have a table called categories and you have an integer column called ordervalue and you always call the table with an SELECT … ORDER BY [ordervalue] then making ordervalue the clustered index makes sense. Here is why: even though it hinders performance to insert a non-sequential ordervalue into the cluster you will get a performance benefit when you call your data, since the rows will be read sequential from the cluster, that it depending on your workload characteristics.
NEWSEQUENTIALID()
I would be remise if I didn’t mention the NEWSEQUENTIALID() function which not supported in SQL Azure. If the NEWID() function generates unique non-sequential uniqueidentifier than NEWSEQUENTIALID() function generates unique sequential uniqueidentifier. The only trick to NEWSEQUENTIALID() function is that the GUIDs are generated partial based on the network card of the computer.
This means that you can successfully have a uniqueidentifier as a primary key column and use that primary key column the required clustered index for SQL Server. As long as you use the NEWSEQUENTIALID() function column to fill that column.
If privacy is a concern, do not use this function. It is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID. For example if you are passing the primary key to the table in a query string on a web browser. See MSDN for more information.
However, NEWSEQUENTIALID() function isn’t support for SQL Azure. If you try to use it you will get this error:
Msg 40511, Level 15, State 1, Line 1
Built-in function 'NEWSEQUENTIALID' is not supported in this version of SQL Server.
Migration
If you are migrating an existing SQL Server database to SQL Azure you need to do these things to work successfully with uniqueidentifier data type as your primary key.
- Convert the generation of GUIDs from NEWSEQUENTIALID() to NEWID()
- Remove all clustered indexes from uniqueidentifier data type columns – this will not prevent you from migrating, however it will give you better performance.
- Pick a column (or add a column) to build your clustered index around (see the recommendations above).
Do you have questions, concerns, comments? Post them below and we will try to address them.