Distributed Partitioned Views / Federated Databases: Lessons Learned
Introduction
This article contains information about the things we have learned while working with Federated Databases. Before beginning it is necessary to define the terms being used. Included in this article is one solution in production that is using federated servers and Distributed Partitioned Views.
Definitions
Definition 1: Local Partitioned View – A single table is horizontally split into multiple tables, usually all have the same structure.
Example 1:
Original table is named FACT.
New tables are named FACT2005, FACT2006, and FACT2007. (I chose to partition by date, you can choose anything else that makes sense for your application, like branch office).
CREATE VIEW FACT AS
SELECT <select list> FROM dbo.FACT2005
UNION ALL
SELECT <select list> FROM dbo.FACT2006
UNION ALL
SELECT <select list> FROM dbo.FACT2007
You will notice that the view is created with the same name as the original table. This way the applications don’t have to change.
You also have to create a check constraint on the same column in each table to create a partitioning “key”. Read more about this and other very important restrictions in Books on Line at: ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/tsqlref9/html/aecc2f73-2ab5-4db9-b1e6-2f9e3c601fb9.htm. There are several good places in Books on Line to read about partitioned views; this link will give you a launching point for the other ones.
Note: Local partitioned views are popular in both OLTP and data warehouse projects. We rarely hear about problems with these, but there are known problems with very complex queries when the optimizer does not do partition elimination. I will not spend much time on local partitioned views in this article as I would like to concentrate on Distributed Partitioned Views.
Definition 2: Cross Database Partitioned View – tables are split among different databases on the same server instance. Notice the three part name using the database in Example 2 below.
Example 2:
CREATE VIEW FACT AS
SELECT <select list> FROM DB2005.dbo.FACT2005
UNION ALL
SELECT <select list> FROM DB2006.dbo.FACT2006
UNION ALL
SELECT <select list> FROM DB2007.dbo.FACT2007
Note: The most frequent question I get from people attempting a view like this that has tables in multiple databases on the same instance is about joins. You don’t lose too much performance with cross database joins. This is something to think about because you will normally join this view to some reference tables for the application. If you carry the reference tables in a database called COMMON for example, then you will most likely see something like SELECT * FROM dbo.FACT JOIN COMMON.dbo.Customer ON ….. WHERE …
You will also notice in this example that I put each fact table in its own database to facilitate easier scale out across servers or instances if you need to do this at a later time.
Definition 3: Distributed (across server or instance) Partitioned View. Tables participating in the view reside in different databases which reside on different servers or different instances. Note the four part name which includes the actual server name (or the cluster name if this is in a Windows Failover Cluster).
Example 3:
CREATE VIEW FACT AS
SELECT <select list> FROM DB2005.dbo.FACT2005
UNION ALL
SELECT <select list> FROM server2.DB2006.dbo.FACT2006
UNION ALL
SELECT <select list> FROM server3.DB2007.dbo.FACT2007
You will notice that the server name is missing from the first server. This view definition in Example 3 exists on server1. You cannot use a linked server to refer to the local server. You might immediately recognize a potential problem and try to create the view in Example 4 on server 2. However, the trick is to change the linked server definitions so that the same view code in Example 3 is deployed to every server.
BAD Example 4: (don’t do this)
CREATE VIEW FACT AS
SELECT <select list> FROM server1.DB2005.dbo.FACT2005
UNION ALL
SELECT <select list> FROM DB2006.dbo.FACT2006
UNION ALL
SELECT <select list> FROM server3.DB2007.dbo.FACT2007
Instead of doing the BAD Example 4, use the style in Example 3 and change the linked server definitions on every server.
Infrastructure architecture for one DPV project
Here is one production project to be used as a reference for DPV. The customer had 3,000 branch offices to automate and determined that the total size representing 3 months worth of data was going to be approximately 3.6TB. They didn’t want to risk putting this all on one server, so their basic approach was to break this into manageable size pieces. There are 6 servers in 2 geographically separated data centers. Each server has 2 instances so that if one server gets too busy they can easily move the second instance to another server. The data is spread out over 12 instances. The servers are 8 socket dual core machines with 16GB RAM. Each instance is responsible for approximately 300GB. Growth is expected to double soon which is why they selected machines this big.
Another reason they bought bigger machines than they need for a normal load is to have a high availability strategy. The three machines in each data center are in one cluster and if one machine goes down another machine can pick up the load. If two machines go down, one machine will do the work of all 3 and they expect that performance will be slow until the problem is fixed. If an entire data center goes down there is no solution in place yet – this is a later phase of the project.
The most important point in the success of this project and makes this project work so well is that they are not using load balancing. The users at each branch are connected directly to a server that contains their data. So even though most of the inserts, updates and deletes are done through the partitioned views, the work is mostly local to one server. There are some corporate users issuing queries that need data from multiple instances and it is expected that most of these queries will touch multiple servers.
Reasons for Distributing:
The main reason for distributing the 3.6TB over 12 servers was to break the data into manageable size pieces. This makes the hard maintenance problem in a VLDB much easier to deal with; i.e. backups, index and statistics maintenance.
If you use table partitioning as well as the partitioned views, then you have even smaller pieces to manage for most of your administrative processes.
WARNING:
Although some federated server projects are successful, we have found several problems during these projects. Fortunately we found solutions for each problem. The warning is that we don’t know if we have uncovered all the problems. Your workload may be different enough that other problems will be discovered. So test, test, test if you decide to distribute a single database over multiple servers.
Two main problem patterns
The two main problems we have found so far are:
1) When a command gets sent to every server when you think it should only go to one server. This happens when the query optimizer thinks it has to check the schema on every server, as in the case when the same collations are not used on all the servers (see note below in the lessons learned section).
2) When cross-server join copies the records from the remote server to the local server and then performs the join. This is called a non-remotable query. The optimizer is pretty good at copying the smaller table (or result set) to the right server before performing the join. Still, it is a situation that should be avoided in order to get the most consistent performance. Try to make all the joins happen on one server (either all on the remote or all on the local) without copying records across the network. See the notes below on advice on how to avoid this.
Here is a list of the things we have learned to do and not do on this project.
Lessons Learned on Distributed Partitioned Views: (multiple servers involved)
- Follow the guidelines in Books On Line very carefully. There are many links from this main one. I recommend reading and re-reading these until you know the subject very well before you start. ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/udb9/html/6e44b9c2-035e-4c88-907f-eef880c5540e.htm. If you are on-line and can get to this site: https://msdn2.microsoft.com/en-us/library/ms175129(SQL.90).aspx, read it and any of the links it has on this page.
- Avoid cross server joins whenever possible. One solution is to replicate all your reference/dimension tables to every server. In a cross server join, the necessary records are copied from the remote server to the local server, then the join is performed.
- Use the same collations in all databases. Otherwise the startup filters are not applied and the queries are always sent to servers with different collation.
- Use the same session settings in all connections. Otherwise startup filters are not applied and the queries are always sent to servers with different session settings.
- Create an index with the partitioned column as the leading column on the index because most of your queries will contain the partitioned column in the where clause. The optimizer uses this index and the associated statistics to do more efficient queries.
- Plan guides do not work for distributed queries.
- After restarting SQL Server, create some startup stored procedures that will run the queries that you need. Otherwise the first user running each query will pay a heavier penalty because it has to touch every server. These startup procedures will also create a local connection pool to each server. Even though creating a connection is fast it will still be better if the first user doesn’t have to wait for this too.
- Don’t forget to do the tip in Books-On-Line: Turn on Lazy Schema Validation. This will give you better performance. It helps to avoid
Comments
Anonymous
June 21, 2007
The Microsoft SQL Server Development Customer Advisory Team has a nice blog post about Distributed PartitionedAnonymous
June 27, 2007
Great overview/summary..thanks!Anonymous
August 28, 2007
Good article. I have a question on Definition 3, around the GOOD and BAD examples for the partitioned view. If server1 hosts DB2005 locally, how will the same view work on server2? I can see how changing the linked server definition gives you indirection for the non-local databases, but what happens to the local database? What am I missing here? The BAD example looks to me to be the 'correct' (but inelegant) solution. Thanks EwanAnonymous
January 03, 2008
great article. I'm also having difficulty implementing definition 3 above: On Server1: when you create a linked server in sql server 2005, in the "New Linked Server" window / general Tab, If I use an alias like for example "Server2", and then choose "Other data source", and then I put in the server ip in the data source, it gets added, but I cannot run queries against it, I get the following error: OLE DB provider "SQLNCLI" for linked server "SERVER2" returned message "Login timeout expired". OLE DB provider "SQLNCLI" for linked server "SERVER2" returned message "An error has occurred while establishing a connection to the server. When connecting to SQL Server 2005, this failure may be caused by the fact that under the default settings SQL Server does not allow remote connections.". but when I use the option "Sql Server" in the Sql type, it works but I cannot use the alias "Server2" in that case, hence I will not be able to do it like in definition 3.Anonymous
January 14, 2008
Good article. May you post more articles about this theme. More detail would be welcome. regardsAnonymous
January 28, 2008
Good article. We have VLDB on SQL2000. Our primary table contains slightly less than 1B records, and I am anticipating 450M records per year. Thus we have implemented distributed view on 2000 across 6 clusters, and it works reasonably well. As we are migrating to SQL2005, I am not certain of the maturity of partitioning to migrate to that architecture, as well as having everything on a single server. My first reaction is to architect a hybrid solution, where I would use partitioning on the local servers, and then have a distributed view on top of all the partitions. I have concern implementing the above as MS is saying that distributed view is going to be removed from future releases, but I doubt that partitioning would be able to handle our data volume. Any thoughts would be appreciated.Anonymous
January 28, 2008
Distributed view is not going to be removed from future releases, at least not any time soon. Table Partitioning would handle your data volume as 1B records is not very much in today's world. Of course that depends on what type of hardware you are running on. But if you choose table partitioning you may be able to minimize the impact by marking some of the partitions with older data as read-only. Your hybrid solution could also work well because you can use table partitioning on the local servers as well as use a distributed view on top of all the tables. The Optimizer is very good in this case to only touch the tables and partitions it needs if you supply all the partitioning keys in the queries.Anonymous
January 29, 2008
If identities are not allowed in distributed views. May you help me underestand how do you handle non significant pk fields? do you use a separated table with a counter per table? Where can I get an detailed example? ThanksAnonymous
January 30, 2008
A GUID is the most popular solution instead of using Identities when you want to create a partitioned view.Anonymous
January 09, 2011
Nice to see an article on this older strategy. I will say GUIDs are probably the worst solution when it comes to Clustered Index fragmentation due to their very nature.Anonymous
January 09, 2011
Nice to see an article on this older strategy. I will say GUIDs are probably the worst solution when it comes to Clustered Index fragmentation due to their very nature.Anonymous
January 09, 2011
Nice to see an article on this older strategy. I will say GUIDs are probably the worst solution when it comes to Clustered Index fragmentation due to their very nature.Anonymous
January 09, 2011
I said GUIDS are the most popular solution for DPV, not necessarily the most efficient. Although I like them because of their simplicity. And using a FILL FACTOR of 60 - 70 is a good solution for index maintenance jobs if you are using GUIDS. This minimizes the page splits caused by frequent inserts into the middle of the table/indexes.Anonymous
March 12, 2011
The comment has been removedAnonymous
January 08, 2015
great article, this is also a great guide about using a SQL Server distributed partitioned view to allow transparent querying of multiple Azure SQL databasesp sqlturbo.com/tutorial-how-to-query-multiple-azure-sql-databases-transparently-from-one-on-premise-sql-server