Compartir a través de


Managing data motion during your deployments (Part 1)

Continuing with the VSTS: DB Best Practices post series, let’s take a look at managing data motion in your deployments. This will be Part 1 of a 2 part post. In the first post we will take a look at managing data motion necessary to make table schema modifications while hanging on to your existing data. In Part 2 we will look at organizing your data motion scripts and managing data motion for projects where you may be maintaining multiple versions of a database schema.

 

There will be times when you deploy changes to your database that you will need to manage data motion to complete your deployment successfully. For simple table changes VSTSDB handles the data motion for you. An example of a simple change is when you add a new null-able column in the middle of a table. When you deploy your database project against a database, a table with the modified schema is created in the database (CREATE), the data rows are copied from the old table to the new table (COPY), the old table is deleted from the database (DROP), and finally the table is renamed back to the original name (RENAME). This is called the CREATE-COPY-DROP-RENAME (CCDR) strategy. Gert has a post from some time ago that reviews this strategy and why it is used. (I would also suggest the same post for folks needing a refresher as to how and why you use VIEWS to abstract tables.) The CCDR strategy is used when no other option is available to add a new column to the table. In SQL Server, the only column addition that can be made to an existing table without recreating the table is to add a column to the end of a table that is either NULLABLE, has a DEFAULT, or is of an IDENTITY or a TIMESTAMP column type. While this covers the need to add a column tables in many scenarios, there are just as many or more that it does not.

 

I don’t know about you, but back when I designed databases for a living the order of columns within a table was important to me. It wouldn’t keep me up at night, but as I was developing an application I would never let the hurdle of reloading a table discourage me from reordering the columns within a table into nice sequence consistent across the rest of the database. Something about having the primary key be first column and mod fields the last few columns gives you a warm and fuzzy that all is well in the universe.

 

There are schema changes where data motion is the primary purpose of the change, for example, when a table is normalized resulting in the columns being split into 2 or more new tables. While you want to minimize data motion on a database in “production” status as it can be risky and requires substantial resources to move large sets of data around, it tends to be inevitable process of database management. Due to the risk and complexity of data motion, well tested custom scripts is what is needed to complete the process with a high degree of confidence.

 

We often receive questions about how this can be done using VSTSDB. Customers often believe they have missed an elusive option hidden somewhere in the project system. The fact of the matter is that data motion is challenging to automate and comprehend the intention of the user. When a new non null-able column is added to a table there is often a motive for this change that entails specific logic or external data to populate the new values. This is best managed by the developer/DBA at the time the change is executed against the database.

 

With VSTSDB you manage data motion in the pre and post deployment scripts of your project. As you continue to use VSTSDB you will find the pre and post deployment scripts as indispensible to execute custom actions during the deployment process and data motion is a perfect example. To manage data motion during deployment you add scripts to the pre deployment script to temporarily persist the data of the table that will be recreated into a temp table, table variable, or table in the database. This allows the schema modifications to be made against the target table during execution of the deployment plan without losing any data. In the post deployment script you put the data back into the table after the schema modifications are complete and perform any necessary cleanup. That’s it in a nutshell. Sounds simple, but there a handful of things to consider and each are unique from database to database.

 

Before we get into an example of data motion, we should talk about the preventative measure enabled by default that protects your database against data loss during deployment. The measure is a deployment option called the “Block incremental deployment if data loss might occur” and can be found in the Database.sqldeployment properties page as shown below.

 

 

This option will block changes made in deployment that would cause a possible data loss based on data motion, not necessarily schema changes. That’s right, it does not block all data loss as some data loss is intentional. To illustrate this consider a changes made to our Customer.Customers table discussed in the last post <link>. If you were to change the length of the Email column of the Customer.Customers table from 320 to 100 and the option is enable the change would be blocked in deployment. You will get an error message like:

X Rows were detected. The schema update is terminating because data loss might occur

========== Deploy: 0 succeeded, 1 failed, 0 skipped ==========

If you disable the option and deploy again you will get:

The type for column Email in table [Customer].[Customers] is currently NVARCHAR (320) NOT NULL but is being changed to NVARCHAR (100) NOT NULL. Data loss could occur

    Altering Customer.Customers...

========== Deploy: 1 succeeded, 0 failed, 0 skipped ==========

 

If you enable the option again and remove the Email column all together and deploy you will get (This assumes the UNIQUE Constraint on the Email column has been removed from the table in the database. If you only remove it from the project deployment will fail with a dependency failure. More about this in a forthcoming post):

 

    Altering Customer.Customers...

========== Deploy: 1 succeeded, 0 failed, 0 skipped ==========

 

So dropping a column from a table does not get blocked, but changes to modify the schema of the table that may introduce the risk of data loss to data that would exist after the deployment will get blocked. Adding a NOT NULL column to the middle of the table will get blocked by the option because this change requires data motion since the table will have to be recreated. What is import here is that you will usually want this option “ON” for your deployments, but you must comprehend this check when managing data motion. Turing it “OFF” to deploy a schema change to only one table may be risky as it is possible that another change has snuck under the wire that may cause data loss. This option is for the whole deployment, not per object.

 

Let’s get into an example of data motion using our Customers.Customer table. Let’s assume we are deploying version 3 of the schema against a database which currently has version 1 (the versioning of this schema is described in a previous post). We are jumping to version 3 as deploying version 2 against version 1 does not require custom data motion scripts for the reasons noted above. Version 1 of the Customer.Customers table looks like this:

CREATE TABLE [Customer].[Customers]

(

      [FName] NVARCHAR(50) NOT NULL,

      [LName] NVARCHAR(50) NOT NULL,

      [Email] NVARCHAR(320) NOT NULL,

      [Created] DATETIME NOT NULL DEFAULT GetDate(),

      [Updated] DATETIME NOT NULL DEFAULT GetDate(),

      [Deleted] DATETIME

);

 

Not let’s assume the target database contains existing customer data. Since the new column, IsKeyCustomer, does not allow NULLs we will need to manage data motion so the deployment can be completed and the new Customer.Customers schema effected. To manage this the data motion we are going to start off by creating a script to make a copy of the existing data into a new table. I prefer to use actual tables within transactions as opposed to table variables or temp tables in case deployment fails I can easily recover from the failure . You could do recovery in script with temp tables, but that introduces more risk and is more effort than it is often worth. You may also need to disable any constraints that may depend on rows in the affected table. Finally you will want to delete all rows from the original table so the Block data loss.. option does not trigger a deployment failure. I also tend to sprinkle a few PRINT statements around so we can see where we are in deployment later on. We end up with a script that looks like:

 

PRINT 'STARTING PREDEPLOYMENT SCRIPT...'

SET NOCOUNT ON

BEGIN TRY

      BEGIN TRAN PREDATAMOTO_CUSTOMERS_TABLE

            PRINT 'BACKING UP [Customer].[Customers]'

            SELECT [FName],[LName],[Email],[Created],[Updated],[Deleted]

                  INTO [dbo].[xxxDATAMOTOxxxCustomers]

            FROM [Customer].[Customers]

            PRINT 'DELETING [Customer].[Customers]'

            DELETE [Customer].[Customers]

      COMMIT TRAN PREDATAMOTO_CUSTOMERS_TABLE;

END TRY

BEGIN CATCH

      ROLLBACK TRAN PREDATAMOTO_CUSTOMERS_TABLE;

      PRINT 'PREDATAMOTO_CUSTOMERS_TABLE TRANSACTION ROLLED BACK'

      PRINT CAST(ERROR_NUMBER() AS CHAR(5)) + ERROR_MESSAGE()

END CATCH

SET NOCOUNT OFF

PRINT 'PREDEPLOYMENT SCRIPT COMPLETE'

 

We now have the pre deployment script that readies the table for schema modification. Let’s put together the script to fix up the table after the deployment does its thing. In the post deployment script we will simply put the data back into the Customer.Customers table, but also set the value of the new column with a sub query. For our sample we will look up customers that are big spenders from the Accounting.CustomerInvoices table and use that to set the value of this new column on the customers table. It should be based on paid invoices or payments received, but we are ok with just knowing the big spenders not necessarily the big payers. To wrap up the post deployment script we will clean up after ourselves by dropping our temporary storage. This leaves us with the script:

 

PRINT 'STARTING POSTDEPLOYMENT SCRIPT...'

SET NOCOUNT ON

BEGIN TRY

      BEGIN TRAN POSTDATAMOTO_CUSTOMERS_TABLE;

            PRINT 'RESTORING [Customer].[Customers]'

            INSERT [Customer].[Customers]

            ([FName],[LName],[Email],[IsKeyCustomer],[Created],[Updated],[Deleted])

                  SELECT

                        [FName],

                        [LName],

                        [Email],

                              CASE

                                    WHEN EXISTS

                                    (SELECT SUM(InvoiceTotal) FROM [Accounting].[CustomerInvoices]

                                          WHERE [Accounting].[CustomerInvoices].[CustomerEmail] =

                                                [dbo].[xxxDATAMOTOxxxCustomers].[Email]

                                                      HAVING SUM(InvoiceTotal) >= 500000)

                                    THEN 1

                                    ELSE 0

                              END,

                        [Created],

                        [Updated],

                        [Deleted]

                  FROM [dbo].[xxxDATAMOTOxxxCustomers]

                        ORDER BY [dbo].[xxxDATAMOTOxxxCustomers].[Created]

            PRINT 'DROPPING [dbo].[xxxDATAMOTOxxxCustomers]'

            DROP TABLE [dbo].[xxxDATAMOTOxxxCustomers]

      COMMIT TRAN POSTDATAMOTO_CUSTOMERS_TABLE;

END TRY

BEGIN CATCH

      ROLLBACK TRAN POSTDATAMOTO_CUSTOMERS_TABLE;

      PRINT 'POSTDATAMOTO_CUSTOMERS_TABLE TRANSACTION ROLLED BACK'

      PRINT CAST(ERROR_NUMBER() AS CHAR(5)) + ERROR_MESSAGE()

END CATCH

SET NOCOUNT OFF

PRINT 'POSTDEPLOYMENT SCRIPT COMLETE'

 

We are now ready to deploy our project. Below is the schema of the table we deployed and in the output window an indication of success.

 

 

We now have the database target containing version 3 of the Customer.Customers table. The table defined in your project and the target database now match. The data is intact and we know who our key customers are.

 

Now let’s imagine the same scenario, but the Customer.Customers table is very large. It may have hundreds of thousands of rows and possibly even larger tables reference these rows through a foreign key constraint. The previous data motion approach may not be appropriate. An alternative would be to define the new non null-able column as the last column of the table and include a default to get it on to the table. The default will let you modify the table in place instead of having to recreate the table which would introduce the need for data motion. In the post deployment script you would only need to update the value of the new column and remove the default constraint on the table. The downside to this approach is that you will need to go back and remove the unnecessary default constraint from your source code. To alter our data motion for approach for a very large table we would make the following modifications:

 

With the Customer.Customers table in the project system, move the new columns to the end of the table and add a default constraint to the new non null-able column.

CREATE TABLE [Customer].[Customers]

(

      [FName] NVARCHAR(50) NOT NULL,

      [LName] NVARCHAR(50) NOT NULL,

      [Email] NVARCHAR(320) NOT NULL,

      [Created] DATETIME NOT NULL DEFAULT GetDate(),

      [Updated] DATETIME NOT NULL DEFAULT GetDate(),

      [Deleted] DATETIME,

      [Id] INT NOT NULL IDENTITY(1,1)

            CONSTRAINT [PK_Customers] PRIMARY KEY,

      [IsKeyCustomer] BIT NOT NULL

            CONSTRAINT [DF_Customers_IsKeyCustomer] DEFAULT (0)

);

 

Remove the script from the pre deployment script file. Update the post deployment script with the following script.

 

PRINT 'STARTING POSTDEPLOYMENT SCRIPT...'

SET NOCOUNT ON

BEGIN TRY

      BEGIN TRAN POSTDATAMOTO_CUSTOMERS_TABLE;

            PRINT 'UPDATE [Customer].[Customers].[IsKeyCustomer]'

            UPDATE [Customer].[Customers]

                  SET [IsKeyCustomer] =

                              CASE

                                    WHEN EXISTS

                                    (SELECT SUM(InvoiceTotal) FROM [Accounting].[CustomerInvoices]

                                          WHERE [Accounting].[CustomerInvoices].[CustomerEmail] =

                                                [Customer].[Customers].[Email]

                                                      HAVING SUM(InvoiceTotal) >= 500000)

                                    THEN 1

                                    ELSE 0

                              END

                  FROM [Customer].[Customers]

            PRINT 'DROPPING DEFAULT CONSTRAINT'

            ALTER TABLE [Customer].[Customers]

                  DROP CONSTRAINT [DF_Customers_IsKeyCustomer]

      COMMIT TRAN POSTDATAMOTO_CUSTOMERS_TABLE;

END TRY

BEGIN CATCH

      ROLLBACK TRAN POSTDATAMOTO_CUSTOMERS_TABLE;

      PRINT 'POSTDATAMOTO_CUSTOMERS_TABLE TRANSACTION ROLLED BACK'

      PRINT CAST(ERROR_NUMBER() AS CHAR(5)) + ERROR_MESSAGE()

END CATCH

SET NOCOUNT OFF

PRINT 'POSTDEPLOYMENT SCRIPT COMLETE'

 

This is a good place to stop for part 1. In part 2 we will look at how to organize your custom pre and post deployment scripts and configure them to execute only when necessary. This is especially important if you are maintaining multiple versions of the same schema.

Comments

  • Anonymous
    March 31, 2009
    So far, so excellent. I'm sitting on the edge of my seat for the next part. I would have tried using the pre & post deployment scripts but I wasn't sure how to version them and use them across environments. Now I'm getting jazzed.
  • Anonymous
    May 19, 2009
    The problem with the first pre- and post-deployment script option at the top of the post is that the script is only valid for deployment scenarios up to V3 of the schema.  After this, the script is potentially dangerous as it overwrites any post V3-deployment changes to IsKeyCustomer.  A more specialized data management process for both seed data and data motion is required in upcoming release of DataDude.  Schema stuff is really getting there, but data management is still pretty much in the dark ages with manually crafted scripts...
  • Anonymous
    February 05, 2014
    The comment has been removed
  • Anonymous
    February 23, 2015
    i was thinking about this problem just like this post, but the problem is, you will need to compare and check for schema differences from tables to tables, analyse them and create a data migration script. then put that chunky script back before publishing. this is a double work. ssdt teams should really add a feature like "add script during publishing (not pre nor post), right after adding new column, and right before dropping obsoleted ones"any ideas to workaround?