Dela via


Optimize MERGE statement when only a “slice” of target table is affected

Note: This post assumes reader is familiar with SQL Server 2008 MERGE statement (refresher).

In recent months I had a few chances to use MERGE statement, and I realized that frequently input data applies only to a “slice” or "window" (could say “partition”, but let’s not reuse that term) of the target table. Let me illustrate on example.

Suppose database tracks occupancy of offices in a campus with multiple buildings. Suppose that updates always come in per-building batches. Below shows how to handle such cases in optimal way.

Let’s define our table structure and get some sample data:

--USE <YourDatabase>
DROP TABLE BuildingOccupancy
DROP TABLE ImportBuildingData
GO

-- Define tables

-- This is our target table
CREATE TABLE BuildingOccupancy(BuildingId int, OfficeNumber int, OccupiedBy int,
     CONSTRAINT BuildingOccupancy_PK PRIMARY KEY(BuildingId, OfficeNumber));

-- This is our source data.
-- Note that it does not have BuildingId because business rules dictate that import is done per-building
-- The corresponding BuildingId to be imported will come from a variable
CREATE TABLE ImportBuildingData(OfficeNumber int, OccupiedBy int,
     CONSTRAINT ImportBuildingData_PK PRIMARY KEY(OfficeNumber));

-- Populate tables with mock-up data

-- Create target table's data.
-- Say we have 40 buildings, each with 5 floors and 200 offices on each floor.
-- That is 40K rows
INSERT BuildingOccupancy(BuildingId,OfficeNumber,OccupiedBy)
SELECT Building.BuildingID,Office.OfficeNumber, OccupiedBy = ROW_NUMBER() OVER (ORDER BY Office.OfficeNumber,Building.BuildingID)
FROM
  (
    -- this just returns numbers 1..40
    SELECT TOP 40 BuildingID = ROW_NUMBER() OVER (ORDER BY object_id) FROM msdb.sys.objects
  ) AS Building
  CROSS JOIN
  (
    -- this just returns numbers 1..1000
    SELECT TOP 1000 OfficeNumber = ROW_NUMBER() OVER (ORDER BY object_id) FROM msdb.sys.objects
  ) AS Office

-- Create some import data.
-- Suppose the new building data is 90% unchanged, the rest 10% are split between
-- 5% deletes (people "moved out" of the offices) and 5% updates (someone else moved into the office)
-- Let's say import data happens to be about building 12

INSERT ImportBuildingData(OfficeNumber, OccupiedBy)
SELECT OfficeNumber, OccupiedBy
FROM BuildingOccupancy WHERE BuildingId = 12

-- Delete 5% of the data
DELETE ImportBuildingData
WHERE OfficeNumber % 20 = 1

-- Update 5% of the data
UPDATE ImportBuildingData
SET OccupiedBy = OccupiedBy+30000
WHERE OfficeNumber % 20 = 5

Now, let’s import the data using the MERGE statement. We’ll compare two approaches which produce same result:

-- Approach 1:

DECLARE @InputBuilding INT = 12;

MERGE BuildingOccupancy AS t
  USING ImportBuildingData AS s
  ON t.OfficeNumber = s.OfficeNumber AND t.BuildingId = @InputBuilding
WHEN MATCHED AND t.OccupiedBy <> s.OccupiedBy THEN
  UPDATE SET OccupiedBy = s.OccupiedBy
WHEN NOT MATCHED BY TARGET THEN
  INSERT VALUES(@InputBuilding, OfficeNumber, OccupiedBy)
WHEN NOT MATCHED BY SOURCE AND t.BuildingId = @InputBuilding THEN
  DELETE;

-- Approach 2:

DECLARE @InputBuilding INT = 12;

;WITH FilteredBuilding AS (
       SELECT *
       FROM BuildingOccupancy
       WHERE BuildingId = @InputBuilding
)
MERGE FilteredBuilding AS t
  USING ImportBuildingData AS s
  ON t.OfficeNumber = s.OfficeNumber AND t.BuildingId = @InputBuilding
WHEN MATCHED AND t.OccupiedBy <> s.OccupiedBy THEN
  UPDATE SET OccupiedBy = s.OccupiedBy
WHEN NOT MATCHED BY TARGET THEN
  INSERT VALUES(@InputBuilding, OfficeNumber, OccupiedBy)
WHEN NOT MATCHED BY SOURCE THEN
       DELETE;

Approach 1 uses AND t.BuildingId = @InputBuilding in the delete section to make sure we don’t delete data that belongs to other buildings.
Approach 2 uses CTE (Common Table Expression – refresher) to restrict MERGE operation to only building of interest. Note that there’s no longer need for additional condition in the delete section.

While both approaches achieve same result, performance is quite different. Approach 1 does 500+ reads and Approach 2 does less than 10. This is because the whole target table (BuildingOccupancy) is scanned in the first approach, and in the second approach index seek is used (please feel free to inspect this in the query plan).

You may have noted that approach 2 is using a redundant AND condition in the join (e.g. why did I add AND t.BuildingId = @InputBuilding if it’s already filtered out in the CTE). This is actually needed to help optimizer a bit. If we were to use literal value, the additional clause would not be necessary. SQL team might optimize this in the next version, but adding this extra clause does not hurt.

MERGE also allows use of view or TVF as a source, but I believe using CTE is most readable choice.

Acknowledgements: Thanks to Sara Tahir for the CTE trick and to Andrew Richardson for explaing the need of redundant join clause.

P.S. If you have SQL Server 2008 older than RC0, above MERGE statements will fail with syntax error. There was a slight wording change in merge clauses between CTP6 and RC:

Old Syntax:

New Syntax (RC0 and all subsequent releases):

WHEN TARGET NOT MATCHED

WHEN NOT MATCHED BY TARGET

WHEN SOURCE NOT MATCHED

WHEN NOT MATCHED BY SOURCE