SQL Server: Delete a Huge Amount of Data from a Table
How often do you face the situation that old records have to be removed from a table to free storage for new data? If you have a very large table with billions of records, the correct strategy is essential for the workload and all other underlying operations:
- Consider the huge amount of transaction log a DELETE operation will generate (a short monitoring sketch follows this list)
- In a high availability scenario every logged operation has to be transferred to the mirror / AlwaysOn secondaries / log shipping partners
- Each logged operation is time consuming
- Resources stay locked until the operation is committed
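How much of the transaction log a long-running DELETE actually consumes can be watched while it runs. A minimal sketch, assuming SQL Server 2012 or later for sys.dm_db_log_space_usage:
-- Log size and percentage used for all databases
DBCC SQLPERF(LOGSPACE);
GO
-- Log usage of the current database only
SELECT total_log_size_in_bytes,
       used_log_space_in_bytes,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;
GO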
The above points give an idea of what a good deletion strategy has to take into account. Very often the following recommendation is given in forums (a sketch of this workflow follows the list):
- Set the database to the SIMPLE recovery model
- Make a full backup
- Delete the data
- Rebuild the indexes
- Return to the FULL recovery model
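As a rough sketch, that recommended workflow would look like the following script; the table name, backup paths and the date predicate are only placeholders taken from the demo environment created below:
ALTER DATABASE DeleteRecord SET RECOVERY SIMPLE;
GO
BACKUP DATABASE DeleteRecord TO DISK = N'S:\Backup\DeleteRecord.bak';
GO
-- remove the "old" records (placeholder predicate)
DELETE dbo.bigtable WHERE c3 < '20140101';
GO
-- rebuild the fragmented indexes
ALTER INDEX ALL ON dbo.bigtable REBUILD;
GO
ALTER DATABASE DeleteRecord SET RECOVERY FULL;
GO
-- a full (or differential) backup is required to restart the log backup chain
BACKUP DATABASE DeleteRecord TO DISK = N'S:\Backup\DeleteRecord_after.bak';
GO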
The above strategy may work on a single server, but it cannot be used in a high availability scenario because such a scenario requires the FULL recovery model! This article compares two different solutions and describes what happens behind the scenes.
Environment for tests
The first script creates a demo database and fills a demo table with 100,000 records. This table will be used for the investigation of all following scenarios:
IF db_id('DeleteRecord') IS NOT NULL
BEGIN
ALTER DATABASE DeleteRecord SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE DeleteRecord;
END
CREATE DATABASE DeleteRecord
ON PRIMARY
(
NAME = N'DelRecord',
FILENAME = N'S:\Backup\DelRecord.mdf',
SIZE = 100MB,
MAXSIZE = 1000MB,
FILEGROWTH = 100MB
)
LOG ON
(
NAME = N'DelRecord_log',
FILENAME = N'S:\Backup\DelRecord.ldf',
SIZE = 100MB,
MAXSIZE = 1GB,
FILEGROWTH = 100MB
);
GO
-- change the recovery model to SIMPLE
ALTER DATABASE DeleteRecord SET RECOVERY SIMPLE;
GO
-- Now create a table with a clustered index on an Id and add 100,000 records
USE DeleteRecord;
GO
CREATE TABLE dbo.bigtable
(
Id int NOT NULL IDENTITY (1, 1),
c1 char(100) NOT NULL DEFAULT ('only stupid stuff'),
c2 varchar(100) NOT NULL DEFAULT ('more stupid stuff'),
c3 date NOT NULL DEFAULT (getdate()),
CONSTRAINT pk_bigTable PRIMARY KEY CLUSTERED (Id)
);
GO
SET NOCOUNT ON
GO
INSERT INTO dbo.bigtable DEFAULT VALUES
GO 100000
Setting the recovery model of the database to BULK_LOGGED before running the DELETE
This recommendation can be read quite often in answers to questions about fast deletion of data. Unfortunately it does not help, because DELETE is a fully logged DML operation that does not support minimal (BULK) logging. The following script runs a DELETE operation against the above table and deletes all even-numbered records. The operation is executed once for each of the three recovery models (SIMPLE, FULL, BULK_LOGGED):
-- Change the recovery model for each run!
ALTER DATABASE DeleteRecord SET RECOVERY SIMPLE;
GO
BEGIN TRANSACTION
-- delete half of the records
DELETE dbo.bigTable WHERE Id % 2 = 0;
-- rebuild the index because it's fragmented
ALTER INDEX ALL ON dbo.bigTable REBUILD;
SELECT database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');
ROLLBACK TRANSACTION
The DELETE operation is wrapped in a dedicated transaction to measure the amount of log space written within the transaction. After half of the records have been deleted, the clustered index needs to be rebuilt as part of the complete DELETE process because the data pages have a low density (only 50% filled) once the DELETE has finished. The ROLLBACK TRANSACTION is used to avoid rebuilding the test scenario for each test run.
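The low page density after the DELETE can be verified with sys.dm_db_index_physical_stats. A minimal sketch, run inside the open transaction before the REBUILD:
-- avg_page_space_used_in_percent should be around 50% after every second record has been deleted
SELECT index_level,
       avg_page_space_used_in_percent,
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(db_id('DeleteRecord'), object_id('dbo.bigtable'), 1, NULL, 'DETAILED')
WHERE index_level = 0;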
For the above process the generated amount of transaction log is as follows:
Recovery model | Transaction log
SIMPLE | 11,065,344 bytes
BULK_LOGGED | 11,096,036 bytes
FULL | 18,086,656 bytes
The result shows that the BULK_LOGGED recovery model does not significantly change the amount of transaction log produced. This is not really surprising, because DELETE is not a DML operation that supports minimal logging. If the recovery model is set to FULL, all operations are fully logged. So the above strategy does not seem to be the best choice with regard to the amount of transaction log space produced. Some people recommend deleting the data in small portions and backing up the log in between; this scenario is not examined in this article, but a short sketch is shown below for completeness.
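A minimal sketch of that batched approach; the batch size and the backup path are placeholders, and the log backups assume the FULL or BULK_LOGGED recovery model:
WHILE 1 = 1
BEGIN
    -- delete the next portion of records
    DELETE TOP (10000) dbo.bigtable WHERE Id % 2 = 0;
    IF @@ROWCOUNT = 0
        BREAK;
    -- keep the transaction log small by backing it up after each portion
    BACKUP LOG DeleteRecord TO DISK = N'S:\Backup\DeleteRecord.trn';
END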
Use of TRUNCATE instead of DELETE for the deletion of records
Using TRUNCATE instead of DELETE seems appealing, because TRUNCATE does not log the deletion of each single record. TRUNCATE is a DDL (Data Definition Language) operation and not a DML (Data Manipulation Language) operation, and only the changes to the allocation structures and system tables are part of the transaction. TRUNCATE deallocates all data pages of the underlying table and therefore removes ALL records. So an intermediate step is required: all records which should stay have to be stored in a staging table. The next code demonstrates the process in detail:
-- Drop the staging table if it already exists
IF OBJECT_ID('dbo.bigtable_intermediate', 'U') IS NOT NULL
DROP TABLE dbo.bigtable_intermediate;
GO
BEGIN TRANSACTION
-- Bulk logged
SELECT * INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 1;
-- minimally logged because it is a DDL operation
TRUNCATE TABLE dbo.bigtable;
SET IDENTITY_INSERT dbo.bigTable ON;
INSERT INTO dbo.bigtable (Id, c1, c2, c3)
SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
SET IDENTITY_INSERT dbo.bigtable OFF;
SELECT database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');
ROLLBACK TRANSACTION
The above code first checks whether the staging table already exists. Because SELECT ... INTO is used, a BULK operation that creates its target table, the staging table must not exist; otherwise the operation would fail. After the remaining records have been copied into the staging table, the source table is rebuilt by TRUNCATE. When the table has been truncated, the data from the staging table is transferred back to the original table. An INDEX REBUILD is not necessary, because TRUNCATE creates a new allocation structure for the table that contains no "old" pages. By using ORDER BY in the INSERT INTO statement it is guaranteed that no index fragmentation will occur. The results for the amount of transaction log in all recovery models are as follows:
Recovery model | Transaction log
SIMPLE | 12,006,508 bytes
BULK_LOGGED | 12,006,508 bytes
FULL | 19,091,820 bytes
The recovery models SIMPLE and BULK_LOGGED perform "best", in contrast to the recovery model FULL. The main part of the transaction log is generated by the INSERT INTO operation, while in SIMPLE and BULK_LOGGED the first two operations (SELECT ... INTO and TRUNCATE) generate less than 1% of the total amount of transaction log (a sketch of how these per-statement values can be measured follows the list):
- SELECT ... INTO consumes ~40 KB
- TRUNCATE consumes ~1 KB
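The per-statement values above can be obtained by querying the same DMV after each single statement and comparing the deltas. A minimal sketch for the SELECT ... INTO step (the same pattern applies to TRUNCATE and INSERT INTO):
IF OBJECT_ID('dbo.bigtable_intermediate', 'U') IS NOT NULL
    DROP TABLE dbo.bigtable_intermediate;
GO
DECLARE @log_before bigint = 0, @log_after bigint = 0;
BEGIN TRANSACTION;
SELECT @log_before = database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');

SELECT * INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 1;

SELECT @log_after = database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');

-- log bytes written by the SELECT ... INTO alone
SELECT @log_after - @log_before AS select_into_log_bytes;
ROLLBACK TRANSACTION;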
Why is the amount of transaction log so high for the INSERT INTO ... SELECT? The INSERT INTO command can only benefit from minimal logging if the target table is exclusively locked. An exclusive lock can be requested with the TABLOCK hint, as the following script shows.
BEGIN TRANSACTION
-- Bulk logged
SELECT *
INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 1;
-- minimally logged because it is a DDL operation
TRUNCATE TABLE dbo.bigtable;
-- Bulk logged because the target table is exclusively locked!
SET IDENTITY_INSERT dbo.bigTable ON;
INSERT INTO dbo.bigtable WITH (TABLOCK) (Id, c1, c2, c3)
SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
SET IDENTITY_INSERT dbo.bigtable OFF;
SELECT database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');
ROLLBACK TRANSACTION
The difference in the amount of generated transaction log is tremendous:
Recovery model | Transaction log
SIMPLE | 89,568 bytes
BULK_LOGGED | 89,568 bytes
FULL | 14,128,668 bytes
The recovery model FULL never allows minimally logged operations; it always forces fully logged operations. Only the recovery models SIMPLE and BULK_LOGGED make it possible to work with minimally logged operations.
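Whether minimal logging can be expected at all can therefore be checked quickly before running such a workload; a minimal sketch:
-- SIMPLE or BULK_LOGGED is required for minimally logged operations
SELECT name,
       recovery_model_desc
FROM sys.databases
WHERE name = N'DeleteRecord';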
Conclusion
Due to the growing amount of data, the requirements for fast workloads grow simultaneously. Microsoft SQL Server meets these requirements with minimally (bulk) logged operations. Basically, one has to distinguish between DML operations that support minimal logging and DML operations that do not.
DELETE is not a DML operation that supports minimal logging. If a table has no constraints that prevent a TRUNCATE, the following workload seems to be the best choice with regard to minimally logged operations (a sketch that combines the three steps follows below):
- Separate the data which should not be deleted by using SELECT ... INTO
- Rebuild the table by using TRUNCATE
- Fill the newly created table by using INSERT INTO ... SELECT with the TABLOCK hint on the target table
SELECT ... INTO and INSERT INTO ... WITH (TABLOCK) both support minimal logging, and TRUNCATE is a tremendously efficient way to get rid of data because it only logs the deallocation of the data pages!
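Putting the three steps together, a minimal sketch of the complete workload could look like the following script; the surrounding transaction and the TRY/CATCH block are additions for safety, and THROW requires SQL Server 2012 or later:
BEGIN TRY
    BEGIN TRANSACTION;

    -- 1. separate the surviving records (minimally logged in SIMPLE / BULK_LOGGED)
    SELECT * INTO dbo.bigtable_intermediate
    FROM dbo.bigtable
    WHERE Id % 2 = 1;

    -- 2. rebuild the table (only the page deallocations are logged)
    TRUNCATE TABLE dbo.bigtable;

    -- 3. reload the surviving records (minimally logged because of TABLOCK)
    SET IDENTITY_INSERT dbo.bigtable ON;
    INSERT INTO dbo.bigtable WITH (TABLOCK) (Id, c1, c2, c3)
    SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
    SET IDENTITY_INSERT dbo.bigtable OFF;

    DROP TABLE dbo.bigtable_intermediate;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;
END CATCH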
See Also
DELETE-Command: http://technet.microsoft.com/en-us/library/ms191244.aspx
TRUNCATE-Command: http://technet.microsoft.com/en-us/library/ms177570.aspx
INSERT INTO ...: http://technet.microsoft.com/en-us/library/ms174335.aspx
Note: This article refers to the following question(s) in the MSDN forums of Microsoft:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/6bc6b5c0-99da-4299-9661-7cb96683bdec/need-to-build-a-job-that-purges-old-data-and-rebuilds-indexes?forum=sqldatabaseengine
and
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/76a0976d-4d27-4374-a2e9-c6ca1d83521f/removing-large-number-of-records-with-truncate?forum=sqldatabaseengine
Both threads are looking for a solution to the same problem.