SQL Server: Delete a Huge Amount of Data from a Table
How often do you face the situation that old records have to be removed from a table to free storage for new data? If you have a very large table with billions of records, the correct strategy is essential for the workload and all other underlying operations:
- Consider the huge amount of transaction log a DELETE operation will generate (a short monitoring sketch follows this list)
- In a high availability scenario every logged operation has to be transferred to the mirror / AlwaysOn secondaries / log shipping partners
- Each logged operation is time consuming
- Resources stay locked until the operation is committed
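How much of the transaction log a long-running DELETE actually consumes can be watched while it runs. A minimal sketch, assuming SQL Server 2012 or later for sys.dm_db_log_space_usage:
-- Log size and percentage used for all databases
DBCC SQLPERF(LOGSPACE);
GO
-- Log usage of the current database only
SELECT total_log_size_in_bytes,
       used_log_space_in_bytes,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;
GO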
The above points give an idea of what a good deletion strategy has to take into account. Very often the following recommendation is given in forums (a sketch of this workflow follows the list):
- Set the database to the SIMPLE recovery model
- Make a full backup
- Delete the data
- Rebuild the indexes
- Return to the FULL recovery model
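As a rough sketch, that recommended workflow would look like the following script; the table name, backup paths and the date predicate are only placeholders taken from the demo environment created below:
ALTER DATABASE DeleteRecord SET RECOVERY SIMPLE;
GO
BACKUP DATABASE DeleteRecord TO DISK = N'S:\Backup\DeleteRecord.bak';
GO
-- remove the "old" records (placeholder predicate)
DELETE dbo.bigtable WHERE c3 < '20140101';
GO
-- rebuild the fragmented indexes
ALTER INDEX ALL ON dbo.bigtable REBUILD;
GO
ALTER DATABASE DeleteRecord SET RECOVERY FULL;
GO
-- a full (or differential) backup is required to restart the log backup chain
BACKUP DATABASE DeleteRecord TO DISK = N'S:\Backup\DeleteRecord_after.bak';
GO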
The above strategy may work on a single server, but it cannot be used in a high availability scenario because such a scenario requires the FULL recovery model! This article compares two different solutions and describes what happens behind the scenes.
Environment for tests
The first script creates a demo database and fills a demo table with 100,000 records. This table will be used for the investigation of all following scenarios:
IF db_id('DeleteRecord') IS NOT NULL
BEGIN
ALTER DATABASE DeleteRecord SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE DeleteRecord;
END
CREATE DATABASE DeleteRecord
ON PRIMARY
(
NAME = N'DelRecord',
FILENAME = N'S:\Backup\DelRecord.mdf',
SIZE = 100MB,
MAXSIZE = 1000MB,
FILEGROWTH = 100MB
)
LOG ON
(
NAME = N'DelRecord_log',
FILENAME = N'S:\Backup\DelRecord.ldf',
SIZE = 100MB,
MAXSIZE = 1GB,
FILEGROWTH = 100MB
);
GO
-- change the recovery model to SIMPLE
ALTER DATABASE DeleteRecord SET RECOVERY SIMPLE;
GO
-- Now create a table with a clustered index on an Id and add 100,000 records
USE DeleteRecord;
GO
CREATE TABLE dbo.bigtable
(
Id int NOT NULL IDENTITY (1, 1),
c1 char(100) NOT NULL DEFAULT ('only stupid stuff'),
c2 varchar(100) NOT NULL DEFAULT ('more stupid stuff'),
c3 date NOT NULL DEFAULT (getdate()),
CONSTRAINT pk_bigTable PRIMARY KEY CLUSTERED (Id)
);
GO
SET NOCOUNT ON
GO
INSERT INTO dbo.bigtable DEFAULT VALUES
GO 100000
Setting the recovery model of the database to BULK_LOGGED before running the DELETE
This recommendation can be read quite often in answers to questions about fast deletion of data. Unfortunately it does not help, because DELETE is a fully logged DML operation that does not support minimal (BULK) logging. The following script runs a DELETE operation against the above table and deletes all even-numbered records. The operation is executed once for each of the three recovery models (SIMPLE, FULL, BULK_LOGGED):
-- Change the recovery model for each run!
ALTER DATABASE DeleteRecord SET RECOVERY SIMPLE;
GO
BEGIN TRANSACTION
-- delete half of the records
DELETE dbo.bigTable WHERE Id % 2 = 0;
-- rebuild the index because it's fragmented
ALTER INDEX ALL ON dbo.bigTable REBUILD;
SELECT database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');
ROLLBACK TRANSACTION
The DELETE operation is wrapped in a dedicated transaction to measure the amount of log space written within the transaction. After half of the records have been deleted, the clustered index needs to be rebuilt as part of the complete DELETE process because the data pages have a low density (only 50% filled) once the DELETE has finished. The ROLLBACK TRANSACTION is used to avoid rebuilding the test scenario for each test run.
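The low page density after the DELETE can be verified with sys.dm_db_index_physical_stats. A minimal sketch, run inside the open transaction before the REBUILD:
-- avg_page_space_used_in_percent should be around 50% after every second record has been deleted
SELECT index_level,
       avg_page_space_used_in_percent,
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(db_id('DeleteRecord'), object_id('dbo.bigtable'), 1, NULL, 'DETAILED')
WHERE index_level = 0;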
For the above process the generated amount of transaction log is as follows:
Recovery model | Transaction log
SIMPLE | 11,065,344 bytes
BULK_LOGGED | 11,096,036 bytes
FULL | 18,086,656 bytes
The result shows that the BULK_LOGGED recovery model does not significantly change the amount of transaction log produced. This is not really surprising, because DELETE is not a DML operation that supports minimal logging. If the recovery model is set to FULL, all operations are fully logged. So the above strategy does not seem to be the best choice with regard to the amount of transaction log space produced. Some people recommend deleting the data in small portions and backing up the log in between; this scenario is not examined in this article, but a short sketch is shown below for completeness.
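A minimal sketch of that batched approach; the batch size and the backup path are placeholders, and the log backups assume the FULL or BULK_LOGGED recovery model:
WHILE 1 = 1
BEGIN
    -- delete the next portion of records
    DELETE TOP (10000) dbo.bigtable WHERE Id % 2 = 0;
    IF @@ROWCOUNT = 0
        BREAK;
    -- keep the transaction log small by backing it up after each portion
    BACKUP LOG DeleteRecord TO DISK = N'S:\Backup\DeleteRecord.trn';
END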
Use of TRUNCATE instead of DELETE for the deletion of records
Using TRUNCATE instead of DELETE seems appealing, because TRUNCATE does not log the deletion of each single record. TRUNCATE is a DDL (Data Definition Language) operation and not a DML (Data Manipulation Language) operation, and only the changes to the allocation structures and system tables are part of the transaction. TRUNCATE deallocates all data pages of the underlying table and therefore removes ALL records. So an intermediate step is required: all records which should stay have to be stored in a staging table. The next code demonstrates the process in detail:
-- Drop the staging table if it already exists
IF OBJECT_ID('dbo.bigtable_intermediate', 'U') IS NOT NULL
DROP TABLE dbo.bigtable_intermediate;
GO
BEGIN TRANSACTION
-- Bulk logged
SELECT * INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 1;
-- minimally logged because it is a DDL operation
TRUNCATE TABLE dbo.bigtable;
SET IDENTITY_INSERT dbo.bigTable ON;
INSERT INTO dbo.bigtable (Id, c1, c2, c3)
SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
SET IDENTITY_INSERT dbo.bigtable OFF;
SELECT database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');
ROLLBACK TRANSACTION
The above code first checks whether the staging table already exists. Because SELECT ... INTO is used, a BULK operation that creates its target table, the staging table must not exist; otherwise the operation would fail. After the remaining records have been copied into the staging table, the source table is rebuilt by TRUNCATE. When the table has been truncated, the data from the staging table is transferred back to the original table. An INDEX REBUILD is not necessary, because TRUNCATE creates a new allocation structure for the table that contains no "old" pages. By using ORDER BY in the INSERT INTO statement it is guaranteed that no index fragmentation will occur. The results for the amount of transaction log in all recovery models are as follows:
Recovery model | Transaction log
SIMPLE | 12,006,508 bytes
BULK_LOGGED | 12,006,508 bytes
FULL | 19,091,820 bytes
The recovery models SIMPLE and BULK_LOGGED perform "best", in contrast to the recovery model FULL. The main part of the transaction log is generated by the INSERT INTO operation, while in SIMPLE and BULK_LOGGED the first two operations (SELECT ... INTO and TRUNCATE) generate less than 1% of the total amount of transaction log (a sketch of how these per-statement values can be measured follows the list):
- SELECT ... INTO consumes ~40 KB
- TRUNCATE consumes ~1 KB
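The per-statement values above can be obtained by querying the same DMV after each single statement and comparing the deltas. A minimal sketch for the SELECT ... INTO step (the same pattern applies to TRUNCATE and INSERT INTO):
IF OBJECT_ID('dbo.bigtable_intermediate', 'U') IS NOT NULL
    DROP TABLE dbo.bigtable_intermediate;
GO
DECLARE @log_before bigint = 0, @log_after bigint = 0;
BEGIN TRANSACTION;
SELECT @log_before = database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');

SELECT * INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 1;

SELECT @log_after = database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');

-- log bytes written by the SELECT ... INTO alone
SELECT @log_after - @log_before AS select_into_log_bytes;
ROLLBACK TRANSACTION;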
Why is the amount of transaction log so high for the INSERT INTO ... SELECT? The INSERT INTO command can only benefit from minimal logging if the target table is exclusively locked. An exclusive lock can be requested with the TABLOCK hint, as the following script shows.
BEGIN TRANSACTION
-- Bulk logged
SELECT *
INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 1;
-- minimally logged because it is a DDL operation
TRUNCATE TABLE dbo.bigtable;
-- Bulk logged because the target table is exclusively locked!
SET IDENTITY_INSERT dbo.bigTable ON;
INSERT INTO dbo.bigtable WITH (TABLOCK) (Id, c1, c2, c3)
SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
SET IDENTITY_INSERT dbo.bigtable OFF;
SELECT database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions
WHERE database_id = db_id('DeleteRecord');
ROLLBACK TRANSACTION
The difference in the amount of generated transaction log is tremendous:
Recovery model | Transaction log
SIMPLE | 89,568 bytes
BULK_LOGGED | 89,568 bytes
FULL | 14,128,668 bytes
The recovery model FULL never allows minimally logged operations; it always forces fully logged operations. Only the recovery models SIMPLE and BULK_LOGGED make it possible to work with minimally logged operations.
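Whether minimal logging can be expected at all can therefore be checked quickly before running such a workload; a minimal sketch:
-- SIMPLE or BULK_LOGGED is required for minimally logged operations
SELECT name,
       recovery_model_desc
FROM sys.databases
WHERE name = N'DeleteRecord';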
Conclusion
Due to the growing amount of data, the requirements for fast workloads grow simultaneously. Microsoft SQL Server meets these requirements with minimally (bulk) logged operations. Basically, one has to distinguish between DML operations that support minimal logging and DML operations that do not.
DELETE is not a DML operation that supports minimal logging. If a table has no constraints that prevent a TRUNCATE, the following workload seems to be the best choice with regard to minimally logged operations (a sketch that combines the three steps follows below):
- Separate the data which should not be deleted by using SELECT ... INTO
- Rebuild the table by using TRUNCATE
- Fill the newly created table by using INSERT INTO ... SELECT with the TABLOCK hint on the target table
SELECT ... INTO and INSERT INTO ... WITH (TABLOCK) both support minimal logging, and TRUNCATE is a tremendously efficient way to get rid of data because it only logs the deallocation of the data pages!
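Putting the three steps together, a minimal sketch of the complete workload could look like the following script; the surrounding transaction and the TRY/CATCH block are additions for safety, and THROW requires SQL Server 2012 or later:
BEGIN TRY
    BEGIN TRANSACTION;

    -- 1. separate the surviving records (minimally logged in SIMPLE / BULK_LOGGED)
    SELECT * INTO dbo.bigtable_intermediate
    FROM dbo.bigtable
    WHERE Id % 2 = 1;

    -- 2. rebuild the table (only the page deallocations are logged)
    TRUNCATE TABLE dbo.bigtable;

    -- 3. reload the surviving records (minimally logged because of TABLOCK)
    SET IDENTITY_INSERT dbo.bigtable ON;
    INSERT INTO dbo.bigtable WITH (TABLOCK) (Id, c1, c2, c3)
    SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
    SET IDENTITY_INSERT dbo.bigtable OFF;

    DROP TABLE dbo.bigtable_intermediate;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;
END CATCH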
See Also
DELETE-Command: http://technet.microsoft.com/en-us/library/ms191244.aspx
TRUNCATE-Command: http://technet.microsoft.com/en-us/library/ms177570.aspx
INSERT INTO ...: http://technet.microsoft.com/en-us/library/ms174335.aspx
Note: This article refers to the following question(s) in the MSDN forums of Microsoft:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/6bc6b5c0-99da-4299-9661-7cb96683bdec/need-to-build-a-job-that-purges-old-data-and-rebuilds-indexes?forum=sqldatabaseengine
and
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/76a0976d-4d27-4374-a2e9-c6ca1d83521f/removing-large-number-of-records-with-truncate?forum=sqldatabaseengine
Both threads are looking for a solution to the same problem.