How does SQL Server perform deletes in my table?
In this post we’ll look at how SQL Server deletes records in a table. In this example, we’ll be using a clustered index.
For performance, SQL Server does not actually delete a row of data when requested by a transaction. By not incurring this cost when the transaction is executing, performance is enhanced. Additionally, less work needs to be done should the transaction eventually be rolled back. Instead, SQL Server marks the record as being “ghosted”. Then a background thread in SQL Server periodically finishes the job of removing these records from the page by fully deleting them and doing any related cleanup work – such as adjusting the slot array. When a record is marked as a “ghost” record, the corresponding bits are modified in a bitmap in the header of the record structure to reflect this.
So let’s dig into the internals of SQL Server to prove this… First, we’ll do this example today with SQL Server 2008 RTM (I hope to get the latest service pack applied before the next post). The background thread runs periodically (every 5 seconds) looking for these ghosted records. Each time, it progresses through a handful of pages to avoid degrading system performance. The pages marked as ghosted will not be returned by any queries. To capture the ghost record cleanup task, we use the following TSQL:
-- create a table with the same structure as sys.dm_exec_requests
select * into #GhostBuster
from (select * from sys.dm_exec_requests where 1=0) x-- create & init a variable to hold the count of threads found
declare @i tinyint
set @i = 0;-- now loop until we find at least 1 record
while (@i = 0)
begin
insert into #GhostBuster
select * from sys.dm_exec_requests where command like '%
GHOST%';
set @i = (select count(*) from #GhostBuster);
end-- show the record
select * from #GhostBuster
Our result is:
So now we see that we do in fact have a process running in the background doing cleanup of ghosted records. But how can we prove this further? Let’s walk through deleting a record and watch this happen.
So first, I am using the AdventureWorksLT database from the SQL Server 2008 Product Sample Databases from Codeplex. If you need those, then go here:
Now, I’ll be deleting a record from the SalesLT.SalesOrderDetail table. I’ve chosen a record at random, and the record I’ve chosen is for ProductID 990 of SalesOrderID 71780:
select * from SalesLT.SalesOrderDetail where SalesOrderID = 71780 and ProductID = 990
So now, we want to watch this record get ghosted – then deleted. First thing we need to do is find what page this record exists on. We’ll start out by clearing out the buffer pool entirely.
-- flush dirty pages
checkpoint
-- drop clean ones
dbcc dropcleanbuffers
Now, we query our record to pull the page into the buffer pool:
select * from SalesLT.SalesOrderDetail where SalesOrderID = 71780 and ProductID = 990
Now, let’s view the buffers:
select * from sys.dm_os_buffer_descriptors where database_id = DB_ID('AdventureWorksLT')
So now we see a few things of interest… First, we are dealing with database ID 9 – which we’ll need in just a minute. Secondly, we have a mixture of data pages (DATA_PAGE) and index pages (INDEX_PAGE). In the red box, we can see that this page (and one other) has a page_level of 1 whereas the others have a page_level of 0. SQL Server uses doubly-linked lists in a b-tree data structure for indexes. These structures have levels. A level of 0 equates to what is often referred to as the “leaf” level. This is the level of the structure that has the information we want. Since we are working with a clustered index, the pages at the leaf level are data pages. In a non-clustered index, they would be index pages (these are the pages marked INDEX_PAGE at page_level 0). So, in order to find our page, check the intermediate level pages on level 1 and see where they take us:
NOTE: we could have narrowed this down by querying in on metadata, but this is a small example and helps to explain this structure
dbcc traceon(3604)
godbcc page(9,1,216,3)
go
Page 216 is an intermediate index level page. SalesOrderID is part of the key of the clustered index, so it contains these values to direct seeks/scans to the appropriate child pages. We know that we are interested in 71780. It is less than row 2 which directs everything starting at 71784 to page 191, so our record must be on the preceeding page – 189. Let’s look:
dbcc page(9,1,189,3)
go
PAGE: (1:189)
BUFFER:
BUF @0x0000000086F8D500
bpage = 0x000000008616A000 bhash = 0x0000000000000000 bpageno = (1:189)
bdbid = 9 breferences = 0 bUse1 = 34028
bstat = 0x1c00009 blog = 0x159a2159 bnext = 0x0000000000000000PAGE HEADER:
Page @0x000000008616A000
m_pageId = (1:189) m_headerVersion = 1 m_type = 1
m_typeFlagBits = 0x0 m_level = 0 m_flagBits = 0x224
m_objId (AllocUnitId.idObj) = 58 m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594041729024
Metadata: PartitionId = 72057594040156160 Metadata: IndexId = 1
Metadata: ObjectId = 533576939 m_prevPage = (0:0) m_nextPage = (1:191)
pminlen = 58 m_slotCnt = 128 m_freeCnt = 32
m_freeData = 7904 m_reservedCnt = 0 m_lsn = (20:228:37)
m_xactReserved = 0 m_xdesId = (0:625) m_ghostRecCnt = 0
m_tornBits = -1357435984Allocation Status
GAM (1:2) = ALLOCATED SGAM (1:3) = NOT ALLOCATED
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED 0_PCT_FULL DIFF (1:6) = NOT CHANGED
ML (1:7) = NOT MIN_LOGGEDSlot 0 Offset 0x60 Length 61
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP Record Size = 61
… <continues>
Here we have a data page since m_type is 1. Therefore, this is a leaf level page in the clustered index. We can analyze the output of DBCC PAGE and the structure of a page in another post, but for now – notice m_type is 1 (DATA_PAGE) and that the ghost record count (m_ghostRecCnt) = 0. If we scroll down, we find out record in the page:
Slot 7 Offset 0x20b Length 61
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP Record Size = 61
Memory Dump @0x0000000010F7A20B
0000000000000000: 10003a00 64180100 1cb00100 0100de03 †..:.d....°....Þ.
0000000000000010: 00000470 31000000 00000000 00000000 †...p1...........
0000000000000020: 000049a4 17f19d03 b844a4b2 b12001da †..I¤.ñ..¸D¤²± .Ú
0000000000000030: cc010000 0000f994 00000800 00††††††††Ì.....ù”.....Slot 7 Column 1 Offset 0x4 Length 4 Length (physical) 4
SalesOrderID = 71780
Slot 7 Column 2 Offset 0x8 Length 4 Length (physical) 4
SalesOrderDetailID = 110620
Slot 7 Column 3 Offset 0xc Length 2 Length (physical) 2
OrderQty = 1
Slot 7 Column 4 Offset 0xe Length 4 Length (physical) 4
ProductID = 990
… <continues>
The record continues, but above we see we have a primary data record with a SalesOrderID of 71780 and a ProductID of 990 – so we have the right page. The record is in slot 7 of the slot array.
Now, let’s delete the record. But we’ll wrap this in a transaction that we do *not* commit so that we can keep a lock on the page. This will prevent the Ghost Cleanup Task from deleting it. So we start with the delete:
begin tran
delete from SalesLT.SalesOrderDetail where SalesOrderID = 71780
and ProductID = 990
And now we see the record has changed to ghosted – but still on the page from the DBCC PAGE command:
Slot 7 Offset 0x20b Length 61
Record Type = GHOST_DATA_RECORD Record Attributes = NULL_BITMAP Record Size = 61
Memory Dump @0x0000000010F7A20B
0000000000000000: 1c003a00 64180100 1cb00100 0100de03 †..:.d....°....Þ.
0000000000000010: 00000470 31000000 00000000 00000000 †...p1...........
0000000000000020: 000049a4 17f19d03 b844a4b2 b12001da †..I¤.ñ..¸D¤²± .Ú
0000000000000030: cc010000 0000f994 00000800 00††††††††Ì.....ù”.....
So now we know the record is still on the page, but marked as a “ghosted” record in the header of the record. Additionally, we know that the record is locked since we did not commit our transaction – this prevents the Ghost Cleanup task from removing it until we are ready:
select * from sys.dm_tran_locks
Now, we want to watch SQL finish it off. Let’s do that with windbg (the windows GUI debugger). First, we need the PID of our SQL Server process (sqlservr.exe):
select SERVERPROPERTY('processid')
Now, we can attach the debugger and configure our symbols using the public symbols available at:
Here we load the symbols in windbg:
0:048> .sympath srv*c:\symbols\public*https://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:\symbols\public*https://msdl.microsoft.com/download/symbols
Expanded Symbol search path is: srv*c:\symbols\public*https://msdl.microsoft.com/download/symbols
0:048> .reload /f sqlservr.exe
What we want now is to know when that page is accessed. The debugger let’s us do this by breaking when a memory address is accessed. Our page is an instance of class that is just loaded into memory – into the buffer pool. So looking above at our DBCC PAGE readout, we can get the memory address of this in memory:
Page @0x000000008616A000
So we want to set a break so that the debugger halts execution of SQL Server anytime something accesses this memory address – which would be sqlservr.exe touching our page. So we issue the following windbg command:
NOTE: Always do this on a test or dev box – this effectively STOPS SQL Server from executing
0:048> ba r4 0x000000008616A000
0:048> bl
0 e 00000000`8616a000 r 4 0001 (0001) 0:****
The ‘ba’ command sets a “break on access” to memory. When something attempts to read at that address, the debugger will halt sqlservr.exe. The ‘bl’ command just lists all current breakpoints.
So now, we need to go back, check our record to see it is still there, and then commit the transaction so we release the lock on the page. First, we have to issue the ‘g’ command so the debugger will continue execution of sqlservr.exe. Now let’s verify our record is still on the page in a “ghosted” state with DBCC PAGE:
Slot 7 Offset 0x20b Length 61
Record Type = GHOST_DATA_RECORD Record Attributes = NULL_BITMAP Record Size = 61
Memory Dump @0x000000000D3EA20B
The record still exists as expected in a “ghosted’ state. Now we commit and watch the debugger finish removing the record:
COMMIT
Now, we effectively “step through” the deletion process by giving the command ‘g’ to allow the debugger to continue each time the breakpoint is hit. Here we are hitting the breakpoints with each successive frame on the stack while the ghost cleanup task removes this row:
0:037> g
Breakpoint 0 hit
sqlservr!FixPageNotThruLinkage+0x224:
00000000`00f7fd59 3c01 cmp al,1
0:036> g
Breakpoint 0 hit
sqlservr!IndexPageRef::IsGhostRemovable+0xb7:
00000000`00f7f4b7 0f85e17ee800 jne sqlservr!alloca_probe+0x13b731 (00000000`01e0739e) [br=0]
0:036> g
Breakpoint 0 hit
sqlservr!IndexPageRef::IsGhostRemovable+0xb7:
00000000`00f7f4b7 0f85e17ee800 jne sqlservr!alloca_probe+0x13b731 (00000000`01e0739e) [br=0]
0:036> g
Breakpoint 0 hit
sqlservr!IndexPageRef::ExpungeGhost+0x20:
00000000`0148fd00 4c8b4008 mov r8,qword ptr [rax+8] ds:00000000`810ab018=00000000810aa8e0
0:036> g
Breakpoint 0 hit
sqlservr!PageLog::MapContext+0x9:
00000000`0147c6a9 4183f901 cmp r9d,1
0:036> g
Breakpoint 0 hit
sqlservr!Page::DeleteRow+0x247:
00000000`01036fee 0f8507475000 jne sqlservr!alloca_probe+0x9bcb9 (00000000`0153b6fb) [br=0]
That last call to DeleteRow should have done the deed. To see the whole thread’s progress, let’s check the call stacktrace:
0:036> kc
Call Site
sqlservr!Page::DeleteRow
sqlservr!PageRef::ExpungeGhostRow
sqlservr!IndexPageRef::ExpungeGhost
sqlservr!CleanVersionsOnBTreePage
sqlservr!IndexDataSetSession::CleanupVersionsOnPage
sqlservr!GhostExorciser::CleanupPage
sqlservr!TaskGhostCleanup::ProcessTskPkt
sqlservr!GhostRecordCleanupTask
sqlservr!CGhostCleanupTask::ProcessTskPkt
sqlservr!TaskReqPktTimer::ExecuteTask
sqlservr!OnDemandTaskContext::ProcessTskPkt
sqlservr!SystemTaskEntryPoint
sqlservr!OnDemandTaskContext::FuncEntryPoint
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
MSVCR80!_callthreadstartex
MSVCR80!_threadstartex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart
Then take a look at the parameters passed to ExpungeGhostRow:
00000000`00000007 00000000`85c6d970 00000000`00000000 00000000`8616a000 : sqlservr!PageRef::ExpungeGhostRow+0x245
The first parameter is 00000000`00000007 – or 7. Remember, our row was at Slot 7?
So now it should be gone, all that remains is to hit ‘g’ to go back to SQL Server and use DBCC PAGE to confirm:
Slot 7 Offset 0x248 Length 61
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP Record Size = 61
Memory Dump @0x000000000FADA248
0000000000000000: 10003a00 64180100 1db00100 01009e03 †..:.d....°....ž.
0000000000000010: 000074de 16000000 00000000 00000000 †..tÞ............
0000000000000020: 00002b05 e592d072 914c9a8c 42591803 †..+.å’Ðr‘LšŒBY..
0000000000000030: 667e0000 0000f994 00000800 00††††††††f~....ù”.....Slot 7 Column 1 Offset 0x4 Length 4 Length (physical) 4
SalesOrderID = 71780
Slot 7 Column 2 Offset 0x8 Length 4 Length (physical) 4
SalesOrderDetailID = 110621
Slot 7 Column 3 Offset 0xc Length 2 Length (physical) 2
OrderQty = 1
Slot 7 Column 4 Offset 0xe Length 4 Length (physical) 4
ProductID = 926
… <continues>
Slot 7 now points to another row with a product ID of 926. A search of the entire output of DBCC PAGE shows that Product ID 990 is nowhere on the page.
That’s it, SQL Server has deleted our row in the background after the “delete” operation completed.
-Jay
Comments
- Anonymous
May 21, 2013
Hi Jay,I have a question.---------------Script---------------------------------------use AdventureWorks;goselect * into testfrom sales.SalesOrderDetail (nolock)goalter table test add primary key (SalesOrderDetailID)-- SalesOrderDetailID int,identity,max:121317goset statistics io onselect MIN (SalesOrderDetailID) from test --logical reads:3set statistics io offgodelete from test where SalesOrderDetailID <= 121317goset statistics io onselect MIN (SalesOrderDetailID) from test --logical reads:2999set statistics io offSo my questions are:1:Why the two select querys logical IO differ so much?2:Are index intermediate levels reflect the delete changes immediately, or after ghost-cleanup process finish deleted-records cleanup process?please forgive me my poor english,thanks. - Anonymous
May 23, 2013
The comment has been removed - Anonymous
May 24, 2013
Hi Jay,Thansk a lot for your in time reply.i also made a test:1.use "dbcc ind ('AdventureWorks','test',1)" to find table test pages,and record levle 2 index page number:195482.run script below:dbcc page (8,1,19548,3) with tableresults --page contains:6 rowsdelete from test where SalesOrderDetailID <= 110000dbcc page (8,1,19548,3) with tableresults --page contains:6 rowswaitfor delay '00:00:03'dbcc page (8,1,19548,3) with tableresults --page contains:3 rows3.so my guess is index top level pages do not become smaller after deletion immediatly ,so my second select query after delete statement do access the data as the first select query,but when they get the first index data page (all the records in this page have been deleted,and marked as ghost),it has to get data from the second index data page,untill it get to the first page where contains records not deleted.4.my first thought is when records are being deleting,some index nondata pages will become smaller accordingly and reflect the deletion,not wait a while,but the test shows this not the case. - Anonymous
June 03, 2013
Correct Mark. The index level records cannot be deleted if the data rows still exist. All of it is cleaned up in the background by the ghost record cleanup later.