Partilhar via


T-SQL Update Takes Much Longer Than The Matching Select Statement

I realize the title is generic and that is because the problem is generic.   On the surface it would not surprise me that an update takes longer than a select. (A little bit anyway.)

There is logging, updates to index rows, triggers, replication needs, AlwaysOn needs, perhaps page splits and even re-ordering when the key values are changed that can take place.  All of this leads to additional work and elongates the processing.

However, I am taking about a situation where the update takes significantly longer than the select and it seems excessive when you start looking at the actual number of I/O operations and such that are taking place.

-------------------------------------------------------------------------------

For example: Select query runs in 22 minutes but update takes 140 minutes.   When I first looked at this and similar issues, I found it hard to believe the additional update-related processing added that much time.  

The customer had uncovered that if they set the SQL Server's max server memory setting to 2GB the update ran in ~45 minutes instead of 145 minutes. - Keep this in mind.

First Step 

I started looking at the plan differences between the update and the same select criteria.   I didn't see anything that popped out as significantly different for obtaining data on the select side of the input and the update side of the plan looked reasonable.

I then looked at the plan differences between the 2GB server memory setting and a larger memory setting - No differences, the same plan was being used. - Hmmm - interesting.

Second Step

I looked a the statistics time and I/O outputs to see if something significant could be uncovered.   The I/O was about the same but there was a significant difference in the CPU usage between the update and the select.

Third Step

Back to the plan for the update.  I was looking to see if it was possible the update portion of the plan could drive CPU if we had fetched the pages into memory.   Clearly I can come up with an update that touches a small number of pages, gets them locked into buffer pool memory and then updates the same rows many times; driving CPU and not physical I/O. - This was not the case for the scenario presented.  I had to update millions of rows to reproduce the problem.

Fourth Step

Started tracing the activity to see what other things were going on.   What I saw was lock escalation taking place when the query ran faster, under the 2GB SQL Server max server memory setting.

Locking

Now I had a pretty good idea that locking played a role in all of this.    I then enabled trace flag ( -T1211 - use with caution) to disable lock escalation and I could cause the same issue on the SQL Server's max server memory setting to 2GB installation.

Fifth Step

Using debugging tools I captured the execution of the scenario and looked at those code paths using the most CPU resources.  What I found was a code path related to creation and destruction of a lock class (but unfortunately there are no XEvents or Trace events in this area.)

Note: I did file work items with the development team to expose this activity.

Lock Class

At a high level , a lock class is a container of locks for a given part of a plan.  It is often used to protect a plan from changing data as the data passes from one portion of the plan to the next.  (Think hash, sort, spool … type of operations).

Let's discuss the following example:

update t   set t.strData = CAST(t.iID as varchar(10))   from tblTest t   join tblTest t2 on t2.iID = t.iID

 

  |--Table Update(OBJECT:([tempdb].[dbo].[tblTest] AS [t]), SET:([tempdb].[dbo].[tblTest].[strData] as [t].[strData] = [Expr1006]))
       |--Table Spool
          |--Top(ROWCOUNT est 0)                  |--Compute Scalar(DEFINE:([Expr1006]=CONVERT(varchar(10),[tempdb].[dbo].[tblTest].[iID] as [t].[iID],0)))                       |--Nested Loops(Left Semi Join, WHERE:([tempdb].[dbo].[tblTest].[iID] as [t].[iID]=[tempdb].[dbo].[tblTest].[iID] as [t2].[iID]))                            |--Table Scan(OBJECT:([tempdb].[dbo].[tblTest] AS [t]))                            |--Table Scan(OBJECT:([tempdb].[dbo].[tblTest] AS [t2]))

I have highlighted the table spool because this is where a lock class can appear (which you can't see).   What the SQL Server does is look at the locking strategy established by the session and the plan and if necessary upgrade the strategy to 'repeatable read' during portions of the plan.

In this case the rows for the update are being fed via a table spool.   This means SQL Server does not want to release the lock on the rows flowing through the spool until it has completed the proper update(s).  If the lock was not held through the table spool to the update level, the data could change as it is held in the spool. 

The problem is not the lock class.  Even if the isolation level is upgraded to 'repeatable read', during the window, resulted in reduced concurrency; that in and of itself won't lead to increased CPU usage.

The CPU is coming from the release activity associated with the lock class.   Once the lock class is no longer needed the SQL Server releases it and in turn the references to the appropriate locks are released.

Clearly I can see that lock escalation will reduce the number of locks and reduce this work. One workaround but probably not what most folks want to do in a production environment.

My first thought was how many locks is the update requiring that would cause us to do a lot of work during release?   In studying the lock acquired and released events for the update statement I found that it was only a handful.  So again, why the large CPU burn? - The problem was the update was inside a cursor loop that executed millions of times and all of this was under a single transaction.

Another workaround I found was to use smaller transactions, but more of them.   I found this odd as I am still doing the same amount of work just in smaller chunks.   Smaller chunks would help avoid lock escalations and I thought it would make it worse based on previous testing.

The Problem

What is happening is the lock class has a bug.  It is not properly releasing just the locks it acquired.  It is running the entire lock list for the transaction.   Because the locks acquired before the update don't have a lock class association there is nothing to do for those locks.   The locks are properly maintained, the SQL Server is just running the entire lock list instead of the portion associated with the lock class.

In this reproduction case there are 400,000 locks acquired before the cursor execution driving the updates.  This means that each update will run ALL 400,000 lock structures and find nothing to do.  As the next update occurs and SQL Server does it all over again, burning the unnecessary CPU.   Based on this behavior; when I dropped the size of the transaction I reduced the number of locks and as such the number of CPU cycles.

Note: Bug filed with the SQL Server development team.  Internal testing, with the bug fix, shows the query taking 140 minutes consistently runs in 32 minutes without any T-SQL code changes.   This is scheduled to be released in an upcoming release (~ Nov 2012).

----------------------------------------------------------------------------------------
Is This My Problem? Am I a Candidate for the Fix?

There are some things you can do to see if your statements are encountering this problem.

The easiest way to test is to 'set transaction isolation level repeatable read' at the session level and then run the set of queries in question.    Repeatable read may use more lock memory but it also acts as a broad 'lock class' for this test.   I have also used snapshot isolation with near repeatable read results.   

If the query has some of the before mentioned operations, runs significantly faster and uses far less CPU there is a good chance it is a candidate for the fix.

Without the fix you can use the transaction isolation levels, smaller transactions, or even locking hints to help control the behavior.

 

Bob Dorr - Principal SQL Server Escalation Engineer
assistance provide by Keith Elmore - - Principal SQL Server Escalation Engineer

Comments

  • Anonymous
    September 21, 2012
    Good Article.

  • Anonymous
    September 21, 2012
    Really interesting and useful post which learnt me many things ( it is far from my favorite topics SMO and SQL Server Data Access even if for the last topic, it is a problem i could meet ) You wrote "The customer had uncovered that if they set the SQL Server's max server memory setting to 2GB the update ran in ~45 minutes instead of 145 minutes."  Is it possible to know the max server memory used by the customer when he/she met this problem the 1st time ( to understand the importance othe max server memory ) ? And how long time between the call of the customer and the release of this post ? I suppose that all your tests needed numerous hours When the fix will be release, please , could you update your post to provide the number of the KB ? ( it would be useful for people facing this problem ) Anyway, an excellent example on the best way to test problems . So, many thanks

  • Anonymous
    September 25, 2012
    Would that bug affect different versions of SQL Server or is it specific to SQL Server 2012? And if it does affect previous version, are there any plans to release a fix for them?

  • Anonymous
    September 26, 2012
    (This comment has been deleted per user request)

  • Anonymous
    October 03, 2012
    Thanks for sharing this with the community.

  • Anonymous
    October 03, 2012
    Let me respond to some of the comments.

  • It applies to SQL 2005 and newer builds
  • Unless you have the capture details from the first run you can't tell the memory footprint
  • Anonymous
    October 16, 2012
    Will there be fixes available for SQL2005 and upwards or were they included in the MS12-070 fixes? Thanks Chris

  • Anonymous
    December 06, 2012
    SQL 2008 R2 - support.microsoft.com/.../2754301 SQL 2012 - Target of CU5 (Dec 2012)  - A CU4 based, on demand fix, is available if you contact support