Sdílet prostřednictvím


Troubleshoot high CPU issue without using profile traces

 

The scope of these issues is very wide but I would discuss about the approach which I follow to resolve these issues. I have resolved more than 50% of the cases by using this approach without even looking at profiler traces. I hope this will help you troubleshooting issues in your environment.

It’s specific to SQL server 2008 – 64 bit environments but could be tried on 32 bit environments too.

Let’s discuss about how to start it step by step.

Step 1 : - When you see high CPU utilization on the server check in task manager, which process is consuming high CPU. If it’s SQL server then open perfmon and add the counter for Processor: % Privileged time, % Processor time and %user time

If you see the %privileged time to be high like hovering around 40% and above then it could be some driver issue. To troubleshoot these kinds of issues you could try:

Using Xperf: - https://blogs.msdn.com/b/ntdebugging/archive/2010/03/22/using-xperf-to-root-cause-cpu-consumption.aspx -- Video; https://msdn.microsoft.com/en-us/performance/cc709422

If it’s user time which is high then the below steps will be applicable for troubleshooting:

Step 2: - First and very important one is to know what’s the architecture i.e. 64 bit (X64 or IA 64) or 32 bit(x86):

Check the configuration using command: - Xp_msver

In this blog I will talk specifically about SQL server 2008 64-bit.

Step 3: - Check if there are any trace flags enabled:

dbcc tracestatus (-1); It’s always good to check the reason of enabling the trace flags, if enabled.

Step 4: - Check the below mentioned SQL parameters

sp_configure 'advanced',1

go

reconfigure with override

go

sp_configure

go

  1. Priority Boost: - it’s not recommended to be enabled as it can cause high CPU utilization. If it’s enabled (value =1) please disable(set value to 0) it - it’s a static parameter which needs SQL restart
  2. Max degree of parallelism: - Although this parameter doesn’t make much of a difference but sometimes I have seen SPID stuck in ACCESS_METHODS_SCAN_RANGE_GENERATOR, cxpacket or other parallel waittypes. If you see any such waits when you run this query :

Select a.session_id, a.wait_type, a.wait_time, a.* from sys.dm_exec_requests a, sys.dm_exec_sessions b

where a.session_id=b.session_id and b.is_user_process=1

If in your case waittypes to be any of related to parallelism and the waittime is more than 20 ms reduce the parallelism to e.g. if 8 then to 4 and so on unless the waits go away.

Note: - If you have more than 8 processors, please make sure the max degree of parallelism parameter is 8 or below 8.

  1. Ensure that you have set max server memory by leaving minimum 3-4 GB for the OS

Step 4: - Run dbcc memorystatus and check for the USERSTORE_TOKENPERM memory clerk. If the value is high say above 400 MB then follow the article: - https://support.microsoft.com/kb/927396

Step 5: - If everything mentioned above is alright then next thing we have to check is compilations and recompilations :

  1. Open perfmon and add the counters: -

MSSQL: SQL statistics à Batch requests/s, compilations/s and recompilations/s.

If you notice compilations/sec are greater than 20% then there is definitely fair amount of chances for Procedure cache pollution. To confirm that please run the query:

WITH cached_plans (cacheobjtype, objtype, usecounts, size_in_bytes, dbid, objectid, short_qry_text) AS(

SELECT p.cacheobjtype, p.objtype, p.usecounts, size_in_bytes, s.dbid, s.objectid,

CONVERT (nvarchar(100), REPLACE (REPLACE (

CASE

-- Special cases: handle NULL s.[text] and 'SET NOEXEC'

WHEN s.[text] IS NULL THEN NULL

WHEN CHARINDEX ('noexec', SUBSTRING (s.[text], 1, 200)) > 0 THEN SUBSTRING (s.[text], 1, 40)

-- CASE #1: sp_executesql (query text passed in as 1st parameter)

WHEN (CHARINDEX ('sp_executesql', SUBSTRING (s.[text], 1, 200)) > 0)

THEN SUBSTRING (s.[text], CHARINDEX ('exec', SUBSTRING (s.[text], 1, 200)), 60)

-- CASE #3: any other stored proc -- strip off any parameters

WHEN CHARINDEX ('exec ', SUBSTRING (s.[text], 1, 200)) > 0

THEN SUBSTRING (s.[text], CHARINDEX ('exec', SUBSTRING (s.[text], 1, 4000)),

CHARINDEX (' ', SUBSTRING (SUBSTRING (s.[text], 1, 200) + ' ', CHARINDEX ('exec', SUBSTRING (s.[text], 1, 500)), 200), 9) )

-- CASE #4: stored proc that starts with common prefix 'sp%' instead of 'exec'

WHEN SUBSTRING (s.[text], 1, 2) IN ('sp', 'xp', 'usp')

THEN SUBSTRING (s.[text], 1, CHARINDEX (' ', SUBSTRING (s.[text], 1, 200) + ' '))

-- CASE #5: ad hoc UPD/INS/DEL query (on average, updates/inserts/deletes usually

-- need a shorter substring to avoid hitting parameters)

WHEN SUBSTRING (s.[text], 1, 30) LIKE '%UPDATE %' OR SUBSTRING (s.[text], 1, 30) LIKE '%INSERT %'

OR SUBSTRING (s.[text], 1, 30) LIKE '%DELETE %'

THEN SUBSTRING (s.[text], 1, 30)

-- CASE #6: other ad hoc query

ELSE SUBSTRING (s.[text], 1, 45)

END

, CHAR (10), ' '), CHAR (13), ' ')) AS short_qry_text

FROM sys.dm_exec_cached_plans p

CROSS APPLY sys.dm_exec_sql_text (p.plan_handle) s

)

SELECT COUNT(*) AS plan_count, SUM (size_in_bytes) AS total_size_in_bytes,

cacheobjtype, objtype, usecounts, dbid, objectid, short_qry_text

FROM cached_plans

GROUP BY cacheobjtype, objtype, usecounts, dbid, objectid, short_qry_text

HAVING COUNT(*) > 100

ORDER BY COUNT(*) DESC

RAISERROR ('', 0, 1) WITH NOWAIT

In the output you will need check:

Plan_count : - number of times the similar plan was found in the procedure cache

use count : - How many times the plan was reused and in the output of this query you will see the use_count = 1 most of the times which will tell that the query was not being reused.

Total_size_in_bytes: - It’s the memory consumed by the execution plans of the similar kinds.

For the description of the counters please check: - https://msdn.microsoft.com/en-us/library/aa173933(v=sql.80).aspx

In many cases you will find this to be the problem.

If you see the procedure cache pollution then you need to check the memory clerk: CACHESTORE_SQLCP in “dbcc memorystatus” command output (run in query analyzer).

In 64 bit system I have seen this memory to grow up to 8GB or more which is sign of excessive compilation. After you confirm this, there can be two possibilities

  1. Memory pressure which can cause cleanup of the execution plans and make SQL to compile the plan and in this case there will be no procedure cache pollution.
  2. There are excessive Ad-hoc queries which are causing this compilation and you will see high pollution in the queries.

For point number 2, there are steps which could be followed:

  1. Identify the text and database of the query which is being compiled in the output of the above mentioned query and set parameterization to forced
  2. Parameterize the ad-hoc query :- https://msdn.microsoft.com/en-us/library/ee343986(v=sql.100).aspx

If you see high re-compilations then check the article : - https://msdn.microsoft.com/en-us/library/ee343986(v=sql.100).aspx .

Also, make sure statistics of the tables are updated with fullscan. To check the statistics updated date, run the query:

EXEC sp_msforeachtable 'select object_name(object_id) tablename,name as index_name,stats_date(object_id,index_id) as statistics_update_date from sys.indexes where Object_id=object_id("?")'

If after implementing the above action plans issue doesn’t resolve then:

Check for the hashing and sorting in SQL server:

Open perfmon and add the counters: - MSSQL: memory manager (Granted workspace area) or check the output of the query:

select * from sys.dm_os_memory_clerks where type='MEMORYCLERK_SQLQERESERVATIONS'

If the value is very high e.g. more than like 3-4 GB and I have seen this value to go till 10 GB in high memory machines. It’s a clear sign of big hash tables /huge sorting being done.

Then run the query (only for sql server 2008) : -

  1. Run the query :

print '-- query and plan hash capture --'

print '-- query and plan hash capture --'

print '-- top 10 CPU by query_hash --'

select getdate() as runtime, *  --into tbl_QueryHashByCPU

from

(

SELECT TOP 10 query_hash, COUNT (distinct query_plan_hash) as 'distinctquery_plan_hash count',

sum(execution_count) as 'execution_count',

       sum(total_worker_time) as 'total_worker_time',

       SUM(total_elapsed_time) as 'total_elapsed_time',

       SUM (total_logical_reads) as 'total_logical_reads',

       

    max(REPLACE (REPLACE (SUBSTRING (st.[text], qs.statement_start_offset/2 + 1,

      CASE WHEN qs.statement_end_offset = -1 THEN LEN (CONVERT(nvarchar(max), st.[text]))

        ELSE qs.statement_end_offset/2 - qs.statement_start_offset/2 + 1

      END), CHAR(13), ' '), CHAR(10), ' '))  AS sample_statement_text

FROM sys.dm_exec_query_stats AS qs

CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st

group by query_hash

ORDER BY sum(total_worker_time) DESC

) t

go

Find out which one is on the TOP and then run the below query to find the missing indexes:

DECLARE @runtime datetime

DECLARE @cpu_time_start bigint, @cpu_time bigint, @elapsed_time_start bigint, @rowcount bigint

DECLARE @queryduration int, @qrydurationwarnthreshold int

DECLARE @querystarttime datetime

SET @runtime = GETDATE()

SET @qrydurationwarnthreshold = 5000

PRINT ''

PRINT '==============================================================================================='

PRINT 'Missing Indexes: '

PRINT 'The "improvement_measure" column is an indicator of the (estimated) improvement that might '

PRINT 'be seen if the index was created. This is a unitless number, and has meaning only relative '

PRINT 'the same number for other indexes. The measure is a combination of the avg_total_user_cost, '

PRINT 'avg_user_impact, user_seeks, and user_scans columns in sys.dm_db_missing_index_group_stats.'

PRINT ''

PRINT '-- Missing Indexes --'

SELECT CONVERT (varchar, @runtime, 126) AS runtime,

mig.index_group_handle, mid.index_handle,

CONVERT (decimal (28,1), migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) AS improvement_measure,

'CREATE INDEX missing_index_' + CONVERT (varchar, mig.index_group_handle) + '_' + CONVERT (varchar, mid.index_handle)

+ ' ON ' + mid.statement

+ ' (' + ISNULL (mid.equality_columns,'')

+ CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END + ISNULL (mid.inequality_columns, '')

+ ')'

+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,

migs.*, mid.database_id, mid.[object_id]

FROM sys.dm_db_missing_index_groups mig

INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle

INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle

WHERE CONVERT (decimal (28,1), migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) > 10

ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC

PRINT ''

GO

For information on the DMV’s used in the query, please check https://msdn.microsoft.com/en-us/library/ms345421.aspx

You will definitely find some index recommendations for the TOP queries. Create the indexes and see if that makes the difference.

If creating the indexes doesn’t help then you will have to find the execution plan of the Top in the Top CPU query and try to tune it.

Comments

  • Anonymous
    August 19, 2011
    The comment has been removed
  • Anonymous
    July 31, 2012
    The comment has been removed
  • Anonymous
    December 03, 2013
    The comment has been removed
  • Anonymous
    February 12, 2014
    The comment has been removed
  • Anonymous
    February 20, 2014
    The comment has been removed
  • Anonymous
    December 23, 2014
    Thanks Guys!! I have started my personal blog at new address : http://dbcouncil.net