Поделиться через


“What are you waiting for?” – Introducing wait stats support in Query Store

Troubleshooting waits previously.

Nobody likes to wait. SQL database is multithreaded system that can handle thousands of queries executed simultaneously. Since queries that are executed in parallel compete for the same resources (tables, memory, etc) they might need to wait for the resources to be available to proceed with execution. These cumulative waits can be very large and downgrade the whole database performance.

There are more than 900 wait types in SQL Server. Some are more important/frequent than others.

For a long time, the only way you could get closer to understanding what is waiting bottleneck of your workload was to look at instance (sys.dm_os_wait_stats) or recently added session (sys.dm_exec_session_wait_stats) level wait statistics. These options have certain limitations and might not provide optimal experience due to the following reasons:

  • It is not possible to deterministically correlate wait stats with specific queries.
  • These wait stats are kept only in memory, and are completely lost on server restarts and failovers, or when sessions are killed

Adding wait statistics in query store was one of the major asks from the community since we first released Query Store.

What changed?

Starting today in Azure SQL Database and from CTP 2.0 of SQL Server 2017 wait stats per query are available in Query Store. Now you can exactly identify why and how much every plan waited for some resource. Information about wait times are persisted so you can also analyze through the history what was the problems and why queries waited for resources

Wait categories explained

When we were designing this feature, two possible implementations were considered.
First and naïve approach, would have been to track all wait types statistics for every query. This approach has some drawbacks:

  • Presenting the wait stats in their original current format might not have the expected value add for the customer, as it would imply that the customer would have to understand wait stats in detail, which are not intuitive.
  • Due to a high number of different wait types, which will further grow in the future, it would not be possible to track them all separately in Query Store without introducing a significant performance and resource utilization overhead. As a side effect, Query Store will get into read-only state and it won’t be feasible to have it on by default in Azure SQL Database.

To address these issues, alternative approach was chosen. We are introducing wait categories, described here sys.query_store_wait_stats (Transact-SQL)

Wait categories are combining different wait types into buckets similar by nature. Different wait categories require a different follow up analysis to resolve the issue, but wait types from the same category lead to very similar troubleshooting experiences, and providing the affected query on top of waits would be the missing piece to complete the majority of such investigations successfully.

Besides, there are far less categories than actual wait types, so we address Query Store storage concerns.

Full mapping between wait types and wait categories is available here

Usage

 -- Let's get queries ordered by total waiting time
select 
 qt.query_text_id,
   q.query_id,
 p.plan_id,
  sum(total_query_wait_time_ms) as sum_total_wait_ms
from sys.query_store_wait_stats ws
join sys.query_store_plan p on ws.plan_id = p.plan_id
join sys.query_store_query q on p.query_id = q.query_id
join sys.query_store_query_text qt on q.query_text_id = qt.query_text_id
group by qt.query_text_id, q.query_id, p.plan_id
order by sum_total_wait_ms desc


-- Query with plan id = 8 has highest wait time, let's what wait categories contribute 
-- to this total time
select wait_category_desc, sum(total_query_wait_time_ms)
from sys.query_store_wait_stats
where plan_id = 8
group by wait_category_desc

-- Aha, it's locking, let's see actual query text for the above plan
select query_sql_text
from sys.query_store_query_text
where query_text_id = 74

-- Query text is (@2 tinyint,@1 int)UPDATE [testtbl] set [c1] = @1  WHERE [id]<@2 
-- Let's see other queries that access this table

select query_text_id, query_sql_text
from sys.query_store_query_text
where query_sql_text like '%testtbl%'

-- Now we have list of queries that could try to access this table at the same time,
-- so consider changing the application logic to improve concurrency, or use a less restrictive isolation level.

Here are some examples how performance troubleshooting workflows utilize new wait categories

Current experience New experience Action
High RESOURCE_SEMAPHORE waits per database High Memory waits in Query Store for specific queries Find the top memory consuming queries in Query Store. These queries are probably delaying further progress of the affected queries. Consider using MAX_GRANT_PERCENT query hint for these queries, or for the affected queries.
High LCK_M_X waits per database High Lock waits in Query Store for specific queries Check the query texts for the affected queries and identify the target entities. Look in Query Store for other queries modifying the same entity, which are executed frequently and/or have high duration. After identifying these queries, consider changing the application logic to improve concurrency, or use a less restrictive isolation level.
High PAGEIOLATCH_SH waits per database High Buffer IO waits in Query Store for specific queries Find the queries with a high number of physical reads in Query Store. If they match the queries with high IO waits, consider introducing an index on the underlying entity, in order to do seeks instead of scans, and thus minimize the IO overhead of the queries.
High SOS_SCHEDULER_YIELD waits per database High CPU waits in Query Store for specific queries Find the top CPU consuming queries in Query Store. Among them, identify the queries for which high CPU trend correlates with high CPU waits for the affected queries. Focus on optimizing those queries – there could be a plan regression, or perhaps a missing index.

Next steps

You can access wait stats information in your Azure SQL Database and SQL Server 2017 starting from CTP 2.0 using T-SQL.
We encourage you to try it out and see what learnings and optimizations wait stats categories in Query Store can bring.

Currently, we are working on expanding existing Query Store reports in SQL Server Management Studio (SSMS). These reports are going to expose new information in a friendly, easy to use way.
We are very excited and looking forward to hearing your feedback.

Comments

  • Anonymous
    July 03, 2017
    Hello,Thanks for this info.Any chance whether these might be available in sqlserver 2016 by any cu upgrade or something like that.I agree sql2017 also runs on windows,but i don't think it has all features of sqlserver 2016 .
  • Anonymous
    July 03, 2017
    Good stuff, thanks for the heads-up!But I still want profiler in Azure. I know, "managed instances" is coming soon ...
  • Anonymous
    July 22, 2017
    Nice article. Thanks for taking the time to post it.I've found that, typically, it's not what's doing the waiting that has a problem. Rather, it's the resource hog code that's currently running and that everything else is waiting on that's the problem. IMHO, it's better to look for things like duration, CPU usage, and number of logical and physical reads than waits.
    • Anonymous
      September 27, 2017
      The comment has been removed
  • Anonymous
    December 07, 2017
    We have databases spread across six elastic pools so anything that can help me get closer to quickly understand what is causing those max/near max spikes in a pool is golden.