Sampling SQL server batch activity
Recently I was troubleshooting a performance issue on an internal tracking workload and needed to collect some very low level events over a period of 3-4 hours. During analysis of the data I found that a common pattern I was using was to find a batch with a duration that was longer than average and follow all the events it produced.
This pattern got me thinking that I was discarding a substantial amount of event data that had been collected, and that it would be great to be able to reduce the collection overhead on the server if I could still get all activity from some batches.
In the past I’ve used a sampling technique based on the counter predicate to build a baseline of overall activity (see Mikes post here). This isn’t exactly what I want though as there would certainly be events from a particular batch that wouldn’t pass the predicate. What I need is a way to identify streams of work and select say one in ten of them to watch, and sql server provides just such a mechanism: session_id. Session_id is a server assigned integer that is bound to a connection at login and lasts until logout. So by combining the session_id predicate source and the divides_by_uint64 predicate comparator we can limit collection, and still get all the events in batches for investigation.
CREATE EVENT SESSION session_10_percent ON SERVER
ADD EVENT sqlserver.sql_statement_starting(
WHERE (package0.divides_by_uint64(sqlserver.session_id,10))),
ADD EVENT sqlos.wait_info (
WHERE (package0.divides_by_uint64(sqlserver.session_id,10))),
ADD EVENT sqlos.wait_info_external (
WHERE (package0.divides_by_uint64(sqlserver.session_id,10))),
ADD EVENT sqlserver.sql_statement_completed(
WHERE (package0.divides_by_uint64(sqlserver.session_id,10)))
ADD TARGET ring_buffer
WITH (MAX_DISPATCH_LATENCY=30 SECONDS,TRACK_CAUSALITY=ON)
GO
There we go; event collection is reduced while still providing enough information to find the root of the problem.
By the way the performance issue turned out to be an IO issue, and the session definition above was more than enough to show long waits on PAGEIOLATCH*.