Becoming an WPA Xpert Part 11: Troubleshooting Long Group Policy Processing
Hey Folks,
Though I’ve blogged before here I have been a bit rude and never properly introduced myself. Hi, I am James Klepikow, a Senior Premier Field Engineer with Microsoft. Nice to meet all of you. One of the many issues we troubleshoot as Field Engineers is slow boot up time. As you can see we are now at part 11 in our series.
Recently I was engaged on an issue at one of my customers where they were experiencing delays at computer startup. We noticed in the GroupPolicy Event Log many 5017s which contained delays of up to a minute and over!
If you are not familiar with what an Event 5017 is, it’s basically an event informing us that the system call to get account information was completed successfully by the Group Policy service. The event will also record the length of the delay. If you see delays, it could warrant investigation, especially on a SSD drive.
At that point, we decided to gather a WPR trace (that post sure looks familiar). The latest version of the Windows Performance Toolkit can be found here if you do not have it already. We used the following checkboxes for our capture:
Note: When viewing etls with WPA, make sure your symbols are configured appropriately.
Once we gathered our trace, we then use the Windows Performance Analyzer to help us understand where the delays came from. If you are not familiar with the Windows Performance Analyzer, don’t be scared! It’s basically a bunch of data and graphs that helps debugging complex performance problems easier. So, let’s get to the analysis…..
When we added the “Boot Phases” and “Windows Logon” graphs, this helped us break down the delay.
In the Boot Phases graph, we noticed most of the time was spent in the phase “Winlogon Init.” This phase begins when SMSSInit completes and starts winlogon.exe. During this phase, the user logon screen appears, the service control manager starts services, group policy processing occurs, and group policy scripts run. You can see below the duration of this phase was ~90 seconds.
Since the delay was in the Winlogon Init Phase, we reviewed the “Windows Logon” graph to break down the delay even further. Here, we saw the Total Notification Time of ~76 seconds for the GPClient Subscriber. The “Create Session” notification to the GPClient subscriber is essentially the time spent for computer based policy. Likewise, the “Logon” notification to the GPClient subscriber is the time spent for user based policy.
At this point, we knew our delay was coming from something during the processing of computer based group policies.
Let’s circle back to the event 5017. If we add the “Generic Events” graph, we will see the events for every provider captured. So, in our case we can see when the 5017 ocurred as well as the Process ID(1468) and Thread ID(1576) that did the work.
In the Generic Events for Microsoft-Windows-Group Policy, we clearly saw the big delay for the posting of Event 5017 of 63321ms under the OperationElapsedTimeInMilliSeconds column.
Note: I find if you add the following colums in this order to the Generic Events table, it makes the determination of timing a bit easier: Provider Name, Process, Time, Thread ID.
At this point, we knew there was this wait of 63 seconds so we needed to use wait analysis to determine the “why?”
To do this, we added the CPU (Precise) Utilization by Process and Thread graph and started our Wait Analysis.
I will focus on the svchost (1468) and thread (1576) since this is the area of the wait obtained from the Generic Events graph. The goal here is to trace the wait to the idle ReadyingProcess.
Note: To get your graph to match, make sure you add the following colums: NewProcess, NewThreadId, NewThreadStack, SwitchInTime, ReadyingProcess, ReadyingThreadID, and ReadyThreadSatck. Also, add Count and Waits(sum) to the golden bar.
You can see above that we end on a lsass thread.
Following the waits on lsass.exe PID 696 TID 848:
Following the waits on lsass.exe PID 696 TID 804:
Following svchost threads. We see above that we have 2 waits totaling about 42 seconds combined. Let’s take a look at one of these svchost threads to see what it’s doing.
In the above graph we could see 2 DNS waits of some kind since we are calling dnsapi.dll. Also the function called “ResolverQuery,” made us realize the wait was part of DNS Resolution.
Taking a look at the idle thread revealed some sort of timeout:
Since I enabled the network checkbox in WPR, I also have the Generic Events for Microsoft-Windows-TCPIP. Here, I noticed many packet drops and retransmits up to the customer’s event 5017. Note: Customer’s IP addresses were removed below but there was much evidence in the table to show the lack of connectivity.
At this point, my gut feeling was firewall. One thing which got our attention was the following checkbox with the customer’s Symantec product.
Under the “UI delays” graph, we could clearly see there was a COM call happening during our gap. We could also see connectivity resume when the Symantec CMC Firewall starts:
The “Transient Lifetime By Process, Image” graph helped us isolate the time when the firewall started.
After we unchecked the checkbox, our boot up times decreased dramatically. After research, I found the following TECH doc.
https://www.symantec.com/docs/TECH199907
The article specifically calls out the following as best practice:
“This setting should be used in conjunction with a location awareness policy and only used in scenarios in which the client may be connecting to a public network or an environment beyond the control of the enterprise network administrators. It is not recommended for use inside a corporate network. Most enterprise domains require additional authentication or connection requirements at boot and log-in sequences using NTLM and kereberos. Additional security policy items may require connection to a domain controller to even allow a machine to be logged unto or unlocked. Use of this setting in such an environment can cause significant delays in the ability to log into a machine, and in some scenarios may make log-in impossible. “
Here is a shot of the results:
Here you can see we improved the “Winlogon Init” phase by 83 seconds!!!! Gotta love those evil checkboxes.
Until next time.
Cheers,
James Klepikow Platforms PFE