Jaa


SCCM 2012 SQL performance got you down after applying SP1?

CM2012 SP1 introduced some changes to the stored procedures responsible for locating a distribution point on behalf of the client. There are two stored procedures responsible for content lookup - MPGetContentDPInfoUnprotected and MPGetContentDPInfoprotected. In my case, it was MPGetContentDPInfoUnprotected that was impacting the server. As a result of the changes in SP1, this significantly increased the demand on SQL's TempDB usage. If you have a sufficiently sized environment and depending on how aggressive you are when deploying packages/apps/updates, you may experience this issue. The CU2 update for SP1 addresses this issue.

What are some of the symptoms? Viewing SQL Activity Monitor you may see something like:

 

Eventually over time this becomes…..

 

Activity Monitor allows me to easily see “expensive" queries. If you right-click the suspect query contributing to this issue, selecting Edit Text allows you to view the query to see what is contained within. From above, I can see there is one in particular that has a large average duration time – 77 seconds in my example – this happens to be one of the queries that is spawned by the stored procedure in question. Obviously, 77 seconds to return data for the clients local DP(s) is excessive even without knowing how long this is “supposed” to take. The data above in the first Activity Monitor capture actually shows a server in the beginning phases of falling into a race condition. From what I noticed, once my customers SQL server got above 300-400 SQL waiting tasks, database I/O began to fall which then only causes more clients becoming queued waiting for a response – and so on…. Once we have a significant number of waiting tasks, all performance on the site server begins to degrade. This includes backlogs in DDM, state messages, inventory, etc... In addition, once you hit 900 waiting tasks, even Activity Monitor won't function properly and opening SQL Management Studio becomes a challenge.

One obvious way to stop this behavior is to disable the deployment that’s causing the load. This should be done directly on the primary as replication with the CAS will be hampered during this condition. I know, not a good choice! You can also limit the # of connections in IIS on the management point(s). Around 300 connections across all three management points, in my case, allowed the server to reach a workable steady state without accumulating wait states. Clients attempting to connect with the MP's when IIS is configured with a limited number of allowed connections will fail/retry but will eventually succeed. This is not a great solution as anyone trying to install an app via Software Center is likely to encounter an error and will need to retry numerous times before being lucky enough to establish a connection. All this behavior depends on your particular hardware, number of clients on assigned to the primary, policy polling interval, # of apps/updates/packages you are attempting to deploy. Other than for the purposes of getting the server back under control, this is just a generally bad idea.

Another potential source (or contributing factor) to this issue could be the MSI self healing feature within Configuration Manager - you know, that cool little feature that's been around since SMS 2003 that redirects clients to the local DP for self healing of MSI's. I was reminded of this by a fellow PFE, Heath Lawson, who's customer is also impacted by this issue. Apparently, if Source Update Manager on the client doesn't find the content on a DP, it goes from 1x daily to 24x daily content location requests per "source enabled" app/package. OUCH! In case you don't know what this feature is:

 Application/Deployment Type:

Package/Program:

 

 

 

Ok, after all that background, to mitigate this issue (not an exhaustive list I'm sure), one can:

1)      Be careful and slowly roll out packages/updates as to not have too many clients requesting content in close successive order. Creating numerous collections/deployments to load balance the situation is typically used in large environments anyway to avoid any potential network impact. Be self aware of what you're deploying. The number of software updates applicable to clients in your environment can vary drastically month-to-month

2)      Follow Microsoft best practice when architecting your SQL Server. For this particular issue, the query in question places a large burden on TempDB I/O. For my customer, we increased TempDB files from 1 to 16 files to match the number of processor cores (https://support.microsoft.com/kb/2154845). This doubled the disk performance on TempDB. And, this allowed us to distribute a package to the some 60,000 clients assigned to this particular primary site within the two hour policy polling interval. However, because the problematic stored procedure is run for each software update the client is attempting to install, patch Tuesday’s deployments still may not properly scale.

3)      Increase your SQL disk performance by moving SQL components to better hardware. However, throwing hardware at a problem that should not exist when you have followed design best practice should always be the last resort.

4)      Install secondary sites to spread the content location request load. In my particular case, other primary servers that are also hosting a fairly large client base are not impacted by this issue. It turns out, these sites have a fair number of secondary sites that are taking a significant portion of the content location requests burden.

5)      Place SQL replicas on your remote management point to distribute the SQL pressure. I haven’t personally tested this, however this should theoretically work akin to deploying secondary sites and also a supported solution. See https://technet.microsoft.com/en-us/library/hh846234.aspx

6)      Verify you are not using a large number of IP range boundaries. This in itself can be the source of your performance problem as the stored procedure involved runs much slower when it has to calculate whether you fall into an IP range as opposed to a subnet or AD site boundary. See https://blogs.technet.com/b/configmgrteam/archive/2013/03/01/when-not-to-use-ip-address-ranges-as-boundaries-in-configuration-manager.aspx?PageIndex=2#comments (specifically read Jason Adams reply to his own blog post).

7)       Install CU2 to gain back the performance that was lost with SP1. I performed some limited analysis (below) in a test environment of the MPGetContentDPInfoUnprotected procedure. Some have reported even more stored procedure performance gains than I received with the installation of CU2.

Obviously, option 2 and 7 will usually be the easiest and most impactful without slowing down deployments.

Let’s take a look what CU2 gives us. The analysis below was run using SQL Profiler while filtering on the stored procedure within a test environment consisting of one client. In my case, I was testing an application that was dependent on another application. Thus, we have two MPGetContentDPInfoUnprocected executions every time I installed my application via Software Center.

SP1+CU1 MPGetContentDPInfoUnprotected Performance (4 samples)

 

  Avg stored procedure run duration = 640.125 ms

 

SP1+CU2 MPGetContentDPInfoUnprotected Performance (4 samples)

Avg stored procedure run duration = 306.375 ms

 

As you can see, CU2 yields a significant performance increase.

Additional information:

In my case, I wanted to see what package/app/update was causing the numerous stored procedure executions. Or, at least, what is the “top talker”. If you view the TextData on the stored procedure within SQL Profiler, the data being passed to the stored procedure is contained within (includes the clients boundary, for instance):

 You can see the UID of the content being requested. This is easy to track if it’s a package as it’s just your Package ID. However, for software updates we need to determine what this “UID” matches to as far as a patch. I have found that it does not match the unique ID number of the patch with the Software Updates node (even with the Unique Update ID column enabled). However, we can find out what the content being queried for by the client by taking the UID and searching the CI_Contents table using SQL Management Studio (If you don't know how to do this, you probably shouldn't being playing in here anyway as it can be dangerous...you've been warned). For example,

 

 

Now we know the Content_ID (40827 in our case). We can query CI_ContentFiles for the content_ID and we get:

 

 The patch in this case being queried for a DP is kb2844286. This is one of the security updates released in July 2013 Maybe there’s an easier way, but I got bored looking in the console :)

Comments

  • Anonymous
    January 01, 2003
    Great explanation. We run in the same problem but with CU4 installed. We have 15'000 Agents, 1'100 packages, 280 Remote Distribution Points and 1'200 Subnet Boundaries. After activating our automaticaly Boundarymanagement, the SQL performance problem startet to happen. After reducing the 1'200 Boundary to one big Subnet Boundary, the performance was significant better and came into the regular base. In our case, the secondary Sites would probably the only way to get out of this situation. Performance optimisation has allready done on the largest HW that is awailable on the market. Thanks again for your analyzing and possible solutions. It helps us to get our environment running.

  • Anonymous
    August 05, 2013
    Great Job OZ! Excellent evaluation

  • Anonymous
    October 27, 2015
    Hey Oz,

    We believe we are experiencing significantly worse performance of the two content lookup SProc's since updating to SCCM 2012 R2 Sp1 Cu1 (from 2012 R2) . Have you had a chance to perform any analysis of the sproc's in an R2 SP1 environment?

    FYI - we have a premier support case open in regards to this exact issue.

    Will

  • Anonymous
    January 11, 2016
    Thanks for this, Oz! (Ozzy! Ozzy! Ozzy! Ozzy! Long time no talk!)

    We were experiencing pretty horrific performance on 2012 SP1 CU1 in the SQL space. Jumping to CU5 seemed to help quite a bit. We still are experiencing constraints on our SQL side but I believe those may be a SAN throughput issue.