次の方法で共有


AppDomain unloading messages flooding the SQL Server error log

This blog is built directly from a customer reported issue.  As I helped investigate the source of the issue I thought it would be of interest to a broader audience – hopefully you find this interesting, as well.

Allow me to provide some history of the problem before I dive into extended details.

The SQL Server error log was flooded with the following pattern, approximately every ~1 second.  Customer indicated lots of CLR_* wait types and mini-dumps revealed heavy GC (garbage collection) activity taking place.

...
2013-01-22 08:07:44.91 spid37s AppDomain 890288 (mssqlsystemresource.dbo[ddl].890287) unloaded.
2013-01-22 08:07:45.73 spid31s AppDomain 890289 (mssqlsystemresource.dbo[ddl].890288) unloaded.
2013-01-22 08:07:46.41 spid34s AppDomain 890290 (mssqlsystemresource.dbo[ddl].890289) unloaded.

...

The domain name was always the (same).  Domain names are based on database and object owner and given a generation id each time they are loaded.   The pattern shown above indicates the SQL Server has loaded and unloaded the mssqlsystemresource (DBO) application domain 890,000+ times – ouch!

This is not the normal pattern, as you might have imagined already.   The normal pattern is to show the domain loaded message paired with a matching unloaded message and separated by a reasonable length of time as the domain is utilized.  The error log was not showing any loaded messages, just the unloaded messages.

The other distinct difference that stood out is the text of the unloaded message.  There are ~6 different unloaded messages which the SQL Server can log.  All the other messages indicate a reason (out of memory, .NET exception, locking protocol violation, ….)  This was just generically unloaded without any reason given and exactly why the customer wanted to know what the cause of the issue was.

Note: The stack traces and debugging activities are using public symbols and a SQL Server 2008 build. https://support.microsoft.com/kb/311503 

SQL Server AppDomain Loaded Message

SQL Server does not log the application domain loaded message until the domain has been loaded and initialized.  Loading and initializing the CLR interfaces for an application domain are two distinct states.  This is important because technically the application domain is loaded within the CLR runtime (not fully initialized) and SQL Server does NOT record the application domain loaded message in the error log.

My Theory

Based on the behavior I had a hunch that this is a query cancellation or error occurring during the loading of the application domain and SQL Server didn’t reach full initialization so SQL Server is just starting up the domain (part way) and tearing it down (unloading) to cleanup properly.

Asking the customer what they were doing with CLR and looking at some traces (.TRC) the geography data type was in use and the supporting assembly is associated with a system database.

I was also able to use the sys.dm_os_ring_buffers to get information related to the application domain state machine changes.  I found the domains were transitioning from creating to unloaded in within a few milliseconds.

Ring Buffer: RING_BUFFER_CLRAPPDOMAIN

image

Ring Buffer: RING_BUFFER_EXCEPTION

SQL Server handles the vast majority of error conditions by throwing the custom C++ exception type (SQL Exception). The exception holds a major and minor error code along with other information associated with the condition. You can use the formula (Major * 100) + (minor) to build the SQL Server error code or message_id as provided in sys.messages .

For a query cancellation this is an internal error (3617) or Major = 36 and Minor = 17.

image

The stack associated with the cancellation is the following, during the initialization of the application domain.

sqlservr!clr_ex_raise   <------- Major = 36, Minor = 17
sqlservr!CAppDomain::CreateManagedDomain
sqlservr!CAppDomain::InitExpensive
sqlservr!CAppDomainManager::GetAppDomain
sqlservr!CCLRHost::GetAppDomain
sqlservr!CAssemblyMetaInfo::GetAppDomainForVerification
sqlservr!CAssemblyMetaInfo::CreateClrInterfaces
sqlservr!CAssemblyMetaInfo::InitClrInterfaces
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase

Associated with the CLR_APPDOMAIN, ring buffer are stack frames and I was able to see where the unload was getting triggered.

sqlservr!AppDomainRingBufferRecord::StoreRecord+0x9c
sqlservr!CAppDomain::StateTransition+0xcf
sqlservr!CAppDomainManager::AppDomainStateTransitionLockHeld+0xa5
sqlservr!CAppDomainManager::AppDomainStateTransition+0x30
sqlservr!CAppDomain::UnloadManaged+0x22
sqlservr!CAppDomain::Release+0xd3
sqlservr!CAutoRefc<CAppDomain>::~CAutoRefc<CAppDomain>+0x914a3f
sqlservr!CAssemblyMetaInfo::CreateClrInterfaces+0x135
sqlservr!CAssemblyMetaInfo::InitClrInterfaces+0x7b
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase+0x144
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase+0xa2
sqlservr!ResolveUdf+0x2c4
sqlservr!CAlgUtils::TrpGetExpressionPropsAlg+0x523
sqlservr!CAlgUtils::TrpGetExpressionPropsWithHandler+0x112
sqlservr!udf::FBindObject+0x5d
sqlservr!udf::FBind+0x2b3

A few key aspects of this stack of note.  

  • FBind – Used during compile so we are still compiling the query, not an execution/runtime issue so we can use a database clone to reproduce the problem.
  • LoadAssemblyFromDatabase – Loading the assembly which can create the application domain and in this case is doing just that
  • CreateClrInterfaces – Doing the initialization work – (We have not printed the AppDomain loaded message yet)
  • UnloadManaged – Triggers an async application domain unload.  It is important to note this is an async operation and the actual unload will occur on a system thread and exactly why the unloaded messages are occurring on a system (s) thread.

Now I needed to exercise this code path to validate my findings.  I know if I restarted SQL Server the application domain would not be loaded and the plan would not be in procedure cache.

A coworker of mine (BillHol) assisted in creating a simple function that causes the Microsoft.SqlServer.SqlGeography assembly to be loaded under the mssqlserversystem.dbo application domain.  (In SQL 2012 this loads under master.dbo.)

use tempdb
go

CREATE FUNCTION dbo.funcBufferGeography(@p1 geography, @p2 float)
RETURNS geography
AS
BEGIN
       DECLARE @g geography;
       DECLARE @distance float;
       SELECT @g = @p1.STAsText();
      
       RETURN (@g);
END;
GO

I then went to the debugger and set a breakpoint.  I wanted to create a simulated stall during the interface creation to see if I could trigger a query cancellation (attention) during this phase of the compile and would it result in a reproduction of the pattern.  The breakpoint is just causing a break to wait 1.5 sec and then continue.

bp sqlservr!CAssemblyMetaInfo::CreateClrInterfaces ".sleep 1500;g"

I used OSTRESS, from the RML toolkit, to execute the function; providing a query timeout of 1 second so while the debugger has the SQL Server process stopped the client query timeout can predictably occur.

ostress -dtempdb -E -S.\sql2008 -Q"DECLARE @garg geography = 'LINESTRING(3 4, 8 11)'; SELECT dbo.funcBufferGeography(@garg, 1.1)" -oc:\temp\breakout –t1

Sure enough I was able to reproduce the behavior in the error log of repeated, unloaded messages without the matching loaded paring.  In the debugger you can also see the C++ exceptions being thrown for the 3617 SQL Exception as well.

I can’t/don’t want to hook up a debugger to production!

It is not practical to hook up the debugger on the production instance but once we understand the pattern it is easy to see from a simple SQL Server trace and the ring buffer entries (previously shown.)

image

  • The attention (internal 3617 C++ SQL Exception) is always logged after the BatchCompleted event and the Attention event.  As you can see each BatchCompleted is followed by an attention. 
  • There is never an error log message for AppDomain loading.
  • The unloading occurs on SPID=31 and it is marked (IsSystem = 1) maching the output in the error log of 31(s)

Garbage Collection (GC) and CLR_* Wait Types

While loading, initializing or unloading an application domain the SQL Server prevents additional CLR activity against the same application domain.  This results in the CLR_* wait types as you might expect; only one thread can load the application domain.  This is no different than loading a DLL and the operating system maintaining the process, loader lock (CriticalSection) during the image load and resolution processing.

In this customers case the mini-dump and sys.dm_exec_requests output revealed 34 additional waiters on the geography data type (mssqlsystemresource.dbo app domain.)

GC/Convoy: The GC activity is a side effect but helped cause a convoy on this customers system.  When the domain is unloaded the CLR runtime forces a garbage collection (GC) across all generations to make sure anything related to the domain has been properly cleaned up. During GC (usually) all CLR activity is suspended no matter what application domain. 

Here is what the convey is doing

    1. SPID 50 – Attempted to load application domain, failed and is unloading and performing GC activity.
    2. SPID 51, 52, 53 … 80 are all waiting on the SQL CLR_* protection object and can’t advance until SPID 50 is able to load and initialize or unload and cleanup the application domain.
    3. The time it takes for 50 to complete the unload/GC causes SPID 51, 52, 53, … 80 to timeout.
    4. SPID 50 completes and SPID 51 tries to load the application domain.   The query cancellation is already queued so 51 detects this during initialization and issues the application domain unload again.
    5. SPID 51 completes and SPID 52 tries to load the application domain. …… you get the idea…..

The convey will flood the error log with application domain unload messages and no real work is getting done by the sessions.  The server encounters stop and go behavior as it attempts to load the domain, allow CLR workers to execute, unload the domain and suspend CLR worker activities … repeating the behavior over and over again. 

Solving the problem

The problem is no different than any another other resource bottleneck troubleshooting.  Capturing traces, performance monitor logs and other outputs and tracking down why the original bottleneck occurred.  In this case “Why is the system taking so long to load the assembly or is the query timeout improperly set to something tiny?” 

 

=============== March 4, 2013 Update – RDORR

THE REAL ANSWER

Too many other blogs treated the base error as memory pressure and generally recommended the CLR Enabled setting.

I finally got to the bottom of this specific message and a reproduction.   As you can see the message is NOT stating this is unloaded due to memory pressure or exception it is a different pattern.

2013-01-22 08:07:44.91 spid37s     AppDomain 890288 (mssqlsystemresource.dbo[ddl].890287) unloaded.      

This issue is limited to mssqlsystemresource database.  Stored the system resource database are the base assemblies for SQL Types, Replication CDC, Policy Based Management, MPU management.   (DMFSqlClrWrapper, MPUSqlClrWrapper, asm_tranrepl, asm_tablemaintenance, ...)    You can use the lm or sys.loaded_modules to see which of these may be loaded on your system.

clip_image002[6]

When the sp_configure 'CLR Enabled' is set to 0, only the Microsoft.SqlServer.Types is considered a system type and cached.  All other assemblies will be loaded and unloaded to use/verify them.   This is why you see the flooding of GC activity and messages in the log when you use PBM, MPU, CDC replication, .... and you have CLR Enabled = 0 instead of 1.   When you set the value to 1 (enabled) the assemblies are cached and you avoid the pattern of loading and unloading them.

When disabled the application domain ring buffer will show you (Create, Single User, Unload …) in very close proximity.

The behavior does not prevent execution of these assemblies just the caching capabilities and because they are not cached global GC activity is triggered for every unload

Simple PBM Based Repro
1. Disable sp_configure CLR Enabled   (if you enable this it will use the cached domain and won’t flood the log and repeatedly force global GC operations)

  sp_configure 'clr', 0
  reconfigure

2. Create a Policy (PBM) with a policy of (On change – prevent) so it will use the DMF inside the SQL Server engine.  I created a simple rule that stored procedure names must start with usp%.

• On change: prevent. This automated mode uses DDL triggers to prevent policy violations.

Important
If the nested triggers server configuration option is disabled, On change: prevent will not work correctly. Policy-Based Management relies on DDL triggers to detect and roll back DDL operations that do not comply with policies that use this evaluation mode. Removing the Policy-Based Management DDL triggers or disabling nest triggers, will cause this evaluation mode to fail or perform unexpectedly.

-- Which facets allow On Change: prevent?

select * from msdb..syspolicy_management_facets
  where 1 = execution_mode & 1
  order by name

Create the following policy for procedure naming

Declare @object_set_id int
EXEC msdb.dbo.sp_syspolicy_add_object_set @object_set_name=N'ProcedureName_ObjectSet', @facet=N'StoredProcedure', @object_set_id=@object_set_id OUTPUT
Select @object_set_id

Declare @target_set_id int
EXEC msdb.dbo.sp_syspolicy_add_target_set @object_set_name=N'ProcedureName_ObjectSet', @type_skeleton=N'Server/Database/StoredProcedure', @type=N'PROCEDURE', @enabled=True, @target_set_id=@target_set_id OUTPUT
Select @target_set_id

EXEC msdb.dbo.sp_syspolicy_add_target_set_level @target_set_id=@target_set_id, @type_skeleton=N'Server/Database/StoredProcedure', @level_name=N'StoredProcedure', @condition_name=N'', @target_set_level_id=0
EXEC msdb.dbo.sp_syspolicy_add_target_set_level @target_set_id=@target_set_id, @type_skeleton=N'Server/Database', @level_name=N'Database', @condition_name=N'', @target_set_level_id=0
GO

Declare @policy_id int
EXEC msdb.dbo.sp_syspolicy_add_policy @name=N'ProcedureName', @condition_name=N'ProcedureName', @policy_category=N'', @description=N'', @help_text=N'', @help_link=N'', @schedule_uid=N'00000000-0000-0000-0000-000000000000', @execution_mode=1, @is_enabled=True, @policy_id=@policy_id OUTPUT, @root_condition_name=N'', @object_set=N'ProcedureName_ObjectSet'
Select @policy_id
GO

3. Run the following statement

use dbTest
go

dbcc freeproccache
go

create procedure spTest
as
begin
       select 1
end
go

 

Bob Dorr - Principal SQL Server Escalation Engineer

Comments

  • Anonymous
    March 02, 2015
    We have a situation where appdomain unload occurs even with user defined assemblies with clr enabled = 1. How do we explain this behavior and what are the likely solutions?