How can I find out why the Cluster Resource Monitor dumped? – Access Violation

The purpose of this posting is to provide information about the Windows 2003 Cluster Resource Monitor, along with what happens when it crashes as well as how to debug it to find out why it crashed. But first, we need you to understand what the Cluster Resource Monitor is and does.
This information below, is taken from the Microsoft MSDN site which describes the Cluster Resource Monitor.

  • A Resource Monitor provides a communication, monitoring, and processing layer between the Cluster service and one or more resources. Resource Monitors have the following characteristics:
  • A Resource Monitor always runs in a process separate from the Cluster service. If a resource fails, the Resource Monitor isolates the Cluster service from the effects. If the Cluster service fails, the Resource Monitor allows its resources to shut down gracefully.
  • To work with a resource, a Resource Monitor loads the resource DLL responsible for that resource type into its process.
  • When the Cluster service requests an operation on a resource, the Resource Monitor routes the request to the appropriate entry point function of the resource DLL responsible for the resource. The Resource Monitor performs default processing for some resource operations.
  • A Resource Monitor stores synchronized state data, allowing the Cluster service and resource DLLs to operate asynchronously, checking and updating resource status as needed.
  • A Resource Monitor periodically checks the operational status of all of its resources. For more information on this process, see Resource Failure.

By default, the Cluster service creates one Resource Monitor per node.

As the article states, all cluster resources currently running on one node is in a Resource Monitor Process.  Prior to Windows Server 2003 Service Pack 2, when a Resource Monitor Process crashed it would generate a Mini Memory Dump. A fix include in Service Pack 2 now allows it to generate a Kernel Memory Dump when the Resource Monitor Process crashes. If the Resource Monitor crashes, the system will dump the Resource Monitor Process to a file called RESRCMON.DMP, and create a new instance of the process.  Because it must create a new one, all resources in the monitor are gone and needs to be restarted.  When this occurs, you would see the following entry in the Windows System Event Log.

Event ID:     1146
Source:       ClusSvc
Description:  The cluster resource monitor died unexpectedly, an attempt will be made to restart it

After this, you could also see other resource failures (Event ID: 1069) as well as disk related events such as Lost Delayed Writes, etc.  You would see the disk related events because the disk(s) would be considered down and since there is data in the cache of the HBA, it has nowhere to write it.  Hence, lost delayed writes exist until the disk is brought back online.  For our examples here, we will ignore these disk related events as we will focus on why the Resource Monitor crashed.

There are a couple reasons why a Resource Monitor would crash such as an Access Violation (0xC0000005) or a Deadlock (0xC0000194).  For this blog, we will talk about the Access Violation (0xC0000005).  An Access Violation will occur because a resource tried to do something it wasn’t supposed to or it is having an issue starting up.

Along with the above System Event (Event ID: 1146) where the Resource Monitor died, you will see this in the Cluster Log file.

NOTE:
The Cluster Log will convert times to Greenwich Mean Time (GMT), so you must ensure you do the proper GMT conversion of time to get to the location in the Cluster Log.

00001d6c.00001b60::2008/03/04-05:28:46.114 ERR  [RM] Exception. Code = 0xc0000005, Address = 0x781449D1
00001d6c.00001b60::2008/03/04-05:28:46.114 ERR  [RM] Exception parameters: 0, 0, 1003f, 0
00001d6c.00001b60::2008/03/04-05:28:46.114 INFO [RM] GenerateMemoryDump: Start memory dump to file C:WINDOWSClusterresrcmon.dmp

Now that we see this entry in the log, we should take a look at the Resource Monitor dump to see what caused the failure.  The first thing to examine is the register states, specifically the ESP (stack pointer) value.

0:023> r
eax=01bf0000 ebx=000f7b88 ecx=00000007 edx=7c8285ec esi=000f7b60 edi=000f7bb8
eip=7c8285ec esp=01aed598 ebp=01aed5a8 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
ntdll!KiFastSystemCallRet:
7c8285ec c3              ret

Starting at the stack pointer address 01aed598, we use the dds command to dump the raw stack. We are looking for the value on the stack just below the routine resrcmon!GenerateMemoryDump. It will take several iterations of the dds command to finally get to the value because the call was made much earlier in the stack.

0:023> dds 01aed598
01aed598  00740061 xpsp2res.dll
xpsp2res+0x100061
01aed59c  00720075 xpsp2res+0xe0075
01aed5a0  00730065 xpsp2res+0xf0065
01aed5a4  00610000
*** pages removed ***
01aedddc  0026afd8
01aedde0  0026af28
01aedde4  01aee034
01aedde8  0100e638 resrcmon!GenerateMemoryDump+0x180
01aeddec  ffffffff
01aeddf0  00001d6c
01aeddf4  00000018

The highlighted above is the pointer to the Exception address stack.  Now that we have our value, we will use the kv command with the value 01aee034 to dump out the stack contents.

0:023> kv=01aee034
ChildEBP RetAddr  Args to Child             
01aed628 7c826d2b 77e63eb3 000004e0 00080178 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
01aed62c 77e63eb3 000004e0 00080178 6d5b5af6 ntdll!ZwClose+0xc (FPO: [1,0,0])
01aee034 0100e989 01aee300 01003024 00000000 kernel32!CloseHandle+0x59 (FPO: [Non-Fpo])
01aee04c 01008b2c 01aee300 01003024 01aee300 resrcmon!GenerateExceptionReport+0x7e (FPO: [Non-Fpo])
01aee060 76348d17 01aee300 01aee300 01aee080 resrcmon!RmpExceptionFilter+0x14 (FPO: [Non-Fpo])
01aee070 7786d6d2 01aee300 77ecb7c0 01aee2d8 netshell!__CxxUnhandledExceptionFilter+0x4a (FPO: [Non-Fpo])
01aee080 77e761b7 01aee300 00000000 00000000 netman!__CxxUnhandledExceptionFilter+0x4a (FPO: [Non-Fpo])
01aee2d8 77e792a3 01aee300 77e61ac1 01aee308 kernel32!UnhandledExceptionFilter+0x12a (FPO: [Non-Fpo])
01aee2e0 77e61ac1 01aee308 00000000 01aee308 kernel32!BaseThreadStart+0x4a (FPO: [SEH])
01aee308 7c828752 01aee3ec 01aeffdc 01aee408 kernel32!_except_handler3+0x61 (FPO: [Uses EBP] [3,0,7])
01aee32c 7c828723 01aee3ec 01aeffdc 01aee408 ntdll!ExecuteHandler2+0x26
01aee3d4 7c82855e 01ace000 01aee408 01aee3ec ntdll!ExecuteHandler+0x24
01aee3d4 781449d1 01ace000 01aee408 01aee3ec ntdll!KiUserExceptionDispatcher+0xe (FPO: [2,0,0]) (CONTEXT @ 01aee408)
01aee6d0 10006d11 00000000 00f16914 01aeff58 msvcr80!wcslen+0x4 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
01aee6f8 10001364 100096e4 00000000 744eecf8 JohnApp!Startup+0x5851
01aeff78 781329bb 0009a6a8 75b03b60 00000000 JohnApp+0x1364
01aeffb0 78132a47 00000000 77e64829 00f15d30 msvcr80!_endthreadex+0x3b (FPO: [Non-Fpo])
01aeffb8 77e64829 00f15d30 00000000 00000000 msvcr80!_endthreadex+0xc7 (FPO: [Non-Fpo])
01aeffec 00000000 781329e1 00f15d30 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

Based on the stack above, we have an exception at address 0x01aee300 which we will use to set the failing context.

0:023> dc 0x01aee300
01aee300  01aee3ec 01aee408 01aee32c 7c828752  ……..,…R..|
01aee310  01aee3ec 01aeffdc 01aee408 01aee3c8  …………….
01aee320  01aeff6c 7c828766 01aeffdc 01aee3d4  l…f..|……..
01aee330  7c828723 01aee3ec 01aeffdc 01aee408  #..|…………
01aee340  01aee3c8 77e61a60 00000000 01aee3ec  ….`..w……..
01aee350  01aeffdc 7c8315c2 01aee3ec 01aeffdc  …….|……..
01aee360  01aee408 01aee3c8 77e61a60 00000000  ……..`..w….
01aee370  01aee3ec 00000000 00000000 00000000  …………….

The first DWORD is the Exception Record (0x01aee3ec) and the second DWORD is the Context Record (0x01aee408) that holds our true stack where the problem occurred.  From the Exception Record, we can see it is an Access Violation.

0:023> .exr 0x01aee3ec
ExceptionAddress: 781449d1 (msvcr80!wcslen+0x00000004)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 00000000

So we need to jump into the saved context to get the thread that caused the Resource Monitor to crash.

0:023> .cxr 0x01aee408
eax=00000000 ebx=00f15d30 ecx=00000000 edx=00000000 esi=00000000 edi=00000000
eip=781449d1 esp=01aee6d4 ebp=01aee6f8 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202
msvcr80!wcslen+0x4:
781449d1 668b08          mov     cx,word ptr [eax]        ds:0023:00000000=????

0:023> kv
  *** Stack trace for last set context – .thread/.cxr resets it
ChildEBP RetAddr  Args to Child             
01aee6d0 10006d11 00000000 00f16914 01aeff58 msvcr80!wcslen+0x4 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
01aee6f8 10001364 100096e4 00000000 744eecf8 JohnApp!Startup+0x5851
01aeff78 781329bb 0009a6a8 75b03b60 00000000 JohnApp+0x1364
01aeffb0 78132a47 00000000 77e64829 00f15d30 msvcr80!_endthreadex+0x3b (FPO: [Non-Fpo])
01aeffb8 77e64829 00f15d30 00000000 00000000 msvcr80!_endthreadex+0xc7 (FPO: [Non-Fpo])
01aeffec 00000000 781329e1 00f15d30 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

This stack reveals that JohnApp’s .DLL was in the process of a calling one of its threads to do something and caused this problem.  Now we can find out what the specific resource (in case there are multiple) that caused the problem.  In the case of an Access Violation dump, it is going to be a resource that failed or is in the process of coming online.  You can do a search through the threads using ~*kb to find a current resource trying to startup during a resrcmon termination.

0:023> ~*kb
  10  Id: 1d6c.1394 Suspend: 0 Teb: 7fff4000 Unfrozen
ChildEBP RetAddr  Args to Child             
0179f78c 7c827d0b 77e61d1e 00000464 00000000 ntdll!KiFastSystemCallRet
0179f790 77e61d1e 00000464 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0179f800 77e61c8d 00000464 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
0179f814 10002728 00000464 ffffffff 77e61d48 kernel32!WaitForSingleObject+0x12
WARNING: Stack unwind information not available. Following frames may be wrong.
0179f838 0100a352 0009a6a8 0179f888 0100864d JohnApp!Startup+0x1268
0179f844 0100864d 000a88a0 0179f8a0 0179fa94 resrcmon!Resmon_Terminate+0x14
0179f888 01009c3b 000a88a0 00000000 0179f8a4 resrcmon!RmpOfflineResource+0x2f1
0179f89c 77c80193 00000003 02460246 00000001 resrcmon!s_RmTerminateResource+0x13
0179f8b4 77ce33e1 01009c28 0179fa98 00000001 rpcrt4!Invoke+0x30
0179fcb4 77ce35c4 00000000 00000000 000ef734 rpcrt4!NdrStubCall2+0x299
0179fcd0 77c7ff7a 000ef734 0008f980 000ef734 rpcrt4!NdrServerCall2+0x19
0179fd04 77c8042d 0100c24c 000ef734 0179fdec rpcrt4!DispatchToStubInCNoAvrf+0x38
0179fd58 77c80353 00000005 00000000 01011458 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x11f
0179fd7c 77c7e0d4 000ef734 00000000 01011458 rpcrt4!RPC_INTERFACE::DispatchToStub+0xa3
0179fdbc 77c7e080 000ef734 000ef6ec 00000000 rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0xc0
0179fdfc 77c812f0 0008fb38 00083498 000efcf8 rpcrt4!LRPC_SCALL::DealWithRequestMessage+0x41e
0179fe20 77c88678 000834d0 0179fe38 0008fb38 rpcrt4!LRPC_ADDRESS::DealWithLRPCRequest+0x127
0179ff84 77c88792 0179ffac 77c8872d 00083498 rpcrt4!LRPC_ADDRESS::ReceiveLotsaCalls+0x430
0179ff8c 77c8872d 00083498 00000000 00000000 rpcrt4!RecvLotsaCallsWrapper+0xd
0179ffac 77c7b110 00086890 0179ffec 77e64829 rpcrt4!BaseCachedThreadRoutine+0x9d

With the above information, and focusing on the resrcmon!RmpOfflineResource frame, the parameter 0x000a88a0 is our resource.  Using the dc command, you can confirm that this is the resource.

0:023> dc 000a88a0
000a88a0  63727352 00000001 000a7d90 000a9290  Rsrc…..}……
000a88b0  000b37f0 000ad130 000a8318 000ad0f0  .7..0………..
000a88c0  00001388 0000ea60 10000000 0009a6a8  ….`………..
000a88d0  00000000 00000000 00000000 00000000  …………….
000a88e0  00000000 00000001 10001540 10001910  ……..@…….
000a88f0  10001a00 10002280 100026b0 10002810  …..”…&…(..
000a8900  100027c0 00000000 00000000 100028a0  .’………..(..
000a8910  100029e0 00000003 00000000 0000000c  .)…………..

The DWORDS at offset 10, 14, 18, and 1C will reveal the information to confirm the resource such as its GUID in the registry, the .DLL being used, etc.

0:005> du 0x000ad0f0         <<– Resource
000ad0f0  “Johns Resource”
0:005> du 0x000a8318         <<– GUID 
000a8318  “35a73cba-6096-485e-a227-d4a8d06f”
000a8358  “680a”
0:005> du 0x000ad130        <<– Resource Type
000ad130  “Johns Customer Resource Type”
0:005> du 0x000b37f0         <<– DLL being Used
000b37f0  “johnapp.dll”

Now that you know the resource generating the access violation, you should consult with the vendor of this resource to find out what had happened.  It’s possible they can look at the threads with their symbols or get other information from the event logs, any dump it may create, etc.  It could be the resource has a known problem where an update is needed.

The above steps will take you to where you want and I wanted to show those steps first. In most cases, but not all, when the dump is first opened it could show the exception and the stack that we found above by simply entering the .ecxr command…

0:023> .ecxr
eax=00000000 ebx=00f15d30 ecx=00000000 edx=00000000 esi=00000000 edi=00000000
eip=781449d1 esp=01aee6d4 ebp=01aee6f8 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202
msvcr80!wcslen+0x4:
781449d1 668b08          mov     cx,word ptr [eax]        ds:0023:00000000=????

0:023> kb
  *** Stack trace for last set context – .thread/.cxr resets it
ChildEBP RetAddr  Args to Child             
01aee6d0 10006d11 00000000 00f16914 01aeff58 msvcr80!wcslen+0x4
WARNING: Stack unwind information not available. Following frames may be wrong.
01aee6f8 10001364 100096e4 00000000 744eecf8 JohnApp!Startup+0x5851
01aeff78 781329bb 0009a6a8 75b03b60 00000000 JohnApp+0x1364
01aeffb0 78132a47 00000000 77e64829 00f15d30 msvcr80!_callthreadstartex+0x1b
01aeffb8 77e64829 00f15d30 00000000 00000000 msvcr80!_threadstartex+0x66
01aeffec 00000000 781329e1 00f15d30 00000000 kernel32!BaseThreadStart+0x34

You could also get to the same information using the above original steps (dds 01aed598) but stopping at the resrcmon!RmpExceptionFilter (Resource Monitor handles the exception) which has the exception as the first parameter.

0:023> dds 01aed598
01aed598  00740061 xpsp2res.dll
xpsp2res+0x100061
01aed59c  00720075 xpsp2res+0xe0075
01aed5a0  00730065 xpsp2res+0xf0065
01aed5a4  00610000
*** pages removed ***
01aedddc  0026afd8
01aedde0  0026af28
01aedde4  01aee034 
01aedde8  0100e638 resrcmon!GenerateMemoryDump+0x180
01aeddec  ffffffff
01aeddf0  00001d6c
01aeddf4  00000018
*** pages removed ***
01aee000  01005528 resrcmon!`string’+0xc
01aee004  ffffffff
01aee008  0100d27b resrcmon!ClRtlLogPrint+0x499
01aee00c  0100e96c resrcmon!GenerateExceptionReport+0x61
01aee010  00000001
01aee014  01005bf4 resrcmon!`string’
*** pages removed ***
01aee048  7786d687 netman!__CxxUnhandledExceptionFilter
01aee04c  01aee060 
01aee050  01008b2c resrcmon!RmpExceptionFilter+0x14
01aee054  01aee300
01aee058  01003024 resrcmon!`string’
01aee05c  01aee300

From the above:
01aee034 is the pointer to the exception stack
01aee060 is frame 3 in the kv=01aee034 further above
01aee300 is frame 3 in the kv=01aee034 further above