Case Study: Debug distributed application using packet analyzer
Packet analyzers are powerful tools to troubleshoot remote communications between distributed applications. Microsoft are creating various new frameworks to lower the bar/cost for the developers to build distributed applications in Windows ecosystem, while those new technologies introduces additional layers and abstractions on top of existing APIs. And sometimes, if certain error happens in the lower stacks, the error information could be easily lost/translated by layers of wrapper codes in higher stacks, and this could make debugging such error extremely difficult. By using packet analyzers, one can see the raw information being transferred over the wire; therefore, this bottom-up approach could give developers a better understanding of what actually is going on first, and then, knowing what it is would make debugging in higher level more efficient later on.
The Bug:
The following C# code segment is to obtain the hardware plug-and-play information from a remote machine:
ConnectionOptions options = new ConnectionOptions(
null,
"administrator",
"!!123abc",
"NTLMDOMAIN:10.177.17.6",
ImpersonationLevel.Impersonate,
AuthenticationLevel.PacketPrivacy,
true,
null,
TimeSpan.MaxValue);
ManagementScope scope = new ManagementScope(
@"\\10.177.17.6\root\cimv2",
options);
scope.Connect();
ManagementClass targetClass = new ManagementClass(
scope,
new ManagementPath("Win32_PnPEntity"),
null);
ManagementObjectCollection collection = targetClass.GetInstances();
foreach (ManagementBaseObject wmiObject in collection)
{
...
}
The above codes use WMI .NET with 3.5 SP1 runtime. It connect to the target machine (10.177.17.6) using that machine’s local administrator account, and it also specifies NTLM as the authentication protocol (NTLMDOMAIN). After the connection is established, it remotely enumerates all the instances from WMI Win32_PnPEntity class.
This seemingly simple code works fine in most of time; however, in one particular situation: if the target machine is running under Windows Preinstallation Environment 3.0 , the code throws an COMException at some random time during the foreach loop. And the exception code can be either RPC_S_CALL_FAILED_DNE (0x800706BF The remote procedure call failed) or RPC_S_CALL_FAILED (0x800706BE The remote procedure call failed and did not execute).
The Debugging:
Firstly, some knowledge about how Microsoft technologies are layered can help to understand why the exception talks about RPC failure and why it is the type of COMException:
The WMI .NET is just a wrapper above native WMI framework, and the remote communication feature of WMI is built on top of DCOM , and then again, DCOM utilize MSRPC as its inter-process communication infrastructure. So this exception is basically trying to tell the developers that, there is an error on the RPC level detected by DCOM stack, and then that error is reported to the .NET framework layer, and lastly .NET framework raises that error as COMException back to the user code.
If Windbg is attached to the codes, the output shows that error 0x80070721 and 0x80070005 always happened before that COMException
(dd8.a6c): Unknown exception - code 80070721 (first chance)
(dd8.a6c): Unknown exception - code 80070005 (first chance)
(dd8.a6c): Unknown exception - code 800706bf (first chance)
0x8007005 is access denied, and 0x80070721 is less known, use !error command
0:000> !error 0x80070721
Error code: (HRESULT) 0x80070721 (2147944225) - A security package specific error occurred.
It looks like some security related problem has happened right before the client codes failed. That’s all the information there, and nothing furthermore can be investigated in the debugger (unless some serious reverse engineering are done, which is beyond this case study). It seems the debugging is getting into a dead-end, and now it is the time to start the bottom-up approach.
Note: Another possible approach is to setup the debugger to use symbols from Microsoft Reference Source Server, which would provide source codes access for .NET framework.
Use any packet analyzers (such as WireShark or Microsoft Network Monitor ) and start a capture session, and repro this bug again, then save the capture file. The content of the capture file is shown below
No. Time Source Destination Protocol Info
406 5.078086 10.177.17.15 10.177.17.6 DCERPC Bind: call_id: 3 ISystemActivator V0.0, NTLMSSP_NEGOTIATE
407 5.078721 10.177.17.6 10.177.17.15 DCERPC Bind_ack: call_id: 3, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840
408 5.078987 10.177.17.15 10.177.17.6 DCERPC AUTH3: call_id: 3, NTLMSSP_AUTH, User: 10.177.17.6\administrator
At the beginning of our capture file, the RPC layer established TCP connection and started to authenticate with the remote machine (10.177.17.6). The first 3 frames (highlighted in gray) are the standard 3-leg NTLM authentication protocol, NEGOTIATE –> CHALLENGE –> AUTH, see [PWS00] chapter 7 for details. As recorded in frame 408 (AUTH), the credential used for authentication is the local administrator account on the target machine. Everything is as expected.
409 5.079045 10.177.17.15 10.177.17.6 ISystemActivator RemoteCreateInstance request[Malformed Packet]
411 5.081677 10.177.17.6 10.177.17.15 ISystemActivator RemoteCreateInstance response[Malformed Packet]
...
517 5.334018 10.177.17.15 10.177.17.6 DCERPC Request: call_id: 7 opnum: 6 ctx_id: 0 IEnumWbemClassObject V0
518 5.334248 10.177.17.6 10.177.17.15 TCP 49157 > 59916 [ACK] Seq=221 Ack=789 Win=130560 Len=0
519 5.334574 10.177.17.6 10.177.17.15 DCERPC Response: call_id: 7 ctx_id: 0 IEnumWbemClassObject V0
...
After the initial authentication, the RPC conversation is established, and the DCOM starts to do its own stuff, such as remote activation (frame 409 - 411), reference count keeping, object lifecycle management etc., and then it is followed by WMI communication, such as IEnumWbemClassObject::Clone (frame 517 – 519, opnum 6 is Clone) and etc.. Most of those packages are about standard operations and have no interests to this bug, so they are not shown in this blog.
Note: see [EC98] and [MS-WMI] for more information about DCOM and WMI
Then, what happened next is most interesting
520 5.334778 10.177.17.15 10.177.17.6 DCERPC Alter_context: call_id: 8, 2 context items, 1st IRemUnknown2 V0.0, NTLMSSP_NEGOTIATE
521 5.334891 10.177.17.6 10.177.17.15 DCERPC Alter_context_resp: call_id: 8, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840
522 5.335016 10.177.17.15 10.177.17.6 DCERPC AUTH3: call_id: 8, NTLMSSP_AUTH, User: REDMOND\jiangyue
523 5.335047 10.177.17.15 10.177.17.6 IRemUnknown2 RemQueryInterface request IID[1]=IManagedObject
524 5.335231 10.177.17.6 10.177.17.15 TCP 49157 > 59916 [ACK] Seq=821 Ack=1543 Win=129792 Len=0
525 5.335313 10.177.17.6 10.177.17.15 DCERPC Fault: call_id: 8 ctx_id: 1 status: Unknown (0x00000721)
526 5.335379 10.177.17.15 10.177.17.6 IRemUnknown2 RemQueryInterface request IID[1]=IProvideClassInfo
527 5.335574 10.177.17.6 10.177.17.15 DCERPC Fault: call_id: 9 ctx_id: 1 status: nca_s_fault_access_denied
528 5.335574 10.177.17.6 10.177.17.15 TCP 49157 > 59916 [FIN, ACK] Seq=885 Ack=1687 Win=131328 Len=0
529 5.335600 10.177.17.15 10.177.17.6 TCP 59916 > 49157 [ACK] Seq=1687 Ack=886 Win=130304 Len=0
530 5.336227 10.177.17.15 10.177.17.6 TCP 59916 > 49157 [FIN, ACK] Seq=1687 Ack=886 Win=130304 Len=0
531 5.336356 10.177.17.6 10.177.17.15 TCP 49157 > 59916 [ACK] Seq=886 Ack=1688 Win=131328 Len=0
As highlighted by yellow color, RPC decided to re-authenticate with target machine using a different set of credential redmond\jiangyue (frame 520 - 522), and this credential is belong to the security token of current calling thread. However, redmond\jiangyue is not a valid credential on the target machine. Therefore, when the client machine called QueryInterface on the remote object for IManagedObject, the RPC layer on the remote machine responded with Fault message containing error code 0x00000721 (frame 525). Then again, the client tried to make another request for IProviderClassInfo to remote PRC layer, this time it responded with another Fault message containing error code 0x00000005 (access_denied). Those two errors matched the two exceptions seen by Windbg earlier on. After received two consecutive failed attempt, the RPC layer on the remote machine decided to send out the FIN message and close the current TCP connection, as indicated in frame 528 – 531.
532 5.336862 10.177.17.15 10.177.17.6 TCP 59917 > 49157 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 WS=8 SACK_PERM=1
533 5.336943 10.177.17.6 10.177.17.15 TCP 49157 > 59917 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=8 SACK_PERM=1
534 5.337055 10.177.17.15 10.177.17.6 TCP 59917 > 49157 [ACK] Seq=1 Ack=1 Win=131328 Len=0
535 5.337389 10.177.17.15 10.177.17.6 DCERPC Bind: call_id: 11, 2 context items, 1st IEnumWbemClassObject V0.0, NTLMSSP_NEGOTIATE
536 5.337595 10.177.17.6 10.177.17.15 DCERPC Bind_ack: call_id: 11, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840
537 5.337729 10.177.17.15 10.177.17.6 DCERPC AUTH3: call_id: 11, NTLMSSP_AUTH, User: 10.177.17.6\administrator
538 5.337755 10.177.17.15 10.177.17.6 DCERPC Request: call_id: 11 opnum: 3 ctx_id: 0 IEnumWbemClassObject V0
539 5.337971 10.177.17.6 10.177.17.15 TCP 49157 > 59917 [ACK] Seq=221 Ack=789 Win=130560 Len=0
540 5.338106 10.177.17.6 10.177.17.15 DCERPC Response: call_id: 11 ctx_id: 0 IEnumWbemClassObject V0
541 5.338385 10.177.17.15 10.177.17.6 DCERPC Alter_context: call_id: 12, 2 context items, 1st IRemUnknown2 V0.0
542 5.338450 10.177.17.6 10.177.17.15 DCERPC Alter_context_resp: call_id: 12 Provider rejection, reason: Proposed transfer syntaxes not supported
543 5.338491 10.177.17.15 10.177.17.6 IRemUnknown2 RemQueryInterface request
544 5.338605 10.177.17.6 10.177.17.15 IRemUnknown2 RemQueryInterface response
The client seemed ignore all the security errors and continue its execution. The client made a new TCP connection to the remote machine by sending a new SYN message, as indicated in frame 532 – 534. And this time, it uses the right credential 10.177.17.6\administrator so it was enable to make some progress, until again the RPC layer decided to re-authenticate with a wrong credential redmond\jiangyue to query IManagedObject interface. This pattern repeated itself again and again in the captured file.
The client was struggling to move forward until it reached the foreach loop, where it needed to call MoveNext() on the enumerator repeatedly.
745 5.338889 10.177.17.15 10.177.17.6 DCERPC Alter_context: call_id: 13 IWbemFetchSmartEnum V0.0, NTLMSSP_NEGOTIATE
746 5.338996 10.177.17.6 10.177.17.15 DCERPC Alter_context_resp: call_id: 13, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840
747 5.339116 10.177.17.15 10.177.17.6 DCERPC AUTH3: call_id: 13, NTLMSSP_AUTH, User: REDMOND\jiangyue
748 5.339146 10.177.17.15 10.177.17.6 DCERPC Request: call_id: 13 opnum: 3 ctx_id: 3 IWbemFetchSmartEnum V0
749 5.339337 10.177.17.6 10.177.17.15 TCP 49157 > 59917 [ACK] Seq=685 Ack=1759 Win=131328 Len=0
750 5.339415 10.177.17.6 10.177.17.15 DCERPC Fault: call_id: 13 ctx_id: 3 status: Unknown (0x00000721)
756 5.339866 10.177.17.6 10.177.17.15 DCERPC Response: call_id: 14 ctx_id: 0 IEnumWbemClassObject V0
757 5.339887 10.177.17.15 10.177.17.6 TCP 59917 > 49157 [ACK] Seq=1871 Ack=5101 Win=131328 Len=0
758 5.340308 10.177.17.15 10.177.17.6 DCERPC Request: call_id: 15 opnum: 3 ctx_id: 3 IWbemFetchSmartEnum V0
759 5.340479 10.177.17.6 10.177.17.15 DCERPC Fault: call_id: 15 ctx_id: 3 status: nca_s_fault_access_denied
760 5.340500 10.177.17.6 10.177.17.15 TCP 49157 > 59917 [FIN, ACK] Seq=5133 Ack=1967 Win=131072 Len=0
761 5.340515 10.177.17.15 10.177.17.6 TCP 59917 > 49157 [ACK] Seq=1967 Ack=5134 Win=131328 Len=0
762 5.340568 10.177.17.15 10.177.17.6 TCP 59917 > 49157 [FIN, ACK] Seq=1967 Ack=5134 Win=131328 Len=0
763 5.340651 10.177.17.6 10.177.17.15 TCP 49157 > 59917 [ACK] Seq=5134 Ack=1968 Win=131072 Len=0
As highlighted above in yellow, after the client enter the foreach loop, it sent out the IWbemFetchSmartEnum::GetSmartEnum request (in frame 747, opnum 3) using the wrong credential once again, and it resulted the remote RPC to drop the TCP connection. This was the last user actions. After that, DCOM started garbage collections (frame 842-845) and then the communication over the wire was completely stopped.
836 6.612612 10.177.17.15 10.177.17.6 TCP 59918 > 49157 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 WS=8 SACK_PERM=1
837 6.612861 10.177.17.6 10.177.17.15 TCP 49157 > 59918 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=8 SACK_PERM=1
838 6.613020 10.177.17.15 10.177.17.6 TCP 59918 > 49157 [ACK] Seq=1 Ack=1 Win=131328 Len=0
839 6.613592 10.177.17.15 10.177.17.6 DCERPC Bind: call_id: 17, 3 context items, 1st IRemUnknown2 V0.0, NTLMSSP_NEGOTIATE
840 6.614019 10.177.17.6 10.177.17.15 DCERPC Bind_ack: call_id: 17, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840
841 6.614199 10.177.17.15 10.177.17.6 DCERPC AUTH3: call_id: 17, NTLMSSP_AUTH, User: 10.177.17.6\administrator
842 6.614236 10.177.17.15 10.177.17.6 IRemUnknown2 RemRelease request
843 6.614429 10.177.17.6 10.177.17.15 TCP 49157 > 59918 [ACK] Seq=245 Ack=881 Win=130560 Len=0
844 6.614761 10.177.17.6 10.177.17.15 IRemUnknown2 RemRelease response
845 6.616079 10.177.17.15 10.177.17.6 TCP 59918 > 49157 [RST, ACK] Seq=881 Ack=309 Win=0 Len=0
Frame 845 is the last record in the capture file.
After walking through the capture file, the root cause of the mysterious COMException should be clear now: Some code in WMI .NET layer was connect to remote RPC with a wrong credential, which makes the RPC server (on the remote machine) to drop the TCP connection. While the foreach loop was not aware of this, and kept reusing the dead TCP connection to make requests. When that happens, RPC reports generic error 0x6BE or 0x6BF
Now the next question is that why this issue only happens if the remote machine is running under WinPE environment, not in other full OS version. To answer that, another capture is made for the network traffic with the same codes running against a remote machine under full OS. By analyzing the new capture, it turns out that in full OS scenario, the RCP server behaves differently: It does NOT terminate the TCP connection when wrong credential is used.
With the knowledge obtained so far, do a search on Bing.com about “WMI.NET IManagedobject Current Security Token”, and it would reveal two KB articles from Microsoft (KB967622 and KB974893). They talked about exact the same issue faced here: When WMI.NET was using DCOM wrapper layer provided by .NET framework, the .NET framework is always trying to query IManagedObject interface for any COM object returned. However, .NET framework does not set the security blanket properly before issuing such a request. Without the security blanket being set, the request would use the default security context, i.e. the current security token of that process. This problem is annoying at least, because it would generate bunch of “Failure Audit” events on the remote machine, which causes security concerns, or in this WinPE case, it would actually disrupt the normal code execution.
Note: see [PWS00] chapter 9 for more information about SetSecurityBlanket
The same KB articles provides hotfix from Microsoft to fix this bug for earlier version of .NET. From .NET 4.0 onwards, this fix has already been integrated. After installing the hotfix for .NET 3.5 SP1, the same codes run smoothly against a WinPE machine. By observing the network traffic using package analyzer, it can be confirmed that the remote QueryInterface for IManagedObject requests are not sent over the wire anymore.
Take Away:
Coding in managed environment has greatly simplified the developer’s life. While enjoying the mighty of modern programming techniques, it is also very helpful to understand the history of Microsoft technologies and the internal working of such technologies. To understand what happens under the hood, it is not only giving the developers the confidence when facing bugs, but also giving out examples how ones could design/implement to solve similar issues. Similarly, to know the path of how technologies evolve, it gives the developers the insight to those technologies’ rising and decline along with paradigm shift in this industry.
Package analyzers are efficient tools not only for traditional RPC based intranet applications, it is useful as well for newly Internet/Cloud based program. Having basic understanding of some popular network protocol, like TCP and HTTP, can greatly help to solve bugs. Therefore, It is well worthy to invest time to learn them. More inter-process communication debugging tips and expert advises can be found at [AWD07] Chapter 8.
References:
1. [PWS00] Programming Windows Security
3. [MS-WMI] Windows Management Instrumentation Remote Protocol Specification
4. [AWD07] Advanced Windows Debugging
Special thanks to Gabriel and Pradeep’s help in solving this bug in the first place