แชร์ผ่าน


Beware of Auto-Reset Events (They Don't Behave the Way You Think They Do)

So I just spent several days debugging an intermittent crash in some test code.  Part of the reason it took so long was because whoever wrote this code introduced a dependency on a legacy test component that was apparently so old that sources and symbols had disappeared.  So the first part of the diagnosis involved a bit of reverse engineering, which is always a bit time consuming.  Another reason it took so long was because the crash couldn't reliably be reproduced in the debugger, so I had to start with post-mortem analysis.  Fortunately, windbg is a great tool for doing that sort of thing.

So I reproduce the crash, attach windbg and get started (interestingly enough, Dr. Watson couldn't grab a crash dump...):

*** wait with pending attach
Symbol search path is: *******************
Executable search path is:
ModLoad: 09320000 0932b000   c:\dd\OfficeClient_2\binaries\x86chk\SuiteBin\sdmquery.dll
eax=00000000 ebx=00000001 ecx=00000000 edx=00000000 esi=00000000 edi=00000000
eip=77170bc5 esp=0046e770 ebp=0046e808 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
ntdll!ZwWaitForMultipleObjects+0x15:
77170bc5 c21400          ret     14h

The first thing to do is dump the stacks of all threads so we can figure out which one crashed.
0:000> ~*k

.  0  Id: c80.ef0 Suspend: 1 Teb: 7efdd000 Unfrozen
ChildEBP RetAddr
0046e76c 753ddcea ntdll!ZwWaitForMultipleObjects+0x15
0046e808 712b549e KERNEL32!WaitForMultipleObjectsEx+0x11d
0046e86c 712b5529 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0x56
0046e88c 712b5645 mscorwks!Thread::DoAppropriateAptStateWait+0x4d
0046e918 712b51ed mscorwks!Thread::DoAppropriateWaitWorker+0x14d
0046e984 714337cd mscorwks!Thread::DoAppropriateWait+0x60
*** WARNING: Unable to verify checksum for C:\Windows\assembly\NativeImages_v4.0.10722_32\mscorlib\d3ebaf905742c816f27f122da299ef5f\mscorlib.ni.dll
0046ea80 10699cef mscorwks!WaitHandleNative::CorWaitOneNative+0x14f
0046ea98 10699bc2 mscorlib_ni!System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)+0x1f
0046eab4 05c280c0 mscorlib_ni!System.Threading.WaitHandle.WaitOne(Int32, Boolean)+0x2a
0046eb7c 05c27a40 Motif!MS.Internal.Motif.Runtime.TestElementWorkerThread.Join()+0x30
0046eb7c 026fdd9a Motif!MS.Internal.Motif.Runtime.WorkerThreadInvokeHandler.ElementInvokeHandler(System.Object, MS.Internal.Motif.Runtime.InvokedEventArgs)+0x1f8
0046eb7c 026fdb3e Motif!MS.Internal.Motif.Runtime.TestElement.OnInvoked(MS.Internal.Motif.Runtime.InvokedEventArgs)+0x1d2
0046ecc8 05c2c472 Motif!MS.Internal.Motif.Runtime.TestElement.Invoke()+0x17e
0046ecc8 002312d4 Motif!MS.Internal.Motif.Runtime.TestMethodAction.Execute()+0x212
*** ERROR: Module load completed but symbols could not be loaded for c:\dd\OfficeClient_2\binaries\x86chk\SuiteBin\sth_x86.exe
0046f064 713c2ee9 sth_x86!MS.Internal.Motif.TestHarness.StandaloneTestHarness.Main(System.String[])+0x10d4
0046f074 71256056 mscorwks!CallDescrWorker+0x33
0046f0f0 7125721e mscorwks!CallDescrWorkerWithHandler+0x8e
0046f22c 71257253 mscorwks!MethodDesc::CallDescr+0x19e
0046f248 712572ac mscorwks!MethodDesc::CallTargetWorker+0x21
0046f260 71305db5 mscorwks!MethodDescCallSite::Call_RetArgSlot+0x1c

...

8  Id: c80.98c Suspend: 1 Teb: 7ef40000 Unfrozen
ChildEBP RetAddr
06b3bc34 753d1270 ntdll!NtWaitForSingleObject+0x15
06b3bca4 753d11d8 KERNEL32!WaitForSingleObjectEx+0xbe
06b3bcb8 713acb41 KERNEL32!WaitForSingleObject+0x12
06b3bce8 713ad1ed mscorwks!ClrWaitForSingleObject+0x24
06b3c1a4 713afbcd mscorwks!RunWatson+0x1e3
06b3c8f4 713b01d1 mscorwks!DoFaultReportWorker+0xa46
06b3c934 714130a8 mscorwks!DoFaultReport+0x40
06b3c958 71411f04 mscorwks!WatsonLastChance+0x42
06b3d5f4 71411f7c mscorwks!EEPolicy::LogFatalError+0x420
06b3d60c 71413758 mscorwks!EEPolicy::HandleFatalError+0x38
06b3d630 71413c44 mscorwks!CLRVectoredExceptionHandlerPhase3+0xa6
06b3d664 714145a3 mscorwks!CLRVectoredExceptionHandlerPhase2+0x20
06b3d690 7141466c mscorwks!CLRVectoredExceptionHandler+0xad
06b3d6bc 771aa357 mscorwks!CLRVectoredExceptionHandlerShim+0x6d
06b3d6e8 7718c6c2 ntdll!RtlpCallVectoredHandlers+0x7a
06b3d6fc 7718c754 ntdll!RtlCallVectoredExceptionHandlers+0x12
06b3d774 77172eff ntdll!RtlDispatchException+0x19
06b3d774 712a0364 ntdll!KiUserExceptionDispatcher+0xf
06b3dad4 712a46e5 mscorwks!WKS::CFinalize::ScanForFinalization+0xc8
06b3db14 712a371f mscorwks!WKS::gc_heap::mark_phase+0x2c3

...

17  Id: c80.980 Suspend: 1 Teb: 7ef22000 Unfrozen
ChildEBP RetAddr
0e24fb2c 77217489 ntdll!ZwWaitForMultipleObjects+0x15
0e24fcc8 7544e3f3 ntdll!TppWaiterpThread+0x294
0e24fcd4 771ccfed KERNEL32!BaseThreadInitThunk+0xe
0e24fd14 771cd1ff ntdll!__RtlUserThreadStart+0x23
0e24fd2c 00000000 ntdll!_RtlUserThreadStart+0x1b
Thread #8 is clearly the problem, so we switch the debugger to that thread.
0:000> ~8s
eax=00000000 ebx=753d1661 ecx=0cd965ac edx=00400000 esi=06b3bc7c edi=00000000
eip=7717039d esp=06b3bc38 ebp=06b3bca4 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
ntdll!NtWaitForSingleObject+0x15:
7717039d c20c00          ret     0Ch

Now we dump the full mixed stack so we can get the full picture of what is going on.

0:008> !dumpstack
OS Thread Id: 0x98c (8)
Current frame: ntdll!NtWaitForSingleObject+0x15
ChildEBP RetAddr  Caller,Callee
06b3bc34 753d1270 KERNEL32!WaitForSingleObjectEx+0xbe
...
06b3df6c 0905e3fa (MethodDesc 0x127fe4 +0x52 MS.Internal.Mita.Logging.Log.Fail(System.Object, System.String, System.String))
06b3df94 0905e38d (MethodDesc 0x127fdc +0x2d MS.Internal.Mita.Logging.Log.Fail(System.Object, System.String)), calling (MethodDesc 0x127fe4 +0 MS.Internal.Mita.Logging.Log.Fail(System.Object, System.String, System.String))
06b3dfa4 0905e346 (MethodDesc 0x5cf2c78 +0x6 Microsoft.TrinityQA.Logging.MitaGlobalLog.Fail(System.String)), calling (MethodDesc 0x127fdc +0 MS.Internal.Mita.Logging.Log.Fail(System.Object, System.String))
06b3dfa8 0905e31b (MethodDesc 0x5cf46c0 +0x1b Microsoft.TrinityQA.Logging.SharedLog.Fail(System.String)), calling 09180582
06b3dfb8 0905e29e (MethodDesc 0x5cf30f0 +0x3e Microsoft.TrinityQA.Logging.TrinityLog.Fail(System.String)), calling 09180572
06b3dfec 0905e245 (MethodDesc 0x5cf30f8 +0x2d Microsoft.TrinityQA.Logging.TrinityLog.Fail(System.Exception)), calling (MethodDesc 0x5cf30f0 +0 Microsoft.TrinityQA.Logging.TrinityLog.Fail(System.String))

...
06b3e87c 77173099 ntdll!ExecuteHandler2+0x26
06b3e8a0 7717306b ntdll!ExecuteHandler+0x24 calling ntdll!ExecuteHandler2
06b3e8c4 7718c7ba ntdll!RtlDispatchException+0x124 , calling ntdll!RtlpExecuteHandlerForException
06b3e950 77172eff ntdll!KiUserExceptionDispatcher+0xf, calling ntdll!RtlDispatchException
06b3ec84 09321260 *** WARNING: Unable to verify checksum for c:\dd\OfficeClient_2\binaries\x86chk\SuiteBin\sdmquery.dll
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for c:\dd\OfficeClient_2\binaries\x86chk\SuiteBin\sdmquery.dll -
sdmquery!SDMCALL+0x1c0 ====> Exception Code 255cc2d cxr@6b3e9b8 exr@b3e968
06b3ecb8 76cf7deb USER32!IsWindow+0x54 calling USER32!__SEH_epilog4
06b3ecbc 0011aa60 0011aa60, calling KERNEL32!GetLastError
06b3ecc8 0029874f 0029874f
06b3ecf0 0905e0ef (MethodDesc 0x6f3bf7c +0x1df Maui.Core.SdmWindow.GetDialogInfo()), calling 0034832c
06b3ed10 0905e0ef (MethodDesc 0x6f3bf7c +0x1df Maui.Core.SdmWindow.GetDialogInfo()), calling 0034832c
06b3ed80 0905dead (MethodDesc 0x6f3bf04 +0x35 Maui.Core.SdmWindow.Init()), calling 003482b8
06b3ed94 0905de5d (MethodDesc 0x6f3beec +0x25 Maui.Core.SdmWindow..ctor(System.String, Maui.Core.Utilities.StringMatchSyntax, System.String, Maui.Core.Utilities.StringMatchSyntax, Maui.Core.App, Int32)), calling 00348274
06b3eda0 0905de18 (MethodDesc 0x6f3c2d0 +0x30 Maui.VisualStudio.Current.Vsto.Office12.Dialogs.SdmAddNamedRangeDialog..ctor(Maui.Core.App)), calling 00348264
06b3edc4 0905ddc7 (MethodDesc 0x6f38bf0 +0x27 Maui.VisualStudio.Current.Vsto.Office12.DesignerControls.Excel.NamedRange.OnComponentInitialize()), calling 00348244
06b3edd4 09059799 (MethodDesc 0x92c4a60 +0x2a9 Maui.VisualStudio.Current.Vsto.Designers.VstoDesigner.AddToolBoxComponent(Maui.VisualStudio.Current.Ide.Designers.IToolBoxComponent, Maui.VisualStudio.Current.Ide.Designers.AddMethod, Int32, Int32, Int32, Int32)), calling 09180432
06b3ee08 0905945d (MethodDesc 0x92c4cbc +0x2d Maui.VisualStudio.Current.Vsto.Office12.Designers.WorksheetDesigner.AddToolBoxComponent(Maui.VisualStudio.Current.Ide.Designers.IToolBoxComponent, Maui.VisualStudio.Current.Ide.Designers.AddMethod, Int32, Int32, Int32, Int32)), calling (MethodDesc 0x92c4a60 +0 Maui.VisualStudio.Current.Vsto.Designers.VstoDesigner.AddToolBoxComponent(Maui.VisualStudio.Current.Ide.Designers.IToolBoxComponent, Maui.VisualStudio.Current.Ide.Designers.AddMethod, Int32, Int32, Int32, Int32))
06b3ee2c 09059361 (MethodDesc 0x92c4a50 +0xa9 Maui.VisualStudio.Current.Vsto.Designers.VstoDesigner.AddToolBoxComponent(Maui.VisualStudio.Current.Ide.Designers.IToolBoxComponent, Maui.VisualStudio.Current.Ide.Designers.AddMethod))
06b3ee54 0905927b (MethodDesc 0x6f38ba8 +0x43 Maui.VisualStudio.Current.Vsto.Office12.DesignerControls.Excel.NamedRange..ctor(Maui.VisualStudio.Current.Vsto.Office12.Designers.WorksheetDesigner, Maui.VisualStudio.Current.Ide.Designers.AddMethod))
06b3ee68 09059206 (MethodDesc 0x92c4784 +0x1e Microsoft.TrinityQA.Vsto.Helpers.Project.Excel12DesignerHelper.AddNamedRange(Maui.VisualStudio.Current.Ide.Designers.AddMethod)), calling 0034767c
06b3ee7c 09053dbd (MethodDesc 0x5cfba34 +0x5d Microsoft.TrinityQA.FeatureTesters.Vsto.ExcelDocProjectTester.BF_Debug_Windows()), calling 09180182
06b3ee84 06807dc0 (MethodDesc 0x5cfaedc +0xb0 Microsoft.TrinityQA.FeatureTesters.Vsto.DocProjectTester.RunTest(Microsoft.TrinityQA.FeatureTesters.TestSession.TestCase))
06b3eea4 06807dc0 (MethodDesc 0x5cfaedc +0xb0 Microsoft.TrinityQA.FeatureTesters.Vsto.DocProjectTester.RunTest(Microsoft.TrinityQA.FeatureTesters.TestSession.TestCase))
06b3eedc 06807cf7 (MethodDesc 0x48ed118 +0x4f Microsoft.TrinityQA.FeatureTesters.Vsto.DocProjectEntryPoints.BF_Debug_Windows()), calling (MethodDesc 0x5cfaedc +0 Microsoft.TrinityQA.FeatureTesters.Vsto.DocProjectTester.RunTest(Microsoft.TrinityQA.FeatureTesters.TestSession.TestCase))
06b3eef4 713c2ee9 mscorwks!CallDescrWorker+0x33
06b3ef04 71256056 mscorwks!CallDescrWorkerWithHandler+0x8e calling mscorwks!CallDescrWorker
...

06b3ff40 771cd1ff ntdll!_RtlUserThreadStart+0x1b, calling ntdll!__RtlUserThreadStart
From the stack, we can see that sdmquery!SDMCALL is throwing and there is an exception record (exr).  However, the exception code isn't something I recognize.  Luckily, there is a managed exception on the stack, so if we dump that, we should get a string that explains the problem.
0:008> !dso
OS Thread Id: 0x98c (8)
ESP/REG  Object   Name
06b3c890 028310b4 System.ExecutionEngineException
06b3c8dc 028310b4 System.ExecutionEngineException
06b3db90 0cda8628 System.String    cat="INFO/TEST/FAIL"
06b3dc24 0cf8467c System.String    2
06b3dc40 0cf83ba8 System.String    System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Maui.Core.NativeMethods.BldSdmCall(IntPtr HWnd, Int32 wMsg, Int32 wp, CTRLDEF& lp)

...

06b3f824 029042ac System.Threading.ThreadStart

Ok, now we can see we're looking at an access violation.  WinDbg has a very useful dot command that allows us to switch the context to reflect system state at the time of the exception.  So we'll use the .exr command to do that and see what our state was when we crashed.
0:008> .cxr 6b3e9b8
eax=09530000 ebx=000000a0 ecx=00000249 edx=00001900 esi=09531000 edi=0cd97460
eip=09321260 esp=06b3ec84 ebp=06b3ecc8 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
sdmquery!SDMCALL+0x1c0:
09321260 f3a5            rep movs dword ptr es:[edi],dword ptr [esi]

Now we're getting somewhere.  The access violation occured during a memory copy of some sort.  We can dissasamble from here to get the address of the next instruction.  We'll use that to disassemble backwards so we can see what the code was doing prior to the memory copy.

0:008> u 09321260
sdmquery!SDMCALL+0x1c0:
09321260 f3a5            rep movs dword ptr es:[edi],dword ptr [esi]
09321262 8bca            mov     ecx,edx
09321264 83e103          and     ecx,3
09321267 f3a4            rep movs byte ptr es:[edi],byte ptr [esi]
09321269 eb4d            jmp     sdmquery!SDMCALL+0x218 (093212b8)
0932126b 81f9fd7f0000    cmp     ecx,7FFDh
09321271 7408            je      sdmquery!SDMCALL+0x1db (0932127b)
09321273 81f9fc7f0000    cmp     ecx,7FFCh

Now that we have the next instruction, we can disassemble backwards to see what the function does.
0:008> ub 09321262 l 100
sdmquery+0xf78:
09320f78 0000            add     byte ptr [eax],al
...
0932123c a174873209      mov     eax,dword ptr [sdmquery!SDMCALL+0x76d4 (09328774)]
09321241 8b4d0c          mov     ecx,dword ptr [ebp+0Ch]
09321244 81f9f77f0000    cmp     ecx,7FF7h
0932124a 8b5804          mov     ebx,dword ptr [eax+4]
0932124d 751c            jne     sdmquery!SDMCALL+0x1cb (0932126b)
0932124f 8b7d14          mov     edi,dword ptr [ebp+14h]
09321252 8d0c9b          lea     ecx,[ebx+ebx*4]
09321255 c1e103          shl     ecx,3
09321258 8bd1            mov     edx,ecx
0932125a 8d7024          lea     esi,[eax+24h]
0932125d c1e902          shr     ecx,2
09321260 f3a5            rep movs dword ptr es:[edi],dword ptr [esi]
Ok, we have an idea of what the function is doing, so the next step is to figure out what kind of an acess violation we're getting.  We'll start be checking the address that edi points to.  If we get garbage there, we're looking at a buffer overrun.  However, it looks like edi is pointing to a valid address.
0:008> dd edi
0cd97460  0074006e 0032005f 0073005c 00630072
0cd97470  0073005c 00690075 00650074 00720073
0cd97480  005c0063 00720074 006e0069 00740069
0cd97490  005c0079 00750061 006f0074 0061006d
0cd974a0  00690074 006e006f 0043005c 006d006f
0cd974b0  006f006d 005c006e 0069004c 005c0062
0cd974c0  0061004d 00690075 0056005c 00730069
0cd974d0  00610075 0053006c 00750074 00690064

Next, we'll dump the source pointer.  Clearly, this address is bad.  So at a minimum, we've read off the end of our source buffer while doing the memory copy.  The 64k question is why.
0:008> dd esi
09531000  ???????? ???????? ???????? ????????
09531010  ???????? ???????? ???????? ????????
09531020  ???????? ???????? ???????? ????????
09531030  ???????? ???????? ???????? ????????
09531040  ???????? ???????? ???????? ????????
09531050  ???????? ???????? ???????? ????????
09531060  ???????? ???????? ???????? ????????
09531070  ???????? ???????? ???????? ????????

Let's refer back to the exception context and look at some registers.  ECX is definitely suspect.  We're trying to copy at least 585 bytes (and probably more since we don't yet know what ECX started at).  In any event, the overrun may be caused by a bug in the size calculation, so lets explore that.
eax=09530000 ebx=000000a0 ecx=00000249 edx=00001900 esi=09531000 edi=0cd97460
eip=09321260 esp=06b3ec84 ebp=06b3ecc8 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
sdmquery!SDMCALL+0x1c0:
09321260 f3a5            rep movs dword ptr es:[edi],dword ptr [esi]
Looking at the disassembly, we see that ECX's value is determined by some EBX calculations.  EBX contains an offset of an EAX address.  Luckily we can see that none of these registers have been overwritten so we should be able to examine them.  We'll start with eax+4--which should be the value that ebx got.
0:008> dd eax+4
09530004  00000004 00120ab2 00007ff7 00000002
09530014  0cd96484 76cf7deb 0011aa60 00000001
09530024  00010002 00060000 00000000 00000086
09530034  00000035 000000d2 0000004d 0000011a
09530044  0000ffff ffffffff 00020002 00070000
09530054  00000000 000000da 00000035 00000126
09530064  0000004d 0000011a 0000ffff ffffffff
09530074  00120001 00400000 00000000 00000006

Strange; eax+4 is 0x4, yet ebx is 0x0a.  The only possible way that could happen is if eax+4 got overwritten with a new value *after* it got read into EBX.  This code certainly isn't doing that.  It is worth mentioning though that when this bug does not occur, *both* EAX+4 and EBX are set to 0x4.
0:008> ?ebx
Evaluate expression: 160 = 000000a0

We also see that edx holds the original value that ECX was set to.  IOW it contains the original count.  If we follow the assembly, we can see it jibes with an EBX value of 0xa0.  IOW 0xa0 + (0xa0*4) = 0x320  0x320 * 8 (the shl instruction multiplies by 2^#bits shifted) = 0x1900 which is what we see in edx.

0:008> ?edx
Evaluate expression: 6400 = 00001900

And just for grins we can verify that the starting address for the memory copy is valid.  ESI was originally set to eax+24, which is still valid since EAX hasn't been overwritten.  EAX+24 looks like its valid.

Luckily, before I had to try to reverse engineer the entire function, a collegue came to the rescue with some old source files for the problem component.  With that I was able to figure out the bug--but only after several days of painful debugging.  The problem code seemed to make sense.  It was a mechanism for sending dialog messages cross-process in a synchronous fashion.  To do that, the code would wrap up the message in some shared memory (using a memory mapped file), install a GetMessage hook in the target process, post a custom message to the target process, and then wait for an event to be signalled.  Once the event was signalled, the returned data would be read out of the shared memory buffer.  On the target side, the hook proc would intercept the custom message, unwrap the real message from shared memory, send it to the dialog, write the return values to the shared memory and then signal the event.

Since I couldn't see the problem from just looking at the code, I decided to try to build the module from the sources I had and see if it would still reproduce the bug.  Fortunately it did, so I started by adding a bunch of asserts and trace statements to try to gather more information.  What I could see was that in some cases, the value of the shared memory was changing after the WaitForSingleObject call had returned.  The other interesting thing is that 0x0a is a legitmate value returned by the first call into the offending code.

The first thing that occured to me was that perhaps I was seeing latency in the memory mapped file--i.e. perhaps we were missing a FlushViewOfFile call.  However, closer examination of the code revealed that the memory mapped file was backed by the page file, so there could be no latency.  My next thought was that perhaps there was some re-entrancy occuring, but eventually I was able to rule that out.  Finally, I started looking at the WaitForSingleObject call.  What had to be happening was that somehow the event was getting incorrectly signalled.  Unfortunately, there isn't a way to determine the state of the event, but from looking at the code, it didn't seem possible for the state to ever be wrong.  Finally, in desperation, I decided to change the Event to be a manual reset event.  My hope was that by adding explicit state changes, I might be able to pin-point where the state was being incorrectly set.  Lo and behold, moving to manual reset events fixed the bug.

Now I was really scratching my head, so I started a careful review of the SetEvent and ResetEvent documentation.  And that was where I found the cause of the bug.  According to msdn documentation (https://msdn.microsoft.com/en-us/library/ms686211.aspx), calling SetEvent has the following behavior:

"The state of an auto-reset event object remains signaled until a single waiting thread is released, at which time the system automatically sets the state to nonsignaled. If no threads are waiting, the event object's state remains signaled."

Whoa!  My understanding of auto-reset events was that the reset occured when the wait was satisified.  Clearly this is not the case!

Suddenly, the cause of the problem became crystal clear.  What was happening when the bug occured was this:  When the offending code called PostMessage, the hook proc would grab the message in the target process.  In some cases, the timing would be such that that hook proc would finish its work and signal the event (via SetEvent) before the code in the calling process had reached the WaitForSingleObject call.  In that case, since no thread was (yet) waiting on the event, the event was signaled and the subsequent WaitForSingleObject call would not actually result in a wait, which in turn meant that the event would *not* be automatically reset.  This resulted in the event staying signalled which meant that the next call to WaitForSingleObject would immediately (and incorrectly) return.  As a result, the value in the shared memory block would be read before the hook proc had actually performed the update.  In the bug case, this value would be sufficiently wrong to produce the read access violation on the memory copy.

So the moral of the story is that auto-reset events can be very dangerous.  I think many of us tend to use auto-reset events by default--with an incomplete understanding of how they actually behave.  Because it is virtually impossible to guarantee that the thread to be synchronized is actually waiting when the event is signalled, auto-reset events may not be useful as a synchronization mechanism.  Certainly, the case I've described must be considered. In my opinion, manual-reset events are much safer and are a much better choice when an event is needed for synchronization. By explicitly calling ResetEvent, one can ensure that the reset always happens as expected. This seems like a much safer bet than trying to prove the correctness of code that relies on auto-reset events to perform the reset.

Comments

  • Anonymous
    August 18, 2008
    PingBack from http://blog.a-foton.ru/2008/08/beware-of-auto-reset-events-they-dont-behave-the-way-you-think-they-do/

  • Anonymous
    October 06, 2011
    I think you are misunderstanding the basic concept. You are thinking about SetEvent of auto-reset as two separate cases - if thread is waiting it will reset, if no thread is waiting it will not reset. This is wrong picture. The truth is that SetEvent does not reset any event, ever (and of course there is no such saying in documentation). The opposite, SetEvent singals the event, as stated in the doc . What is resetting an auto-reset event is the completed wait operation. It's as dangerous as any other synchronization routine. "auto-reset events aren't particularly useful as a synchronization mechanism. Thats wrong. Manual reset are not the same. Usually you'll need additional lock to simulate auto-reset behavior, etc. because you can't wait or signal + reset in an atomic operation, while wait and reset of auto-reset is atomic.  That is the reason there is API for such event. "Manual-reset events are much safer and are a much better choice for all-around use. " Thats nonsense. These are different synchronization objects, each for its place, and thinking about all-around use without analyzing particular requirements is a way to such bugs as you've discovered. Its like saying semaphore is much better choice than interlocked, mutex or critical section "for all-around use".

    • Anonymous
      October 04, 2016
      Sorry for the very late reply. Somehow I never saw this comment in 2008. I haven't looked at this blog since I left Microsoft, but I just happened to see it today. At least in 2008, the behavior of the auto-reset event was such that if the event was signaled and no threads were waiting at the time, the event would never be automatically reset. Calling WaitXXX on an auto-reset event would behave differently depending on whether the event was signaled at the time of the call. This behavior was completely unexpected; one would assume that the resolution of the wait call would result in a reset regardless of whether a wait actually occurred. Because of that behavior, I stand by my conclusion that manual reset events are a much safer option because one can guarantee that the reset happens by explicitly calling ResetEvent. By "all around use" I was referring to the general case where one might wish to use an event for synchronization. Based on your comments, I'm going to make a couple of edits to be more precise.
  • Anonymous
    October 22, 2014
    Win32 may or may not have a more complete understanding of the differences but I want to thank you for your post because by changing an AutoResetEvent instance to a ManualResetEvent it fix a problem for me.  Perhaps like you, I don't care the that the wait and reset are atomic.  I care that I don't get an access violation exception and your explanation works for me and my situation.