Debugging a BugCheck 9C
Disclaimer: A BugCheck 9C is a generic catch-all error code which signifies that the hardware has encountered a catastrophic error. The steps I took below helped me diagnose and fix the issue with my system. If you are hitting a similar error, the following may or may not help you, as multiple (and diverse) failures are all represented with this one error code.
Random Crashes
My home PC started rebooting at random. At first, most of the crashes happened while I was playing PC games, which can stress the hardware pretty heavily, although occasionally a crash would happen during something fairly benign, such as checking email in a web browser.
To get more information about the crash, I went looking for the mini-dump, which I have configured to be written out when the OS hits a “BSOD”. Unfortunately, the output directory was completely empty, and no memory dumps were to be found anywhere on my hard drive. At this point I was thinking that this crash must be pretty catastrophic to completely bypass the kernel’s exception handler. After some mostly fruitless, generic troubleshooting and web browsing, I found an obscure reference to the fact that if your Windows page file is not on the same partition as the OS, then you won’t get crash dumps. I had put my page file on its very own partition as a performance enhancement (it keeps the page file from fragmenting). So, with some amount of bitterness, I moved my page file back to the C drive and waited for the next crash to happen.
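The page file check can be sketched in a few lines of Python. This is only a rough illustration, not an official diagnostic: it assumes the page file configuration lives in the PagingFiles registry value under Session Manager\Memory Management, that each entry starts with an explicit drive letter, and that the SystemDrive environment variable names the OS partition. The helper names are made up for this example, and entries without an explicit drive letter are simply not matched.

```python
import os
import sys


def pagefile_on_system_drive(paging_files, system_drive="C:"):
    """Return True if any configured page file sits on the system drive.

    paging_files is a list of strings such as r"D:\pagefile.sys 2048 4096"
    (assumed format: path, then initial and maximum size).
    """
    wanted = system_drive[:2].upper()
    for entry in paging_files:
        parts = entry.split()
        if not parts:
            continue
        # The first two characters of the path are the drive, e.g. "D:".
        if parts[0][:2].upper() == wanted:
            return True
    return False


def read_paging_files():
    """Read the PagingFiles value from the registry (Windows only)."""
    import winreg  # imported here so the rest of the sketch loads anywhere

    key_path = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "PagingFiles")
    return value


if __name__ == "__main__" and sys.platform == "win32":
    system_drive = os.environ.get("SystemDrive", "C:")
    if pagefile_on_system_drive(read_paging_files(), system_drive):
        print("A page file is configured on", system_drive)
    else:
        print("No page file on", system_drive, "- crash dumps may not be written")
```

With my original setup (page file only on its own partition), a check like this would have reported the problem before I wasted a crash waiting for a dump that was never going to appear.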
A little while later my box rebooted again, and upon coming back up I happily saw a crash dump sitting there. Now I could get somewhere. I moved the generated .dmp file onto another computer and cranked up WinDbg (https://www.microsoft.com/whdc/devtools/debugging/default.mspx) so I could take a look at what happened (some spew has been omitted for brevity).
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck 9C, {0, f6232050, b2000000, 1040080f}
Probably caused by : Unknown_Image ( ANALYSIS_INCONCLUSIVE )
Followup: MachineOwner
---------
1: kd> !analyze -v
MACHINE_CHECK_EXCEPTION (9c)
A fatal Machine Check Exception has occurred.
Debugging Details:
------------------
NOTE: This is a hardware error. This error was reported by the CPU
via Interrupt 18. This analysis will provide more information about
the specific error. Please contact the manufacturer for additional
information about this error and troubleshooting assistance.
This error is documented in the following publication:
- IA-32 Intel(r) Architecture Software Developer's Manual
Volume 3: System Programming Guide
Bit Mask:
    MA                           Model Specific       MCA
 O  ID      Other Information      Error Code     Error Code
VV  SDP ___________|____________ _______|_______ _______|______
AEUECRC|                        |               |              |
LRCNVVC|                        |               |              |
^^^^^^^|                        |               |              |
   6         5         4         3         2         1
3210987654321098765432109876543210987654321098765432109876543210
----------------------------------------------------------------
1011001000000000000000000000000000010000010000000000100000001111
VAL - MCi_STATUS register is valid
Indicates that the information contained within the IA32_MCi_STATUS
register is valid. When this flag is set, the processor follows the
rules given for the OVER flag in the IA32_MCi_STATUS register when
overwriting previously valid entries. The processor sets the VAL
flag and software is responsible for clearing it.
UC - Error Uncorrected
Indicates that the processor did not or was not able to correct the
error condition. When clear, this flag indicates that the processor
was able to correct the error condition.
EN - Error Enabled
Indicates that the error was enabled by the associated EEj bit of the
IA32_MCi_CTL register.
PCC - Processor Context Corrupt
Indicates that the state of the processor might have been corrupted
by the error condition detected and that reliable restarting of the
processor may not be possible.
BUSCONNERR - Bus and Interconnect Error BUS{LL}_{PP}_{RRRR}_{II}_{T}_err
These errors match the format 0000 1PPT RRRR IILL
Concatenated Error Code:
--------------------------
_VAL_UC_EN_PCC_BUSCONNERR_F
This error code can be reported back to the manufacturer.
They may be able to provide additional information based upon
this error. All questions regarding STOP 0x9C should be
directed to the hardware manufacturer.
BUGCHECK_STR: 0x9C_GenuineIntel
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: DRIVER_FAULT
LAST_CONTROL_TRANSFER: from e0b89bff to e0bc7deb
STACK_TEXT:
f6232028 e0b89bff 0000009c 00000000 f6232050 nt!KeBugCheckEx+0x1b
f6232154 e0b84c52 f622ed70 00000000 00000000 hal!HalpMcaExceptionHandler+0xdd
f6232154 00000000 f622ed70 00000000 00000000 hal!HalpMcaExceptionHandlerWrapper+0x4a
STACK_COMMAND: kb
SYMBOL_NAME: ANALYSIS_INCONCLUSIVE
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: Unknown_Module
IMAGE_NAME: Unknown_Image
DEBUG_FLR_IMAGE_TIMESTAMP: 0
FAILURE_BUCKET_ID: 0x9C_GenuineIntel_ANALYSIS_INCONCLUSIVE
BUCKET_ID: 0x9C_GenuineIntel_ANALYSIS_INCONCLUSIVE
Followup: MachineOwner
---------
BugCheck 9C is the main error code, and the four numbers in braces immediately following it provide additional information that depends on your hardware configuration and the event that triggered the problem. Microsoft has a generic KB article on the 9C errors (https://support.microsoft.com/?kbid=329284), which gives you some techno-speak for "your hardware did something horribly wrong and we have no idea how to continue."
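To see how those numbers relate to the bit mask that !analyze -v prints, here is a minimal Python sketch that rebuilds the 64-bit status value by hand. It assumes that the third and fourth bugcheck parameters are the high and low 32 bits of the MCi_STATUS register, which is consistent with the bit mask in my dump, and it takes the flag positions from the column labels in that output (bits 63 down to 57). The function name and the dictionary layout are my own invention for illustration, and the bus fields are left as raw numbers rather than interpreted.

```python
# Flag bits as laid out in the WinDbg bit mask header (bit 63 is leftmost).
STATUS_FLAGS = [
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"),
]


def decode_mci_status(high, low):
    """Combine two 32-bit halves into MCi_STATUS and pick out its fields."""
    status = (high << 32) | low
    mca_code = status & 0xFFFF  # bits 15..0: MCA error code
    decoded = {
        "bits": format(status, "064b"),
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31..16
        "mca_code": mca_code,
        "bus_fields": None,
    }
    # Bus and interconnect errors match the format 0000 1PPT RRRR IILL.
    if (mca_code & 0xF800) == 0x0800:
        decoded["bus_fields"] = {
            "PP": (mca_code >> 9) & 0x3,
            "T": (mca_code >> 8) & 0x1,
            "RRRR": (mca_code >> 4) & 0xF,
            "II": (mca_code >> 2) & 0x3,
            "LL": mca_code & 0x3,
        }
    return decoded


# Parameters 3 and 4 from: BugCheck 9C, {0, f6232050, b2000000, 1040080f}
decoded = decode_mci_status(0xB2000000, 0x1040080F)
print(decoded["bits"])
print("_".join(decoded["flags"]))
print(decoded["bus_fields"])
```

For my dump the set flags come out as VAL, UC, EN and PCC, the same ones WinDbg lists, and the low 16 bits fall into the bus and interconnect pattern. What the individual bus fields mean is something to look up in the Intel manual that the analysis output points to.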
After doing some additional research on the web, it looked like the most common causes of this issue are a faulty motherboard, CPU, power supply, or memory.
BIOS
The first thing I did was turn on the full memory test that runs during boot-up. This test ran successfully, so I moved on to the power supply. My BIOS lets me monitor the voltages being supplied to the components in the computer, so I let that screen run for a while and watched the values. Everything looked good: the 12 volt line was pretty steady, 5 volts looked good, and so on.
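If you would rather not judge "pretty steady" by eye, a tolerance check on readings copied down from the BIOS screen is easy to sketch. The example below assumes the commonly quoted ATX tolerance of plus or minus 5% on the positive rails; check the figure for your own power supply. The sample readings are made up for illustration and are not values from my machine.

```python
TOLERANCE = 0.05  # assumed +/-5% tolerance for the +12V, +5V and +3.3V rails


def out_of_tolerance(nominal, readings, tolerance=TOLERANCE):
    """Return the readings that fall outside nominal +/- tolerance."""
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    return [r for r in readings if not (low <= r <= high)]


# Hypothetical readings noted by hand from the BIOS hardware monitor.
samples = {
    12.0: [12.10, 11.90, 12.05],
    5.0: [5.02, 4.98, 5.01],
}

for nominal, readings in samples.items():
    bad = out_of_tolerance(nominal, readings)
    status = "ok" if not bad else "out of range: " + str(bad)
    print(nominal, "V rail:", status)
```

Keep in mind that a BIOS screen only shows the supply under a light load, so a clean result here does not fully clear the power supply.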
Diagnostic Tools
The next step was to run some tools to test and stress the various components. I started with Microsoft's Memory Diagnostic tool (https://oca.microsoft.com/en/windiag.asp), which runs off its own boot disk (so it works well even if your current OS is in a bad state). I let this application run through several iterations without any errors detected, so then I went looking for CPU tests. I tried the Stress Prime 2004 test (https://sp2004.fre3.com/) with the same result: no errors. I also tried out SiSoftware's Sandra package (https://www.sisoftware.co.uk/), which will stress and monitor multiple system components. Unfortunately, everything I used was pretty inconclusive: I was still encountering crashes, but no actionable errors had been detected.
At this point the crash frequency had increased to the point where I could no longer keep Windows up and running long enough to do anything useful, so it was time to get a boot disk. Downloading the latest version of the "Ultimate Boot CD" (https://ubcd.sourceforge.net/) also gave me another set of diagnostic tools to try out. But the results were pretty much the same; some tests could run for extended periods of time without any problems, other tests would crash without giving me any useful information.
Opening the Case
Based on my initial tests I had a fairly high confidence level in the power supply and the memory, which left the CPU and the motherboard, so now it was time to open the case up. After disconnecting everything, I pulled the side panel off and vacuumed out all the dust and cat hair that had collected over the years. I then went over the motherboard and components with a good light and looked for anything out of the ordinary. One thing that caught my eye was a bloated capacitor (https://blogs.msdn.com/photos/joshpoley/picture6646863.aspx). Looking closer, it even appeared to be leaking down into another component. While it wasn't as bad as what these guys experienced (https://www.pcstats.com/articleview.cfm?articleID=195), it definitely wasn't going to be making my computer any healthier. There was no direct evidence that the capacitor was the root cause of the BugCheck, but it was enough of a concern to warrant replacing the board.
If I hadn't noticed anything visually, the next step would have been to pull out half of my memory and run the box with one pair and then the other, to see if I could rule out memory failure for certain. But since I had a bad capacitor, the best thing to do was order a replacement motherboard and test that first. After the time-consuming process of completely disassembling my computer, inserting the new board, and hooking everything back up, I apprehensively powered it all up.
Once I was back in the OS, I ran more tests to check that everything was hooked up correctly and to see whether I would encounter another crash. I also monitored the component temperatures closely to make sure the CPU heat sink and system fans were all working properly. Luckily, I was able to run all weekend without any crashes, so it looks like the motherboard was the culprit behind my STOP 9C errors.