

Debugging a BugCheck 9C

Disclaimer: A BugCheck 9C is a generic catch-all error code which signifies that the hardware has encountered a catastrophic error. The steps I took below helped me diagnose and fix the issue with my system. If you are hitting a similar error, the following may or may not help you, as multiple (and diverse) failures are all represented with this one error code.

Random Crashes

My home PC started rebooting at random. At first, most of the crashes happened while I was playing PC games, which can stress the hardware pretty heavily. Occasionally, though, a crash would happen during something fairly benign, such as checking email in a web browser.

 

To get more information about the crash, I went looking for the mini-dump, which I have configured to be written out when the OS hits a “BSOD”. Unfortunately, the output directory was completely empty, and no memory dumps were to be found anywhere on my hard drive. At this point I was thinking that this crash must be pretty catastrophic to completely bypass the kernel’s exception handler. After some mostly fruitless, generic troubleshooting and web browsing, I found an obscure reference to the fact that if your Windows page file is not on the same partition as the OS, you won’t get crash dumps. I had put my page file on its very own partition as a performance enhancement (it keeps the page file from fragmenting). So, with some amount of bitterness, I moved my page file back to the C drive and waited for the next crash to happen.
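A minimal Python sketch of that check, assuming the page file locations are read from the PagingFiles value under the Memory Management registry key. The function names are hypothetical, and entries that do not begin with a drive letter are simply skipped:

```python
import sys


def pagefile_drives(paging_files):
    """Return the upper-cased drive letters named in PagingFiles-style entries.

    Each entry is expected to look like 'D:\\pagefile.sys 1024 2048'.
    Entries that do not start with a drive letter are ignored.
    """
    drives = []
    for entry in paging_files:
        entry = entry.strip()
        if len(entry) >= 2 and entry[0].isalpha() and entry[1] == ":":
            drives.append(entry[0].upper())
    return drives


def pagefile_on_system_drive(paging_files, system_drive="C:"):
    """True if at least one page file entry lives on the system drive."""
    return system_drive[0].upper() in pagefile_drives(paging_files)


def read_paging_files():
    """Windows only: read the PagingFiles multi-string value from the registry."""
    import winreg  # only available on Windows

    key_path = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "PagingFiles")
    return list(value)


if __name__ == "__main__" and sys.platform == "win32":
    import os

    system_drive = os.environ.get("SystemDrive", "C:")
    entries = read_paging_files()
    print("Page file entries:", entries)
    print("On system drive:", pagefile_on_system_drive(entries, system_drive))
```

With my original setup (page file on its own partition), a check like this would have reported that no page file was on the system drive.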

 

A little while later my box rebooted again, and upon coming back up I happily saw that I had a crash dump sitting there. Now I could get somewhere. I moved the generated .dmp file onto another computer and cranked up WinDbg (https://www.microsoft.com/whdc/devtools/debugging/default.mspx) so I could take a look at what happened (some spew has been omitted for brevity).

 

*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

 

Use !analyze -v to get detailed debugging information.

 

BugCheck 9C, {0, f6232050, b2000000, 1040080f}

 

Probably caused by : Unknown_Image ( ANALYSIS_INCONCLUSIVE )

 

Followup: MachineOwner

---------

 

1: kd> !analyze -v

 

MACHINE_CHECK_EXCEPTION (9c)

A fatal Machine Check Exception has occurred.

 

Debugging Details:

------------------

   NOTE: This is a hardware error. This error was reported by the CPU

   via Interrupt 18. This analysis will provide more information about

   the specific error. Please contact the manufacturer for additional

   information about this error and troubleshooting assistance.

 

   This error is documented in the following publication:

 

      - IA-32 Intel(r) Architecture Software Developer's Manual

        Volume 3: System Programming Guide

 

   Bit Mask:

       MA                           Model Specific        MCA
    O  ID      Other Information      Error Code      Error Code
   VV  SDP ___________|____________ _______|_______ _______|______
   AEUECRC|                        |               |              |
   LRCNVVC|                        |               |              |
   ^^^^^^^|                        |               |              |
      6         5         4         3         2         1
   3210987654321098765432109876543210987654321098765432109876543210
   ----------------------------------------------------------------
   1011001000000000000000000000000000010000010000000000100000001111

 

 

VAL - MCi_STATUS register is valid

        Indicates that the information contained within the IA32_MCi_STATUS

        register is valid. When this flag is set, the processor follows the

        rules given for the OVER flag in the IA32_MCi_STATUS register when

        overwriting previously valid entries. The processor sets the VAL

        flag and software is responsible for clearing it.

 

UC - Error Uncorrected

        Indicates that the processor did not or was not able to correct the

        error condition. When clear, this flag indicates that the processor

        was able to correct the error condition.

 

EN - Error Enabled

        Indicates that the error was enabled by the associated EEj bit of the

        IA32_MCi_CTL register.

 

PCC - Processor Context Corrupt

        Indicates that the state of the processor might have been corrupted

        by the error condition detected and that reliable restarting of the

        processor may not be possible.

 

BUSCONNERR - Bus and Interconnect Error BUS{LL}_{PP}_{RRRR}_{II}_{T}_err

        These errors match the format 0000 1PPT RRRR IILL

 

 

 

   Concatenated Error Code:

   --------------------------

   _VAL_UC_EN_PCC_BUSCONNERR_F

 

   This error code can be reported back to the manufacturer.

   They may be able to provide additional information based upon

   this error. All questions regarding STOP 0x9C should be

   directed to the hardware manufacturer.

 

 

BUGCHECK_STR: 0x9C_GenuineIntel

 

CUSTOMER_CRASH_COUNT: 1

 

DEFAULT_BUCKET_ID: DRIVER_FAULT

 

LAST_CONTROL_TRANSFER: from e0b89bff to e0bc7deb

 

STACK_TEXT:

f6232028 e0b89bff 0000009c 00000000 f6232050 nt!KeBugCheckEx+0x1b

f6232154 e0b84c52 f622ed70 00000000 00000000 hal!HalpMcaExceptionHandler+0xdd

f6232154 00000000 f622ed70 00000000 00000000 hal!HalpMcaExceptionHandlerWrapper+0x4a

 

 

STACK_COMMAND: kb

 

SYMBOL_NAME: ANALYSIS_INCONCLUSIVE

 

FOLLOWUP_NAME: MachineOwner

 

MODULE_NAME: Unknown_Module

 

IMAGE_NAME: Unknown_Image

 

DEBUG_FLR_IMAGE_TIMESTAMP: 0

 

FAILURE_BUCKET_ID: 0x9C_GenuineIntel_ANALYSIS_INCONCLUSIVE

 

BUCKET_ID: 0x9C_GenuineIntel_ANALYSIS_INCONCLUSIVE

 

Followup: MachineOwner

---------

 

BugCheck 9C is the main error code, and the numbers immediately following it provide some additional information depending on your hardware configuration and the event that triggered the problem. Microsoft has the following generic KB article on the 9C errors: https://support.microsoft.com/?kbid=329284, which gives you some techno-speak for "your hardware did something horribly wrong and we have no idea of how to continue."
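The bit mask WinDbg prints can also be reproduced by hand. Here is a minimal Python sketch, assuming (as Microsoft's bug check 0x9C documentation describes for Intel processors with MCA support) that the third and fourth numbers are the high and low 32 bits of the MCi_STATUS register. The function and dictionary key names are hypothetical and not part of any tool:

```python
# Flag bits in the upper part of IA32_MCi_STATUS, as labeled in the WinDbg bit mask.
FLAGS = [
    ("VAL", 63), ("OVER", 62), ("UC", 61), ("EN", 60),
    ("MISCV", 59), ("ADDRV", 58), ("PCC", 57),
]


def decode_mci_status(high, low):
    """Decode BugCheck 9C parameters 3 (high) and 4 (low) as one 64-bit status."""
    status = (high << 32) | low
    result = {
        "status": status,
        "flags": [name for name, bit in FLAGS if (status >> bit) & 1],
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31..16
        "mca_code": status & 0xFFFF,                     # bits 15..0
    }
    mca_code = result["mca_code"]
    # Bus and interconnect errors match the pattern 0000 1PPT RRRR IILL.
    if mca_code & 0xF800 == 0x0800:
        result["bus_error"] = {
            "PP": (mca_code >> 9) & 0x3,
            "T": (mca_code >> 8) & 0x1,
            "RRRR": (mca_code >> 4) & 0xF,
            "II": (mca_code >> 2) & 0x3,
            "LL": mca_code & 0x3,
        }
    return result


if __name__ == "__main__":
    # Values taken from: BugCheck 9C, {0, f6232050, b2000000, 1040080f}
    decoded = decode_mci_status(0xB2000000, 0x1040080F)
    print(format(decoded["status"], "064b"))
    print("_" + "_".join(decoded["flags"]))
    print(hex(decoded["mca_code"]), decoded.get("bus_error"))
```

For the dump above this gives the VAL, UC, EN and PCC flags and an MCA error code of 0x080F, which fits the bus and interconnect pattern that WinDbg reported as BUSCONNERR.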

 

After doing some additional research on the web, it looked like the most common causes of this issue are a faulty motherboard, CPU, power supply, or memory.

BIOS

The first thing I did was to turn on the full memory test that runs during boot up. This test ran successfully, so I moved on to the power supply. My BIOS allows me to monitor the voltages being supplied to the components in the computer, so I let that screen run for a while and watched the values. Everything looked good: the 12 volt line was pretty steady, 5 volts looked good, etc.
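To put a number on "looked good", here is a small Python sketch that checks readings against a tolerance band. It assumes the commonly quoted +/-5% tolerance for the positive ATX rails (check your own power supply's specification), and the readings are made-up examples, not the values from my BIOS:

```python
def rail_within_tolerance(nominal, measured, tolerance=0.05):
    """True if the measured voltage is within +/- tolerance of the nominal value."""
    return abs(measured - nominal) <= abs(nominal) * tolerance


def check_rails(readings, tolerance=0.05):
    """Map each nominal rail voltage to True/False for its measured reading."""
    return {
        nominal: rail_within_tolerance(nominal, measured, tolerance)
        for nominal, measured in readings.items()
    }


if __name__ == "__main__":
    # Hypothetical readings: nominal rail voltage -> value shown on the monitor screen.
    sample = {12.0: 11.9, 5.0: 5.1, 3.3: 3.0}
    for nominal, ok in check_rails(sample).items():
        print(f"{nominal:>5} V rail: {'ok' if ok else 'OUT OF RANGE'}")
```

A steady reading inside the band does not prove the supply is healthy under load, but a reading outside it is a strong hint.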

Diagnostic Tools

The next step was to run some tools to test and stress the various components. I started with Microsoft's Memory Diagnostic tool (https://oca.microsoft.com/en/windiag.asp), which runs off of its own boot disk (so it works well even if your current OS is in a bad state). I let this application run through several iterations without any errors detected, so then I went looking for CPU tests. I tried the Stress Prime 2004 test (https://sp2004.fre3.com/) with the same result: no errors. I also tried out SiSoftware's Sandra package (https://www.sisoftware.co.uk/), which will stress and monitor multiple system components. Unfortunately everything I used was pretty inconclusive: I was still encountering crashes, but no actionable errors had been detected.

 

At this point the crash frequency had increased to the point where I could no longer keep Windows up and running long enough to do anything useful, so it was time to get a boot disk. Downloading the latest version of the "Ultimate Boot CD" (https://ubcd.sourceforge.net/) also gave me another set of diagnostic tools to try out. But the results were pretty much the same; some tests could run for extended periods of time without any problems, other tests would crash without giving me any useful information.

Opening the Case

Based on my initial tests I had fairly high confidence in the power supply and the memory, which left the CPU and the motherboard, so it was time to open up the case. After disconnecting everything I pulled the side panel off and vacuumed out all the dust and cat hair that had collected over the years. I then went over the motherboard and components with a good light and looked for anything out of the ordinary. One thing that caught my eye was a bloated capacitor (https://blogs.msdn.com/photos/joshpoley/picture6646863.aspx); looking closer, it even appeared to be leaking down into another component. While it wasn't as bad as what these guys experienced (https://www.pcstats.com/articleview.cfm?articleID=195), it definitely wasn't going to be making my computer any healthier. There was no direct evidence that the capacitor was the root cause of the BugCheck, but it was enough of a concern to warrant replacing the board.

 Bad Capacitor

 

If I hadn't noticed anything visually, the next step would have been to pull out half of my memory and run the box with one pair, then the other, to see if I could rule out memory failure for certain. But since I had a bad capacitor, the best thing to do was to order a replacement motherboard and test that first. After the time-consuming process of completely disassembling my computer, inserting the new board, and hooking everything back up, I apprehensively powered it all up.

 

Once I was into the OS, it was back to running more tests to check that everything was hooked up correctly and to see if I would encounter another crash. I also monitored the component temperatures closely to ensure the CPU heat sink and system fans were all working properly. Luckily I was able to run all weekend without any crashes, so it looks like the motherboard was the culprit behind my STOP 9C errors.