Interpreting a WHEA error for an MCA fault

Howdy fellow debuggers! This is Graham McIntyre. I am an Escalation Engineer in Platforms Global Escalation Services.  From time to time we get questions from customers who have hit a WHEA bug check 0x124, or a WHEA system event, and want help interpreting the error record. The information here applies to Windows Server 2008 / Vista SP1 and Windows Server 2008 R2 / Windows 7.

 

I thought I would go through an example error record, point out some commonly asked questions, and show you how to find specific information on the error.  In many cases the information is specific to a particular processor or hardware vendor, and the customer will need to follow up with them. But we can help to some extent by parsing the data.

 

For an initial primer on WHEA and hardware error reporting, I suggest reading this whitepaper:

https://www.microsoft.com/whdc/system/pnppwr/WHEA/wheaintro.mspx

 

I’ll provide some further links to some specific WHEA information along the way.

 

Getting Started:

A WHEA bug check 0x124, WHEA_UNCORRECTABLE_ERROR, indicates that a fatal hardware error has occurred.  The bug check parameters give you further information on the WHEA error record generated.

 

In this example case, the first parameter was 0 so this indicates that this is a Machine Check Exception (MCE).  An MCE is generated by certain classes of processors, such as Intel and AMD 64-bit processors.

 

Checking the help included with the Debugging Tools for Windows for Bug Check 0x124 shows this meaning for the parameters:

Parameter 1: 0x0

Parameter 2: Address of WHEA_ERROR_RECORD structure

Parameter 3: High 32 bits of MCi_STATUS MSR for the MCA bank that has the error.

Parameter 4: Low 32 bits of MCi_STATUS MSR for the MCA bank that has the error.

Cause of Error: A machine check exception occurred. These parameter descriptions apply if the processor is based on the x64 architecture, or the x86 architecture that has the MCA feature available (for example, Intel Pentium Pro, Pentium IV, or Xeon).
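Since parameters 3 and 4 each carry half of the MCi_STATUS value, it can help to glue them back together before decoding anything. Here is a minimal Python sketch of that step. The function name is my own, and the two parameter values are an assumption: the post does not show the raw bug check parameters, so I took them from the Status value (0xfa00000000400405) that !errrec prints further down.

```python
def combine_mci_status(param3: int, param4: int) -> int:
    """Rebuild the 64-bit MCi_STATUS value from bug check parameters 3 and 4.

    param3 holds the high 32 bits and param4 the low 32 bits.
    """
    return ((param3 & 0xFFFFFFFF) << 32) | (param4 & 0xFFFFFFFF)


# Assumed parameter values, split out of the Status field in the !errrec output.
status = combine_mci_status(0xFA000000, 0x00400405)
print(hex(status))  # 0xfa00000000400405
```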

 

There are 2 useful debugger commands for debugging a WHEA error:

!whea – displays top level WHEA information

!errrec – dumps a specific WHEA error record

 

Since we already have an address of the error record in Parameter 2, we can dump it out directly with !errrec. 

31: kd> !errrec fffffa8064341028
===============================================================================
Common Platform Error Record @ fffffa8064341028
-------------------------------------------------------------------------------
Record Id     : 01cb65718c829130
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type : Machine Check Exception
Timestamp     : 10/11/2010 7:11:22
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa80643410a8
Section       @ fffffa8064341180
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Micro-Architectural Error
Flags         : 0x00
CPU Version   : 0x00000000000206e6
Processor ID : 0x0000000000000037

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa80643410f0
Section       @ fffffa8064341240
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000037
CPU Id        : e6 06 02 00 00 08 20 37 - bd e3 bc 00 ff fb eb bf
            00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa8064341240

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa8064341138
Section       @ fffffa80643412c0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error : Internal unclassified (Proc 31 Bank 5)
Status      : 0xfa00000000400405

 

As you can see from the output, a WHEA error record is made of several sections.  Each section is actually a sub-section of the one above it. The sections go from most generic, to most specific, based on the exact type of error which occurred.

CPER / WHEA record – this is defined in Appendix N of the UEFI spec version 2.2 (these can be obtained from www.uefi.org)

The format of most of the sections is defined in the UEFI Spec version 2.2 as part of the Common Platform Error Record (CPER) definition.  The last section describes a Machine Check Architecture (MCA) error, whose format is defined by the processor manufacturer.  In this case, it is an Intel processor.

MCA information - The format of the last part of the record (MCA) is defined in the Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A.
Chapter 15 describes the MCA format and structure. Appendix E in Volume 3B has additional details on interpreting Machine-Check error codes.

Let’s take a look at what each of the sections represents:

An error record is described by a WHEA_ERROR_RECORD structure, the error record header is described by a WHEA_ERROR_RECORD_HEADER structure, and the error record section descriptors are each described by a WHEA_ERROR_RECORD_SECTION_DESCRIPTOR structure.

The CPER record header (the WHEA_ERROR_RECORD_HEADER mentioned above) describes the severity and type of error.  In this case it is a fatal Machine Check Exception (MCE).

Section 0 is a Generic Processor error. This error record section contains processor error data that is not specific to a particular processor architecture. The data that is contained in this section is described by the WHEA_PROCESSOR_GENERIC_ERROR_SECTION structure.

Section 1 is an x86/x64 Processor Error. This error record section contains processor error data that is specific to the x86 or x64 processor architecture. The data that is contained in this section is described by the WHEA_XPF_PROCESSOR_ERROR_SECTION structure.

Section 2 is of type WHEA_XPF_MCA_SECTION and contains the machine check and other machine-specific register information. The actual structure which holds the MCA data is a Microsoft specific extension of the CPER specification.  We build this record by reading the Machine Specific Registers (MSRs) which are processor specific, and filling in the fields.  These (and many of the above) are defined in the header file cper.h in the SDK.
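To see how the header, descriptors, and sections fit together, you can check the addresses in the !errrec output with a little arithmetic. The sketch below assumes the sizes from the CPER definition (a 128-byte record header followed by one 72-byte section descriptor per section). The function names are mine, and the offsets and lengths are copied from the output above.

```python
CPER_HEADER_SIZE = 128        # bytes, record header (assumed from the CPER definition)
SECTION_DESCRIPTOR_SIZE = 72  # bytes, one descriptor per section (same assumption)


def descriptor_addresses(record_base: int, section_count: int) -> list[int]:
    """Descriptors are packed back to back right after the record header."""
    first = record_base + CPER_HEADER_SIZE
    return [first + i * SECTION_DESCRIPTOR_SIZE for i in range(section_count)]


def section_addresses(record_base: int, offsets: list[int]) -> list[int]:
    """Each descriptor's Offset field is relative to the start of the record."""
    return [record_base + offset for offset in offsets]


record_base = 0xFFFFFA8064341028
layout = [(344, 192), (536, 128), (664, 264)]  # (Offset, Length) per section

descriptors = descriptor_addresses(record_base, len(layout))
sections = section_addresses(record_base, [offset for offset, _ in layout])
for number, (desc, sect) in enumerate(zip(descriptors, sections)):
    print(f"Section {number}: Descriptor @ {desc:016x}  Section @ {sect:016x}")

last_offset, last_length = layout[-1]
print("Record length:", last_offset + last_length)  # 928
```

The computed addresses line up with the Descriptor and Section values that !errrec printed, and the last offset plus its length gives the 928-byte record length from the header.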

Some of the questions which I was asked about this record, and their answers:

1. Why is the processor number (31) listed in the MCA record (Section 2) different than the processor id / APIC ID (37) in sections 0 and 1?

The answer to this one is that the numbers have different meanings, and different sources.  The one in sections 0 and 1 is the initial APIC ID of the CPU which reported the machine check.  The APIC ID for a logical CPU is set by the hardware on boot.  The processor number in Section 2 is the logical processor number (the value returned from KeGetCurrentProcessorNumberEx) of the processor which is creating the WHEA error record. This may or may not be the same processor which reported the machine check error, depending on the IRQL at which the processor generating the error was running.  If the IRQL was < DISPATCH_LEVEL, then the record creation is scheduled to run on the reporting processor.  Otherwise, the error may be logged on a different processor.

How do you map APIC IDs to logical IDs?
One way is using the !smt debugger extension.  This shows the APIC IDs and logical CPU number for all CPUs.

No PRCB             SMT Set                                                                             APIC Id
0 fffff8000da3ee80 **-------------------------------------------------------------- (0000000000000003) 0x00000080
1 fffff8800260e180 **-------------------------------------------------------------- (0000000000000003) 0x00000081

2.  How do you interpret the MCA error “Internal unclassified (Proc 31 Bank 5)”?

In order to make sense of these, you need to determine a few pieces of information, then refer to the specific processor guide.

As I mentioned previously, for this particular system, it is an Intel system so these are the resources you need to use:

Section 15 in the  Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A
Appendix E in Volume 3B has additional details on interpreting Machine-Check error codes

 

a. CPU ID – What Family, Model, and Stepping is the CPU?

!cpuid can show you this.  Or, you can parse it from the CPU ID in section 1.  In this case it is:
CPU Id        : e6 06 02 00 00 08 20 37 - bd e3 bc 00 ff fb eb bf // Family 6, Model 2e, stepping 6

Table B-1 in Appendix B of the Intel guide says that this Family and Model is an “Intel Xeon Processor 7500 Series”
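If you would rather not pick the family, model, and stepping out of the bytes by eye, the decoding is easy to script. This is a small Python sketch of the standard CPUID signature layout (the EAX value from CPUID leaf 1). The function name is my own, and the input is the first four CPU Id bytes from section 1, read as a little-endian value. It matches the CPU Version field in section 0.

```python
def decode_cpuid_signature(eax: int) -> tuple[int, int, int]:
    """Split a CPUID leaf 1 EAX signature into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family only applies when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is prepended for families 6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# First four CPU Id bytes from section 1, stored little-endian.
signature = int.from_bytes(bytes.fromhex("e6060200"), "little")
family, model, stepping = decode_cpuid_signature(signature)
print(f"Family {family:x}, Model {model:x}, Stepping {stepping:x}")
# Family 6, Model 2e, Stepping 6
```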

 

b. What is the MCA Error code?

In order to find this out, we need to parse the MCi_STATUS structure.  The ‘i’ is used in the Intel guides as a placeholder for the bank number.  An error bank is a processor specific set of MSRs.  For some banks, the type of error they represent is publicly documented, and for some it is not.  If the bank is not documented, then you will need to check with the processor manufacturer.

 

Now that we know the processor family and model, we can look up the meaning of the specific bank of registers.  These are listed in this form: MSR_MCi_STATUS.  So since we know the bank number is 5, we can find the meaning of MSR_MC5_STATUS.  Here’s what the Intel guide shows:

 

Table B-5. MSRs in the Intel® Microarchitecture Codename Nehalem

Register (hex) Register (dec) Register Name Scope Bit Description
414H 1044 MSR_MC5_CTL Core See Section 15.3.2.1, “IA32_MCi_CTL MSRs.”
415H 1045 MSR_MC5_STATUS Core See Section 15.3.2.2, “IA32_MCi_STATUS MSRS.”
416H 1046 MSR_MC5_ADDR Core See Section 15.3.2.3, “IA32_MCi_ADDR MSRs.”
417H 1047 MSR_MC5_MISC Core See Section 15.3.2.4, “IA32_MCi_MISC MSRs.”
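The register numbers in the table follow a simple pattern, which is handy if you need the MSRs for a different bank. As a rough sketch, assuming the architectural layout where bank 0 starts at MSR 400H and each bank takes four consecutive registers (CTL, STATUS, ADDR, MISC), you can compute them like this. The function and constant names are mine.

```python
IA32_MC0_CTL = 0x400     # first MSR of bank 0 (assumed architectural base)
REGISTERS_PER_BANK = 4   # CTL, STATUS, ADDR, MISC


def bank_msrs(bank: int) -> dict[str, int]:
    """Return the MSR addresses for one machine check bank."""
    base = IA32_MC0_CTL + REGISTERS_PER_BANK * bank
    return {"CTL": base, "STATUS": base + 1, "ADDR": base + 2, "MISC": base + 3}


for name, address in bank_msrs(5).items():
    print(f"MC5_{name}: {address:X}H ({address})")
# MC5_CTL: 414H (1044)
# MC5_STATUS: 415H (1045)
# MC5_ADDR: 416H (1046)
# MC5_MISC: 417H (1047)
```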

 

Now, referring to Section 15.3.2.2, we can decode the value:

MCI_STATUS
+0x000 McaErrorCode     : 0x405  // binary:  0000 0100 0000 0101
+0x002 ModelErrorCode   : 0x40  // binary: 0000 0000 0100 0000 // bit 22
+0x004 OtherInformation : 0y00000000000000000000000 (0)
+0x004 ActionRequired   : 0y0
+0x004 Signalling       : 0y0
+0x004 ContextCorrupt   : 0y1
+0x004 AddressValid     : 0y0
+0x004 MiscValid        : 0y1
+0x004 ErrorEnabled     : 0y1
+0x004 UncorrectedError : 0y1
+0x004 StatusOverFlow   : 0y1
+0x004 Valid            : 0y1
+0x000 QuadPart         : 0xfa000000`00400405
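If you only have the raw status value and no debugger handy, the same breakdown can be reproduced with a few shifts and masks. The sketch below is my own illustration, not how the debugger does it. The field names are borrowed from the output above, and the bit positions are the ones I am assuming from the IA32_MCi_STATUS layout (bit 63 is Valid, bits 15:0 are the MCA error code, and so on).

```python
# Single-bit flags and their assumed positions in IA32_MCi_STATUS.
FLAG_BITS = {
    "ActionRequired": 55,
    "Signalling": 56,
    "ContextCorrupt": 57,
    "AddressValid": 58,
    "MiscValid": 59,
    "ErrorEnabled": 60,
    "UncorrectedError": 61,
    "StatusOverFlow": 62,
    "Valid": 63,
}


def decode_mci_status(status: int) -> dict[str, int]:
    """Break a 64-bit MCi_STATUS value into its fields."""
    fields = {
        "McaErrorCode": status & 0xFFFF,             # bits 15:0
        "ModelErrorCode": (status >> 16) & 0xFFFF,   # bits 31:16
        "OtherInformation": (status >> 32) & 0x7FFFFF,  # bits 54:32
    }
    for name, bit in FLAG_BITS.items():
        fields[name] = (status >> bit) & 1
    return fields


for name, value in decode_mci_status(0xFA00000000400405).items():
    print(f"{name:<17}: {value:#x}")
```

For the status in this record, that gives McaErrorCode 0x405 and ModelErrorCode 0x40, with every flag set except ActionRequired, Signalling, and AddressValid.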

 

Section 15.9 discusses how to interpret these error codes.  From Table 8, “IA32_MCi_Status [15:0] Simple Error Code Encoding”, the meaning is given as:

Internal Unclassified 0000 01xx xxxx xxxx Internal unclassified errors.

 

This is why the error shows as “Internal Unclassified”.  Since this is not a publicly documented code, the next step would be to contact Intel for further information.  But, at least now you have verified the information and will have good data to send to the hardware manufacturer.  In other cases, the bank and MCA code may be more clearly documented and further action could be taken.
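The bit pattern from the Intel table can be turned into a mask and value, which makes it easy to check an error code against it. This is just a sketch of that matching technique with helper names of my own. It only knows about the one pattern quoted above, so it is not a full decoder for the Intel table.

```python
def pattern_to_mask(pattern: str) -> tuple[int, int]:
    """Convert a pattern like '0000 01xx xxxx xxxx' into (mask, value).

    Fixed bits become 1 in the mask; 'x' (don't care) bits become 0.
    """
    bits = pattern.replace(" ", "")
    mask = int("".join("0" if b == "x" else "1" for b in bits), 2)
    value = int("".join("0" if b == "x" else b for b in bits), 2)
    return mask, value


def matches(code: int, pattern: str) -> bool:
    """True if the fixed bits of the pattern match the error code."""
    mask, value = pattern_to_mask(pattern)
    return (code & mask) == value


INTERNAL_UNCLASSIFIED = "0000 01xx xxxx xxxx"
print(matches(0x405, INTERNAL_UNCLASSIFIED))  # True
```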

 

Further Reading:

There is more information regarding WHEA on MSDN and in several WinHEC conference presentations on the Microsoft site:

WHEA Platform Implementation

WHEA System Design and Implementation

 

I hope this information was useful to understand how to interpret WHEA and MCA error codes. Until next time!

Comments

  • Anonymous
    January 29, 2011
    I had this problem when my client motherboard died last summer.  I went through much the same routine you did though you found some new things I wasn't aware of.  My blog has details on what I did: davidcmoisan.wordpress.com/.../bad-hardware-day-more-on-hardware-bluescreens Most importantly, I have URL's for technical reference for the AMD chips--it was harder to find that information as I run AMD in my home office.

  • Anonymous
    February 15, 2011
    Great post! Question: In the WHEA paper there is discussion about applications that can capture all of these events i.e "error management applications". Are there any authoritative apps you would recommend to use ?  -Alex [Alex –The WHEA error management applications that I’ve seen are released by OEMs and shipped as part of their server management suite.  I don’t know of any ‘all purpose’ WHEA error management apps to recommend, though.]

  • Anonymous
    October 26, 2014
    Excellent post, very useful.