Help! My Server is Shutting Down for No Apparent Reason
Hello - Rob here with the GES team, and I have this nugget to pass on to you. I recently worked an issue where a Windows server rebooted intermittently for no apparent reason. The Windows System Event log yielded no clues other than this Event ID 6008:
Log Name: System.evt
Source: EventLog
Date: 25-8-2008 19:06:58
Event ID: 6008
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: A2A000001
Description: The previous system shutdown at 6:54:04 PM on 8/25/2008 was unexpected.
There were no other symptoms or patterns that the unexpected shutdown could be related to. The shutdown could occur at any time of day. Eventually we attached a debugger to see if we could catch anything, but this wasn’t successful. Next we looked at the manufacturer’s error-logging mechanism and found this piece of information:
An Unrecoverable System Error has occurred (Error code 0x0000002D, 0x00000000)
Note - each vendor has their own way of handling error codes. We noticed a one-to-one relationship between the vendor error above and the Event ID 6008 messages in the Windows System Event log. So we engaged the hardware vendor, who determined that this error indicated a fault on the PCI bus. They also informed us that this kind of error asserts an NMI on the bus.
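The one-to-one check described above can be sketched in a few lines of Python. This is only an illustration: the helper names and the five-minute tolerance are assumptions of mine, the date format is assumed to match the US-style description text shown in the event above, and the vendor log timestamps are assumed to have been extracted already, since every vendor formats its log differently.

```python
import re
from datetime import datetime, timedelta

# Matches the description text of Event ID 6008 as it appears in the event above.
SHUTDOWN_RE = re.compile(
    r"previous system shutdown at (\d{1,2}:\d{2}:\d{2} [AP]M) on (\d{1,2}/\d{1,2}/\d{4})"
)


def parse_unexpected_shutdown(description):
    """Return the shutdown time named in a 6008 description, or None if absent."""
    match = SHUTDOWN_RE.search(description)
    if match is None:
        return None
    return datetime.strptime(
        f"{match.group(1)} {match.group(2)}", "%I:%M:%S %p %m/%d/%Y"
    )


def unmatched_shutdowns(shutdown_times, vendor_times, tolerance=timedelta(minutes=5)):
    """Return the shutdown times that have no vendor log entry within `tolerance`."""
    return [
        s for s in shutdown_times
        if not any(abs(s - v) <= tolerance for v in vendor_times)
    ]
```

If unmatched_shutdowns returns an empty list for every 6008 event collected, each unexpected shutdown lines up with a vendor error, which is the pattern that led us to the hardware vendor.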
To narrow down which component was causing the error, we set the NMICrashDump DWORD value under the following key in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl
This is described in detail in the article, “927069 How to generate a complete crash dump file or a kernel crash dump file by using an NMI on a Windows-based system”
https://support.microsoft.com/default.aspx?scid=kb;EN-US;927069
This registry value causes the machine to bugcheck with a STOP 0x80 (NMI_HARDWARE_FAILURE) when Windows detects an NMI, which produces a dump file or, if a debugger is attached, breaks into the debugger.
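For those who prefer to script the change, here is a minimal sketch that builds a .reg fragment for this value. The function name is hypothetical, and the data value of 1 comes from KB 927069 rather than from anything stated in this post, so check the article for your Windows version before importing anything.

```python
def nmi_crash_dump_reg(enabled=True):
    """Build the text of a .reg file that sets NMICrashDump under CrashControl."""
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"
    value = 1 if enabled else 0  # 1 is the setting given in KB 927069 (assumption here)
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"NMICrashDump"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(nmi_crash_dump_reg())
```

The script only prints the fragment; saving it to a file and importing it is left as a manual step so that nothing touches the registry by accident.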
After setting this registry value we hooked up the debugger again and waited... after a while we got lucky because the debugger intercepted a STOP 0x80!
At that time, I ran “!pci 0x102 ff” to get an overview of the various PCI devices and their respective states. The !pci output included the following (VendorID and DeviceID have been removed):
PCI Configuration Space (Segment:0000 Bus:00 Device:1e Function:00)
Common Header:
00: VendorID <vendor>
02: DeviceID <device>
04: Command 0147 IOSpaceEn MemSpaceEn BusInitiate PERREn SERREn
06: Status 4010 CapList SERR
08: RevisionID d9
09: ProgIF 01 Subtractive
0a: SubClass 04 PCI-PCI Bridge
0b: BaseClass 06 Bridge Device
0c: CacheLineSize 0000
0d: LatencyTimer 00
0e: HeaderType 01
0f: BIST 00
10: BAR0 00000000
14: BAR1 00000000
18: PriBusNum 00
19: SecBusNum 01
1a: SubBusNum 01
1b: SecLatencyTmr 20
1c: IOBase 20
1d: IOLimit 30
1e: SecStatus 6280 FB2BCapable InitiatorAbort SERR DEVSELTiming:1
20: MemBase f7e0
22: MemLimit f7f0
24: PrefMemBase d801
26: PrefMemLimit dff1
28: PrefBaseHi 00000000
2c: PrefLimitHi 00000000
30: IOBaseHi 0000
32: IOLimitHi 0000
34: CapPtr 50
38: ROMBAR 00000000
3c: IntLine ff
3d: IntPin 00
3e: BridgeCtrl 000b PERRREnable SERREnable VGAEnable
We couldn't have gone much further without the vendor's assistance. They informed us that the SERR flag in the Status register (offset 06) indicates that a PCI System Error has occurred in this PCI-PCI Bridge. At this point I had enough conclusive data to pass my findings to the hardware vendor for full collaboration on the problem. They continued investigating the issue.
It should be noted that a hardware problem is not the only reason for an Event ID 6008. A quick search in the Microsoft Knowledge Base shows other things that can cause this Event ID to appear in the Windows System Event log.
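To show where the SERR flag comes from, here is a rough Python sketch that decodes a 16-bit PCI Status value using the standard PCI status register bit layout. The flag names are my own labels and only partly match what the debugger prints (for example, the debugger shows bit 13 as InitiatorAbort), so treat it as an aid for reading the dump rather than a copy of what !pci does.

```python
# Bit positions of the standard PCI Status register. Names are illustrative labels.
STATUS_BITS = {
    3: "InterruptStatus",
    4: "CapList",
    5: "66MHzCapable",
    7: "FB2BCapable",
    8: "MasterDataParityError",
    11: "SignaledTargetAbort",
    12: "ReceivedTargetAbort",
    13: "ReceivedMasterAbort",
    14: "SERR",  # signaled system error (received system error in SecStatus)
    15: "DetectedParityError",
}


def decode_status(value):
    """Return (list of set flag names, DEVSEL timing field) for a Status value."""
    flags = [name for bit, name in sorted(STATUS_BITS.items()) if value & (1 << bit)]
    devsel_timing = (value >> 9) & 0x3  # bits 9-10
    return flags, devsel_timing


if __name__ == "__main__":
    print(decode_status(0x4010))  # Status from the dump above
    print(decode_status(0x6280))  # SecStatus from the dump above
```

For the Status value 4010 this reports CapList and SERR, and for the SecStatus value 6280 it reports the fast back-to-back, master abort and SERR bits with a DEVSEL timing of 1, which agrees with the debugger output.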
Comments
Anonymous
June 11, 2009
Thanks for giving this info. I didn't know about this one and it's critical.
Anonymous
July 13, 2009
I am getting the same error message. My server rebooted itself 5 times in 2 days and cannot figure out why.