Crash Dump Analysis
I'm sure that many of you have had the unfortunate experience of watching the Windows Blue Screen Of Death (BSOD) while working, and possibly have lost important data. A common reaction in this case is to blame Microsoft and continue working after the following reboot, as if nothing had happened. Another unfortunate experience is to see an application crash while you are using it. In this case, a window comes up asking whether you want to send the data to Microsoft for analysis (the same window also comes up after a BSOD). Many people are wary of the contents of the data that is sent over the network, so they select "No". The goal of this post is to help you understand what is going on in the background in each of the two cases and to clarify some misconceptions.
QUESTION 1: What causes all these reboots?
First of all, because of the architecture of the Windows kernel in the NT/2000/XP/Vista series, an application cannot corrupt data that belongs to another application or to the kernel. This means that each application is isolated and cannot harm the system. The worst thing that can happen is that the application does something invalid and crashes without any further implications for the rest of the system. On the other hand, the Windows kernel and the device drivers have unlimited access to the system. If the kernel or a driver misbehaves, then it can corrupt the whole system. The immediate result of this is that the cause of a blue screen lies either in the Windows kernel or in a device driver (or, less often, in faulty hardware). That's why, whenever an application crashes, the system keeps working without a problem, whereas if there is a bug in the kernel or in a device driver, the whole system goes down.
Now that we've identified the possible causes of the crashes, it's time to go even further. According to the reports that were sent to Microsoft until April 2004 (by all those people who pressed "Yes" when they were asked to send the data to Microsoft), the causes of the crashes can be split as follows:
- Third-party device drivers: 70%
- Unknown, because of severe memory corruption: 15%
- Hardware error: 10%
- Microsoft code: 5%
This shows that Microsoft is not the one to blame. The main cause for these crashes is poorly written third-party (non-Microsoft) device drivers.
QUESTION 2: Why does the system crash, when there is a kernel-mode error?
The analysis above shows that the system can survive an application crash, whereas a kernel-mode error causes it to reboot. Why can't the system continue after finding a kernel-mode error? What happens is that, while kernel-mode code (either a device driver or the Windows kernel) is executing, some discrepancy is found. For example, a pointer might be pointing to an invalid address, a data structure might have invalid values, etc. Even though this problem was detected, the "bad" code that caused it might already have corrupted more data, possibly including basic kernel structures, and continuing to run could make the damage worse. That's why the function KeBugCheckEx is called (inside the kernel, this type of forced crash is called a "bugcheck"): it writes logging information to the page file, paints the blue screen and shows some information about the crash on the screen.
QUESTION 3: How do we configure, whether we want to send anything to Microsoft?
Although it's not well known, you can configure what will be sent to Microsoft. Somebody might want to report only the crashes that have to do with the operating system (after a BSOD). Somebody else might want to report only the crashes from applications (either for all of them or only for some particular applications). In order to configure this, you need to go to
Control Panel | System | Advanced | Error Reporting
From there you'll be able to select which types of errors you want to report. Actually, even if you select a particular type of error to be reported, Windows will ask you again, after a program that belongs in that category crashes. So, by selecting a category, it doesn't mean that all crashes will be reported automatically.
QUESTION 4: Exactly what kind of data is sent to Microsoft?
Before answering this question, it is useful to understand what information is stored when the system or an application crashes. Let's talk first about the files that are generated after a system crash. In order to configure them, you need to go to
Control Panel | System | Advanced | Startup and Recovery -> Settings
In the "System Failure" part you'll be able to configure whether you want to write the crash to the system log, whether you want to send an administrative alert and whether you want to reboot after the crash. Also, you have the option of creating a memory dump and you can select the directory in which you want to save it. There are 3 types of memory dumps:
- Small memory dump or minidump (64 KB for 32-bit systems, 128 KB for 64-bit systems): It includes a minimum amount of information about the system before the crash, e.g. the bugcheck code, the loaded drivers, information for the current process and thread, etc.
- Kernel memory dump: This includes all the kernel-mode memory that was in physical memory at the time of the crash. There is no default size for this dump, however it should be around 50-100MB for most "normal" systems (with <= 2GB of RAM).
- Full memory dump: This includes all the physical memory. The size of the file will be the same as the physical memory of the system.
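The options in this dialog are, as far as I know, stored in the registry under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl. The following is only a minimal Python sketch for inspecting them, assuming the usual value names (CrashDumpEnabled, LogEvent, SendAlert, AutoReboot, DumpFile, MinidumpDir). The function names are my own and hypothetical, and the registry read works only on Windows.

```python
# Meaning of the CrashDumpEnabled value (assumed location:
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl).
DUMP_TYPES = {
    0: "none",
    1: "full memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
}


def describe_crash_control(values):
    """Turn raw registry values (a plain dict) into readable settings."""
    dump_type = values.get("CrashDumpEnabled", 0)
    return {
        "dump_type": DUMP_TYPES.get(dump_type, "unknown (%d)" % dump_type),
        "write_event_to_system_log": bool(values.get("LogEvent", 0)),
        "send_administrative_alert": bool(values.get("SendAlert", 0)),
        "reboot_after_crash": bool(values.get("AutoReboot", 0)),
        "dump_file": values.get("DumpFile"),
        "minidump_directory": values.get("MinidumpDir"),
    }


def read_crash_control():
    """Read the live settings into a dict. Windows only (uses winreg)."""
    import winreg  # imported lazily so the helper above works anywhere

    path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        index = 0
        while True:
            try:
                name, data, _value_type = winreg.EnumValue(key, index)
            except OSError:
                break  # no more values under this key
            values[name] = data
            index += 1
    return values
```

On a Windows machine, describe_crash_control(read_crash_control()) should show the same choices that the dialog displays.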
Here it's worth mentioning that at the time of the crash, the information is written to the pagefile. Therefore, the pagefile must be configured to be larger than the size of the dump. This might be a problem especially in the case of the Full memory dump. In order to set the size of the pagefile, you need to go to
Control Panel | System | Advanced | Performance Settings -> Advanced | Virtual Memory -> Change
After the system reboots, the information is copied from the page file to the file that was specified above. The reason that the information is not written directly to that file is that the kernel doesn't know the root of the problem at the time of the crash, so it tries to use as few drivers as possible (the pagefile is already open and in use, so theoretically it's the safest destination).
On the other hand, in order to configure the data that is stored when an application crashes, you need to open a command prompt (or go to Start | Run) and execute drwtsn32.exe. This application is called Dr. Watson and there you'll be able to configure the destination file for the dump, as well as the type of the dump. Here the only options are the minidump and the full memory dump. There is also an option to create an old-style NT-compatible full memory dump. In addition, in the text area "Application Errors" you can look at the application crashes that have been logged on the system. You can select any of them and click "View" if you are interested in looking at exactly what the log file includes.
So, now it's time to answer the initial question: What exactly is sent to Microsoft? The answer is that regardless of the crash dump type that you have selected, the only thing that is sent to Microsoft is a minidump (both for kernel-mode and user-mode crashes). Of course, it would be impractical to send a 100MB kernel-mode dump or a 1GB full memory dump, so that's why only the 64 KB minidump is sent. Apart from the minidump, the information also includes an XML file with basic information about the version of the operating system and the loaded drivers, which you can look at when you are prompted to decide whether you want to send the data to Microsoft. There is no personal private information or anything like that. In fact, you can open the minidumps and check the included information. I'll also show a way of analyzing the minidumps and looking at the data that Microsoft has access to.
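The pagefile requirement can be turned into a quick sanity check. This is only a sketch under the assumptions stated in this post: the minidump sizes and "the pagefile must be larger than the dump" come from the text above, the 100MB kernel dump figure is just the rough estimate mentioned earlier, and the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def expected_dump_size(dump_type, ram_bytes, is_64bit=False, kernel_estimate=100 * MB):
    """Rough size of a crash dump, using the figures quoted in this post."""
    if dump_type == "mini":
        return (128 if is_64bit else 64) * KB
    if dump_type == "kernel":
        # No fixed size exists; the caller supplies (or accepts) an estimate.
        return kernel_estimate
    if dump_type == "full":
        return ram_bytes  # a full dump contains all of physical memory
    raise ValueError("unknown dump type: %r" % dump_type)


def pagefile_is_large_enough(pagefile_bytes, dump_type, ram_bytes, **kwargs):
    """True if the pagefile is strictly larger than the expected dump."""
    return pagefile_bytes > expected_dump_size(dump_type, ram_bytes, **kwargs)
```

For example, with 2GB of RAM a 1.5GB pagefile would pass the check for a kernel dump but not for a full memory dump.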
QUESTION 5: What does Microsoft do with this information?
When the minidump is received by Microsoft, it goes through some preprocessing and is stored on a server. If many minidumps that seem to have the same problem are received, then there is a team that analyzes them and finds the root of the problem. Afterwards, a webpage is created that shows exactly what the cause of the problem is. Most often it points to the driver causing the problem and gives a link to the manufacturer's webpage, so that a new version can be downloaded (if it exists). After the webpage is created, anybody who submits a dump that has the same problem is shown the corresponding webpage, which will help them find the solution. Therefore, by clicking "Yes" when asked to submit a crash dump, you might either find the solution to the problem or help Microsoft find the solution and present it to users in the future.
QUESTION 6: How can we analyze a crash dump?
Fortunately, there is a tool that can be used to analyze both the user-mode and the kernel-mode crash dumps: windbg. Microsoft has included the dump analysis algorithms in windbg, so in some basic cases it's easy to find the cause of the problem. Of course, there are many cases in which several data structures are corrupted and it's impossible to pinpoint the problem automatically. In those cases, more advanced manual methods are used by the Online Crash Analysis (OCA) team at Microsoft.
In order to perform the analysis yourself (either because you are unable to submit the data or because there is no answer on Microsoft's website), you need to open windbg, go to "File" | "Open Crash Dump" and select the dump file that you want to analyze. As I wrote in my previous post, you need to set the path to the symbol files and reload them. If this step is omitted, then it won't be possible to analyze the file.
The next step is to execute the command (this might take some time):
!analyze -v
At the top of the output you'll see the bugcheck code, its description and some additional information (e.g. the address of the invalid memory that was accessed, whether it was a Read or a Write operation, etc). Further down you'll see the call stack of the dump under the title STACK_TEXT. This includes all the functions that were on the stack when the crash occurred. The function at the top is the most recent one (it was called by the function below it, which was called by the function below it, etc), whereas the function at the bottom is the oldest function in the stack. The reason that the system crashed is that one of these functions did something invalid (e.g. passed or received an invalid argument that forced it to perform an invalid operation). Of course it's possible that the data was corrupted by another function that is not in the call stack. Fortunately, windbg has already done an automated analysis and points to the module that most probably caused the crash. You should look at the following fields:
- SYMBOL_NAME: Exactly where the invalid operation was caused (module + function)
- MODULE_NAME: The name of the module that caused the crash
- IMAGE_NAME: The file, in which the problematic code resides
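If you save the output of !analyze -v to a text file, these three fields are easy to pull out with a script. The sketch below assumes that each field appears on its own line as "NAME: value". The sample text is made up for illustration (it is not real debugger output) and the function name is hypothetical.

```python
import re

FIELDS = ("SYMBOL_NAME", "MODULE_NAME", "IMAGE_NAME")

# Made-up excerpt in the general shape of "!analyze -v" output.
SAMPLE_OUTPUT = """\
STACK_TEXT:
(call stack lines would appear here)

SYMBOL_NAME:  problematic_driver+1a2b
MODULE_NAME: problematic_driver
IMAGE_NAME:  problematic_driver.sys
"""


def extract_fields(analyze_output):
    """Return a dict with the first occurrence of each field of interest."""
    found = {}
    for line in analyze_output.splitlines():
        match = re.match(r"^\s*([A-Z_]+):\s*(\S.*?)\s*$", line)
        if match and match.group(1) in FIELDS:
            # setdefault keeps the first value if a field is repeated
            found.setdefault(match.group(1), match.group(2))
    return found


if __name__ == "__main__":
    for name, value in extract_fields(SAMPLE_OUTPUT).items():
        print(name, "=", value)
```

Lines with a field name but no value on the same line (such as STACK_TEXT in the sample) are ignored by the pattern.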
In order to find more information about the problematic code you can execute:
lm kv m MODULE_NAME*
For example, if MODULE_NAME is problematic_driver, you should execute:
lm kv m problematic_driver*
lm stands for "list modules", k stands for "kernel modules", v stands for verbose and m stands for "match".
Another option is to take the problematic file name from the IMAGE_NAME field, search for it on the hard drive and either look at its properties, in order to identify its manufacturer, or search for it on the internet. Afterwards, you might need to update the buggy driver.
Of course, it's possible that windbg's automatic analysis was not able to pinpoint the faulting driver. The reason for that might be that the call stack was corrupted or that some important code or data structure was overwritten or that there was a memory leak, etc. In that case, you might want to try some manual methods in order to detect the problem. From this point there is no automated way to proceed, so I can just provide a few useful commands that might help you find more information about your system.
First you can execute
!process 0 0
or
!process
The first command prints information about all the running processes. This way you might be able to find a suspicious process. The second command shows information about the current process. If you execute
!process <process_address>
then you'll see more information about the particular process. You can find the addresses of the processes from the "!process 0 0" command. The process information includes information about its threads. You can find more information about each thread by executing
!thread <thread_address>
and if the thread belongs to a driver with pending IRPs, then you can find more information about them by executing
!irp <irp_address>
Also, as I wrote above, in order to see all the loaded modules you can try
lm kv
In order to find more information about the used memory (and possibly detect memory leaks), you can execute:
!vm
and
!memusage
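If you have many dump files, it can be convenient to run a fixed set of these commands non-interactively. The sketch below only builds the command line. It assumes the console debugger kd.exe from the Debugging Tools for Windows is on the PATH and that its -z (dump file), -y (symbol path) and -c (initial command) options behave as I remember them. The dump path is a placeholder, and the symbol path points at Microsoft's public symbol server with a local cache directory that is my own choice.

```python
DEFAULT_SYMBOL_PATH = r"srv*c:\symbols*https://msdl.microsoft.com/download/symbols"


def build_kd_command(dump_path, commands,
                     symbol_path=DEFAULT_SYMBOL_PATH, debugger="kd"):
    """Build an argument list that opens a dump, runs commands and quits."""
    # Commands are separated by ';' and the final 'q' ends the session.
    script = "; ".join(list(commands) + ["q"])
    return [debugger, "-z", dump_path, "-y", symbol_path, "-c", script]


if __name__ == "__main__":
    argv = build_kd_command(r"C:\Windows\MEMORY.DMP",
                            ["!analyze -v", "lm kv", "!vm"])
    print(argv)
    # To actually run it on a machine with the debugger installed:
    #   import subprocess
    #   output = subprocess.run(argv, capture_output=True, text=True).stdout
```

The captured output can then be searched for the fields and values discussed above.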
Finally, it's possible that the system hangs and does not crash. In order to debug it, you need to force a crash. One way to do that is to go to the registry key HKLM\System\CurrentControlSet\Services\i8042prt\Parameters, create a REG_DWORD value named CrashOnCtrlScroll, set it to 1 and reboot. This works only for PS/2 keyboards (not for USB). When the system hangs, you can keep the right control key pressed and press the scroll lock key twice. This will cause a crash, which you will be able to debug using windbg.
A useful command in that case is:
!locks (prints the locks, which are currently held, provided that there is at least one additional thread waiting on them).
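The CrashOnCtrlScroll setting mentioned above can also be applied by importing a .reg file instead of editing the registry by hand. The following sketch just generates the text of such a file. The function name is hypothetical, and you should review the output before importing it (and reboot afterwards).

```python
def crash_on_ctrl_scroll_reg(enabled=True):
    """Return the text of a .reg file that sets CrashOnCtrlScroll."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters]",
        '"CrashOnCtrlScroll"=dword:%08x' % (1 if enabled else 0),
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Write the file, then import it by double-clicking it or with regedit.
    with open("crash_on_ctrl_scroll.reg", "w", newline="") as handle:
        handle.write(crash_on_ctrl_scroll_reg())
```

Calling crash_on_ctrl_scroll_reg(enabled=False) produces a file that switches the feature off again.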
Another tool that can help you, if !analyze cannot find the root cause, is the Driver Verifier. This tool enables additional system checks that make the system crash immediately when a driver does something invalid, without allowing it to corrupt more data. This way, the crash dump will point directly to the driver. In order to execute it, you need to open a command prompt with administrative privileges (or go to Start | Run) and execute "verifier.exe".
There are some small differences between the User Interface in Windows 2000, Windows XP and Vista, so I'll explain the Windows XP interface. What you'll see is a window and some tasks that you're called to select. You need to select the task "Create custom settings (for code developers)", and then "Select individual settings from a full list". After that you'll see a screen with the following options:
- Special pool: This option forces the memory allocation routines to allocate from a special pool. For example, if a driver wants to allocate 100 bytes, it is given a pointer to the last 100 bytes of an otherwise free page. The rest of the page is marked with a specific signature. Also, the pages that are before and after this particular page are marked invalid. So, if a driver tries to write something after the end of the allocated space, there will be a page fault and the Driver Verifier will crash the system immediately. If the driver writes before the beginning of the allocated space, then when it frees the memory, the Driver Verifier will check the signature of the page, find that it's invalid and crash the system. The crash dump that will be generated from this crash will point directly to the faulting driver.
- Pool tracking: Each space allocation is marked with a special tag that is different for each driver. When the driver is unloaded, the Driver Verifier will check for the corresponding tags and if it finds any, then this means that the driver has a memory leak, so the system will crash.
- Force IRQL checking: Whenever a driver raises the IRQL to DPC/dispatch level or above, the Driver Verifier will cause all the pageable memory to be paged out to disk. So, if the driver tries to access this memory, then the system will crash with a bugcheck code equal to IRQL_NOT_LESS_OR_EQUAL.
- I/O verification: All the IRPs are allocated from a special pool. If any of them is completed with an invalid I/O status, then the system crashes.
- Enhanced I/O verification (used to be "I/O Verification level 2" in Windows 2000): This includes even deeper tests for the IRPs. The I/O manager checks if the drivers complete asynchronous IRPs correctly, if they manage the device stack locations correctly and if they delete the device objects only once.
- DMA checking: The I/O manager makes sure that all the drivers configure the DMA operations correctly, otherwise it crashes the system.
- Deadlock detection: This option enables deadlock detection. When a deadlock is detected, the system crashes and you can use the !deadlock command from windbg to find more information about what is causing it.
- Low Resources simulation: 7 minutes after the boot completes (so all the drivers have been loaded), the I/O manager starts failing random memory allocations. This way, if a driver doesn't check the status of a memory allocation, the system will crash.
- Disk Integrity Verification (only in Windows Server 2003): Windows keeps checksums of written data, so after each read it checks, if the data is still valid. If there is a discrepancy, the system crashes.
- SCSI Verification (included automatically, when a SCSI miniport driver is monitored): It includes some additional SCSI-related checks.
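To make the special pool idea more concrete, here is a toy simulation in Python. It is not how the kernel implements it. It only models one page with the allocation placed at its end, a fill pattern in front of it, and invalid pages on both sides. The page size, the fill byte and all the names are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096
FILL = 0xA5  # arbitrary illustrative pattern, not the real signature


class Bugcheck(Exception):
    """Stands in for the system crash that Driver Verifier would trigger."""


class SpecialPoolPage:
    def __init__(self, size):
        if not 0 < size <= PAGE_SIZE:
            raise ValueError("this toy model handles at most one page")
        self.size = size
        self.start = PAGE_SIZE - size  # allocation ends at the page end
        self.page = bytearray([FILL]) * PAGE_SIZE
        self.page[self.start:] = bytes(size)  # the driver's usable bytes

    def write(self, offset, value):
        address = self.start + offset
        if address < 0 or address >= PAGE_SIZE:
            # The neighbouring pages are invalid: immediate page fault.
            raise Bugcheck("page fault outside the allocated page")
        self.page[address] = value  # may silently damage the signature

    def free(self):
        # An underrun is only noticed here, when the signature is checked.
        if any(byte != FILL for byte in self.page[:self.start]):
            raise Bugcheck("signature before the allocation was overwritten")
```

Writing one byte past the end of a 100-byte allocation raises Bugcheck immediately, whereas writing at offset -1 succeeds silently and is only caught by free(), which mirrors the two cases described for the special pool option.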
You should select all of the existing options, apart from the "Low Resources Simulation" (because it involves very heavy operations and the system will be really slow). In the next screen, you need to select the drivers that you want to debug. You should start by selecting drivers that you think are suspicious (either because windbg pointed to them or because the crashes started after you installed them, etc). Reboot, run the system for some time and observe its behaviour. If the system continues crashing, but not because of the drivers that you specified (i.e. the crash dump files are still vague and don't point to a specific driver), the next step is to select all the unsigned drivers. If you don't see any result, then the next step is to start enabling Driver Verifier for bigger groups of drivers, until you find the buggy one.
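The same selection can, as far as I know, be made from the command line with verifier.exe's /flags and /driver options. The sketch below builds such a command with every option enabled except the low resources simulation. The bit values are the ones I recall from the Driver Verifier documentation, so treat them as an assumption and check them against "verifier /?" on your system. The driver file name is a placeholder.

```python
# Assumed /flags bit values; verify them against "verifier /?".
VERIFIER_FLAGS = {
    "special_pool": 0x01,
    "force_irql_checking": 0x02,
    "low_resources_simulation": 0x04,
    "pool_tracking": 0x08,
    "io_verification": 0x10,
    "deadlock_detection": 0x20,
    "enhanced_io_verification": 0x40,
    "dma_checking": 0x80,
}


def verifier_command(drivers, exclude=("low_resources_simulation",)):
    """Build a verifier.exe argument list for the given driver files."""
    flags = 0
    for name, bit in VERIFIER_FLAGS.items():
        if name not in exclude:
            flags |= bit
    return ["verifier", "/flags", hex(flags), "/driver"] + list(drivers)


if __name__ == "__main__":
    print(" ".join(verifier_command(["problematic_driver.sys"])))
```

The settings take effect after a reboot. "verifier /query" should show what is currently being verified and "verifier /reset" should clear the settings again.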
Another useful tool for debugging memory leaks is poolmon, which is part of the Windows Support Tools and can be found here (for Windows XP). This tool displays information about memory allocations (both from paged pool and from non-paged pool), as well as discrepancies between allocations and deallocations. You just execute the tool from a command prompt (there is no User Interface) and select the types of memory pool that you want to look at. If the amount of allocated memory increases constantly, this means that there is a possible memory leak. You can find an overview here and a more detailed explanation here.
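The "increases constantly" check can be automated if you note down the bytes in use per pool tag at regular intervals. The sketch below works on plain dictionaries (tag to bytes in use). It does not parse poolmon's screen output, and the tags and numbers in the example are invented.

```python
def constantly_growing_tags(snapshots):
    """Return (tag, total_growth) for tags that grew in every snapshot."""
    if len(snapshots) < 2:
        return []
    suspects = []
    for tag in snapshots[-1]:
        series = [snapshot.get(tag, 0) for snapshot in snapshots]
        # Flag the tag only if every sample is larger than the previous one.
        if all(earlier < later for earlier, later in zip(series, series[1:])):
            suspects.append((tag, series[-1] - series[0]))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    samples = [
        {"Abcd": 1000, "Wxyz": 5000},
        {"Abcd": 2000, "Wxyz": 4000},
        {"Abcd": 3500, "Wxyz": 5000},
    ]
    print(constantly_growing_tags(samples))
```

A tag that shows up here is only a suspect: a driver that is legitimately busy can also grow for a while, so it is worth sampling over a longer period before drawing conclusions.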
ADDITIONAL RESOURCES:
- This tutorial provides an introductory overview of the same concepts.
- Mark Russinovich presented this video in TechEd 2006. The corresponding slides are here.
- OSR has a more advanced article on the next steps that you can follow, in order to analyze more difficult crash dumps.
Comments
Anonymous
December 12, 2006
very informative... thx
Anonymous
December 12, 2006
I like this one.
Anonymous
February 02, 2007
Excellent article but don't be so quick to move the blame away from MS. Tho 70% of the BSODS are 3rd party driver related most of the cause is the incredibly bad documentation and information MS would give driver developers in the past. That is improving tho with the mini driver implementation.
Anonymous
February 02, 2007
Hi Scott, I wouldn't like this to be a flame war here, since the main point of the article was to provide some further insight about how to debug crash dumps and not whose fault it is, when a computer crashes. Also, I am fairly new in the area of windows device drivers (and in Microsoft), so I cannot judge if what you are saying as "incredibly bad documentation and information" is correct or wrong. From my point of view, though, Microsoft has tried really hard to provide insight in the area: books, newsgroups, mailing lists, blogs, articles in magazines, conferences, requests for feedback, etc. Also, many MS developers can be contacted by email directly (apart from driver-related mailing lists), in order to provide help on any particular problem. And of course, every company can submit their drivers to the WHQL (Windows Hardware Quality Labs), where they go through additional tests by Microsoft. Finally, starting with Vista (and backporting to previous OSs), there has been a big effort to simplify driver development a lot by introducing WDF. So, I don't think that all this effort can be disregarded so lightly. It may be just my (limited) view, however I haven't seen many other companies going in such a hassle. There are many MS employees (a lot smarter than me), who are working really hard trying to help the community with driver-related problems, and I think that it's unfair to them, when somebody blames Microsoft because of a blue screen that was caused by an external non-Microsoft driver.
Anonymous
February 28, 2007
I have one server which is dump file is not creating after blue dump.I have checked all option are enabled. anyone can tell me what is reason for this ?
Anonymous
March 01, 2007
Hi Dharmendra, you can find some of the possible reasons that explain why the dump file is not created at http://support.microsoft.com/kb/130536
Anonymous
March 20, 2007
Re: "incredibly bad documentation and information" Not sure that MS is improving in this regard. After experiencing several blue screens this evening, I wondered if I could check the .dmp files to gain some insight. Opening the .dmp with C++, then start debug. VC++ complains about the dump file format. Google "windows crash dump analysis" suggests a Technet Webcast by Mark Russinovich, Chief Software Architect and Cofounder, Winternals Software (Microsoft Tech·Ed) Sounds good, but to see the broadcast, I have to register. To register for the broadcast, I first have to register for a Microsoft Live ID. But wait, I think I already have one, I can't log in though. OK, get the password emailed to me, manage to log in. Now I have to fill in a page of personal information. It says to press cancel if I don't want to provide this information, but that just kicks me back out to the "register" page. So I log in again, fill out all the information, and I finally get to a page with a link to the actual presentation. Well, not quite. I have to provide my email and company name. OK, so now I am on a page with the actual link, but it is saying that I need to be running Internet Explorer 6.1. Arghh, I have done all of this in Firefox, so I start over again, this time in IE. When I finally get to the page with the broadcast link, I am being asked to install Live Meeting. Which is furnished by a third party, and is not digitally signed. I click cancel and give up. Further Googling leads me to your blog. I have to agree with the reader who made the comment regarding "incredibly bad documentation and information"
Anonymous
March 20, 2007
Hi Ingo, Indeed Mark's webcast provides a very good insight in this subject. I also provide the url at the bottom of my post. If you had problems watching it, then I suggest that you follow the link at the webcast that asks for feedback and questions (http://go.microsoft.com/fwlink/?linkid=41781).
Anonymous
March 21, 2007
I finally managed to view Mark Russinovich's video, I gave it another try after noticing the url at the bottom of your post. Not only does he do an excellent job of explaining the subject, he is also very entertaining and humorous. Loved those shots of bluescreens in airports and ATM machines. I did try to leave feedback concerning the webcast, but I had to register and install special software (just kidding:-) Thanks.
Anonymous
October 04, 2007
Thanks for the wonderful document that explains in details about the crash and it's caresponding analysis. Might be not an exact place to query for, I was looking for some document/tool, which allows me to write the contents of the memory. To explain it elloboratly, Open Debug-> Memory, and place the pointer on to the table, you can view the memory contents, I would like to have that memory dumped or written into a file, How this can be done, For (Linux + gdb), we have seen commands like dump memory "file name" "start address" "end address" Thanks!! Sigu Joseph.
Anonymous
December 06, 2007
Please try to use fwrite on that pointer in appropriate place
Anonymous
February 21, 2008
I laughed when I read this, because Dell support basically tells you to just curse Microsoft and reboot the system whenever you get a blue screen. My new Dell Inspiron 1520 laptop constantly blue screens with Kernal errors after applications (such as a browser) hang. Vista comes back and says that it cannot find a hard drive. Dell support says that Vista is crashing due to a software error. Their recommendation: just reboot the system. What is interesting is that your data suggests that this type of error is highly unlikely to be a software problem. What is more laughable is that Microsoft wanted $60 to tell me to go back to Dell to fix their device drivers. Classic two vendor fingerpointing. Meanwhile I have a brand new $2,000 silver brick.
Anonymous
March 11, 2008
Bruce, an application CANNOT cause a blue screen. Blue screens are caused by drivers (most of the time) or the kernel (not so probable, since it has gone through lots of testing). Whenever you get a blue screen, you should find the dump file, go through the steps that I talked about in the post, find which driver is the culprit and try to update it. This should fix your problems.
Anonymous
July 25, 2008
How can I disable this crash dump?
Anonymous
November 09, 2008
thanks for writing about crash dump.i had bsod six months back and i send it to acer repair center three times ,everytime they just rebooted os but that bsod never gone .same problem and in bsod everytime there is differnet message so i get bit confuse what to do next.........can u help me???????
Anonymous
February 25, 2009
i get the error only on one specific webpage, it doesnt matter if i open the webpage using a PC or a Mac OS, it just wont open. so, where would this error come from?
Anonymous
April 19, 2009
How do you find the dump file???
Anonymous
November 16, 2009
Brilliant article to simplify what "Windows Internals" took at least 200 pages to describe. I think for those who want more details on these, they should take some time to read Windows Internals.