A case study on a OutOfMemoryException crash investigation
Recently, I encountered an OOM issue in our load runs. It took me a while to repro it and created a full dump. But once we got the dump, it was pretty straightforward to find the "leak".
1. Is this a native memory leak or .NET memory leak?
0:072> !address -summary
-------------------- Usage SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Pct(Busy) Usage
390e8000 ( 934816) : 44.58% 65.56% : RegionUsageIsVAD
28f79000 ( 671204) : 32.01% 00.00% : RegionUsageFree
9c2f000 ( 159932) : 07.63% 11.22% : RegionUsageImage
30cd000 ( 49972) : 02.38% 03.50% : RegionUsageStack
49000 ( 292) : 00.01% 00.02% : RegionUsageTeb
11246000 ( 280856) : 13.39% 19.70% : RegionUsageHeap
0 ( 0) : 00.00% 00.00% : RegionUsagePageHeap
1000 ( 4) : 00.00% 00.00% : RegionUsagePeb
1000 ( 4) : 00.00% 00.00% : RegionUsageProcessParametrs
2000 ( 8) : 00.00% 00.00% : RegionUsageEnvironmentBlock
Tot: 7fff0000 (2097088 KB) Busy: 57077000 (1425884 KB)
-------------------- Type SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Usage
28f79000 ( 671204) : 32.01% : <free>
9eb7000 ( 162524) : 07.75% : MEM_IMAGE
1c9c000 ( 29296) : 01.40% : MEM_MAPPED
4b524000 ( 1234064) : 58.85% : MEM_PRIVATE
-------------------- State SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Usage
407c5000 ( 1056532) : 50.38% : MEM_COMMIT
28f79000 ( 671204) : 32.01% : MEM_FREE
168b2000 ( 369352) : 17.61% : MEM_RESERVE
RegionUsageIsVAD is big and VirtualAlloc goes here (e.g., .NET heap).
0:072> !EEHeap -gc
Number of GC Heaps: 4
------------------------------
...
------------------------------
GC Heap Size: Size: 0x2611db9c (638704540) bytes.
Ok, .NET memory usage is pretty high at 630 MB.
2. Which .NET objects are we leaking
0:072> !dumpheap -stat
------------------------------
Heap 0
total 0 objects
------------------------------
Heap 1
total 0 objects
------------------------------
Heap 2
total 0 objects
------------------------------
Heap 3
total 0 objects
------------------------------
total 0 objects
Statistics:
MT Count TotalSize Class Name
7aef7c48 1 12 System.Collections.Generic.SortedList`2+KeyList[[System.String, mscorlib],[System.Object, mscorlib]]
...
79b57ee8 84541 117150884 System.Object[]
1d6af048 15013412 420375536 Microsoft.VisualStudio.TestTools.Common.TimerResult
Ok, there are 15 millions of Microsoft.VisualStudio.TestTools.Common.TimerResult objects. !gcroot confirmed they are rooted.
3. Where are we leaking it?
Use "!dumpheap -mt" and "!do" to examine a few of them:
0:072> !dumpheap -mt 1d6af048
------------------------------
Heap 0
Address MT Size
009aa8e8 1d6af048 28
009aabf8 1d6af048 28
...
0133f0e8 1d6af048 28
0133f104 1d6af048 28
072> !dumpobj 0133f0e8
Name: Microsoft.VisualStudio.TestTools.Common.TimerResult
MethodTable: 1d6af048
EEClass: 1e969b88
Size: 28(0x1c) bytes
File: D:\WINDOWS\assembly\GAC_MSIL\Microsoft.VisualStudio.QualityTools.Common\10.0.0.0__b03f5f7f11d50a3a\Microsoft.VisualStudio.QualityTools.Common.dll
Fields:
MT Field Offset Type VT Attr Value Name
79b9d470 40003d2 4 System.String 0 instance 08aa3eb4 m_name
79ba5af0 40003d3 8 System.DateTime 1 instance 0133f0f0 m_startTime
79ba4a94 40003d4 10 System.TimeSpan 1 instance 0133f0f8 m_duration
0:072> !dumpobj -nofields 08aa3eb4
Name: System.String
MethodTable: 79b9d470
EEClass: 798d5ec0
Size: 58(0x3a) bytes
File: D:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String: _TWI_GetRandomWorkItem
Almost all TimerResult have the same name " _TWI_GetRandomWorkItem". And this is enough for us to track down the "leak". It turns out that in our test code, we have a poorly written algorithm to get a random sequence of items from a list. And it might leads to a "infinite" loop and keeps calling an method that calls TestContext.BeginTimer/TestContext.EndTimer. So each call will create a new Timer Result object. This will eventually cause the app to crash due to OutOfMemoryException.
We can also drill down further on these 15 million objects. Are they stored in a few collections? Yes, 4 big object[] are holding about 11 million of them.
0:022> !dumpheap -mt 79b57ee8 -min 1000000
------------------------------
Heap 0
Address MT Size
11863098 79b57ee8 16777232
total 0 objects
------------------------------
Heap 1
Address MT Size
129d3cf8 79b57ee8 30433544
58fc0038 79b57ee8 33554448
total 0 objects
------------------------------
Heap 2
Address MT Size
54420038 79b57ee8 33554448
total 0 objects
------------------------------
Heap 3
Address MT Size
total 0 objects
------------------------------
total 0 objects
Statistics:
MT Count TotalSize Class Name
79b57ee8 4 114319672 System.Object[]
Total 4 objects
Spot check some entries in these huge object[] and verified that they are the same TimerResult objects with the same name _TWI_GetRandomWorkItem:
0:022> !dumparray -start 10000 -length 2 -details 11863098
Name: Microsoft.VisualStudio.TestTools.Common.TimerResult[]
MethodTable: 79b57ee8
EEClass: 798d69a8
Size: 16777232(0x1000010) bytes
Array: Rank 1, Number of elements 4194304, Type CLASS
Element Methodtable: 1d6af048
[10000] 0c7fab28
Name: Microsoft.VisualStudio.TestTools.Common.TimerResult
MethodTable: 1d6af048
EEClass: 1e969b88
Size: 28(0x1c) bytes
File: D:\WINDOWS\assembly\GAC_MSIL\Microsoft.VisualStudio.QualityTools.Common\10.0.0.0__b03f5f7f11d50a3a\Microsoft.VisualStudio.QualityTools.Common.dll
Fields:
MT Field Offset Type VT Attr Value Name
79b9d470 40003d2 4 System.String 0 instance 08aa3eb4 m_name
79ba5af0 40003d3 8 System.DateTime 1 instance 0c7fab30 m_startTime
79ba4a94 40003d4 10 System.TimeSpan 1 instance 0c7fab38 m_duration
[10001] 0c7fab44
Name: Microsoft.VisualStudio.TestTools.Common.TimerResult
MethodTable: 1d6af048
EEClass: 1e969b88
Size: 28(0x1c) bytes
File: D:\WINDOWS\assembly\GAC_MSIL\Microsoft.VisualStudio.QualityTools.Common\10.0.0.0__b03f5f7f11d50a3a\Microsoft.VisualStudio.QualityTools.Common.dll
Fields:
MT Field Offset Type VT Attr Value Name
79b9d470 40003d2 4 System.String 0 instance 08aa3eb4 m_name
79ba5af0 40003d3 8 System.DateTime 1 instance 0c7fab4c m_startTime
79ba4a94 40003d4 10 System.TimeSpan 1 instance 0c7fab54 m_duration
0:022> !do -nofields 08aa3eb4
Name: System.String
MethodTable: 79b9d470
EEClass: 798d5ec0
Size: 58(0x3a) bytes
File: D:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String: _TWI_GetRandomWorkItem
Anyway, we fixed the issue by switching to a better algorithm of picking a random sequence of items from a collection.
Have a good day.
Jiange