Introduction to NUMA
by Bob Golding
The introduction of NUMA (Non-Uniform Memory Access) required changes in memory management. Because accessing memory that is not local to a node can mean the access goes over an interconnect such as fibre, the memory manager tries to allocate memory locally to avoid the performance cost of remote access. This article describes how the mechanism works in Windows Server 2008.
What is NUMA?
The NUMA architecture is basically a small number of processor groups, each having its own memory and possibly its own I/O channels. Each processor group can access another group's memory without worrying about coherency. Each group of CPUs is called a 'node'. Memory that is local to a node is called local or near memory; memory outside of a node is called foreign or far memory. Local memory is on the same node as the group of processors, although we do support configurations where some memory nodes have no local CPUs. Far memory is memory that is local to other nodes; for a remote node to access it, the request may have to go over an interconnect such as fibre. Because this is more expensive, the OS tracks which node each physical page resides on and uses this information to allocate memory optimally.
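Although the rest of this article looks at the kernel's view, the node layout can also be observed from user mode. Below is a small sketch using the documented Win32 NUMA APIs; the program itself is only illustrative and is not part of the mechanism described here:

#include <windows.h>
#include <stdio.h>

/* Sketch: enumerate the NUMA nodes and show which node each processor
   belongs to, using documented kernel32 APIs. */
int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Highest NUMA node: %lu\n", highestNode);

    SYSTEM_INFO si;
    GetSystemInfo(&si);
    for (DWORD cpu = 0; cpu < si.dwNumberOfProcessors; cpu++) {
        UCHAR node = 0;
        if (GetNumaProcessorNode((UCHAR)cpu, &node)) {
            printf("CPU %2lu -> node %u\n", cpu, (unsigned)node);
        }
    }
    return 0;
}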
How is memory tracked?
In versions prior to Windows XP, each memory page had its own 'color', which describes its cache locality. When a memory page was on the Free or Zeroed list, it was also on a color list. This mechanism was enhanced so that the color also encodes the processor node number, and each node has its own set of colors. When the system is initialized, the memory manager calls a function called HalpNumaQueryPageToNode to get the node number for a physical address.
How is the memory organized?
There are two lists; one is the Zeroed Memory list and the other is the Free Memory list. For example:
nt!MmFreePagesByColor = struct _MMCOLOR_TABLES *[2]
This symbol is an array of two pointers to the Zeroed and Free color lists. The first address in the array is a pointer to the Zeroed color list; the other points to the Free color list. Each entry looks like this:
nt!_MMCOLOR_TABLES
+0x000 Flink : Uint8B <<-- Page #
+0x008 Blink : Ptr64 Void <<-- PFN address
+0x010 Count : Uint8B
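Expressed as a C declaration, the layout above looks roughly like this. It is a reconstruction from the dt output for illustration, not the Windows source definition:

/* Reconstruction of the debugger output above; offsets and sizes are
   taken from the dt display, not from Windows source. */
typedef struct _MMCOLOR_TABLES {
    unsigned long long Flink;   /* +0x000: page number ("Page #") heading this color list */
    void              *Blink;   /* +0x008: PFN database address ("PFN address") for the list */
    unsigned long long Count;   /* +0x010: number of pages on this color list */
} MMCOLOR_TABLES;               /* sizeof == 0x18, the entry size used in the offset math later */

/* nt!MmFreePagesByColor: element 0 points to the Zeroed color lists,
   element 1 to the Free color lists. */
MMCOLOR_TABLES *MmFreePagesByColor[2];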
To find out how many color tables there are, you need to look at a couple of other locations. The example below is from a system with 4 nodes:
nt!MmSecondaryColorNodeShift = 0x3
nt!MmSecondaryColorMask = 7
MmSecondaryColorNodeShift is the number of bits the node number is shifted left to get the index of the first color for that node. The other, MmSecondaryColorMask, is the mask applied to the page number to get the color within the node. The masked page number is OR'd with the shifted node number to produce the index into the color table.
Can I have an example?
OK, for example, let's take page 86152d, which has been assigned to Node 1:
14: kd> !pfn 86152d
PFN 0086152D at address FFFFFA801923F870
flink 0086152C blink / share count 0086152E pteaddress FFFFF6FC0430A968
reference count 0000 used entry count 0000 Cached color 1 Priority 0
restore pte 00861525 containing page FFFFFFFFFFFFF Zeroed
When the page is on a color list, the restore PTE field is reused as the forward link (861525) and the containing page field as the back link (here -1, which marks the end of the list).
So, to get the byte offset into the color table, use ((1 << 3) | (0x7 & 0x86152d)) * 0x18 (0x18 is the size of each color table entry):
14: kd> dq fffffa80`317fffd0+(18*d) <<-- fffffa80`317fffd0 is the start of the Zeroed color table
fffffa80`31800108 00000000`0086152d fffffa80`18bd5670
fffffa80`31800118 00000000`0000445c
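The same arithmetic can be written out as a small C program. This is a sketch for illustration; the macro names simply re-spell the debugger symbols shown earlier. It reproduces the index and byte-offset calculation for page 0x86152d on node 1:

#include <stdio.h>
#include <stdint.h>

#define MM_SECONDARY_COLOR_NODE_SHIFT 3      /* nt!MmSecondaryColorNodeShift */
#define MM_SECONDARY_COLOR_MASK       7      /* nt!MmSecondaryColorMask */
#define MMCOLOR_TABLES_SIZE           0x18   /* size of one _MMCOLOR_TABLES entry */

int main(void)
{
    uint64_t pageNumber = 0x86152d;          /* the page from the example */
    uint64_t node       = 1;                 /* the node the page was assigned to */

    /* index = (node << shift) | (page & mask) */
    uint64_t colorIndex = (node << MM_SECONDARY_COLOR_NODE_SHIFT) |
                          (pageNumber & MM_SECONDARY_COLOR_MASK);

    /* byte offset from the start of the Zeroed color table */
    uint64_t byteOffset = colorIndex * MMCOLOR_TABLES_SIZE;

    printf("color index = 0x%llx, byte offset = 0x%llx\n",
           (unsigned long long)colorIndex,    /* prints 0xd */
           (unsigned long long)byteOffset);   /* prints 0x138 */
    return 0;
}

Adding the resulting 0x138 byte offset to the start of the Zeroed color table (fffffa80`317fffd0) gives fffffa80`31800108, the entry shown in the dq output above.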
Are there any debugger extensions that will help with this?
To get a display of which memory belongs to which node, use !numa_hal:
14: kd> !numa_hal
HAL NUMA Summary
----------------
Node Count : 4
Processor Count : 16
Node ProximityId
------------------
0x00 0x00000000
0x01 0x00000001
0x02 0x00000002
0x03 0x00000003
Proc Domain APIC Id
---------------------------
0x00 0x00000000 0x00000000
0x01 0x00000000 0x00000001
0x02 0x00000000 0x00000002
0x03 0x00000000 0x00000003
0x04 0x00000001 0x00000004
0x05 0x00000001 0x00000005
0x06 0x00000001 0x00000006
0x07 0x00000001 0x00000007
0x08 0x00000002 0x00000008
0x09 0x00000002 0x00000009
0x0A 0x00000002 0x0000000A
0x0B 0x00000002 0x0000000B
0x0C 0x00000003 0x0000000C
0x0D 0x00000003 0x0000000D
0x0E 0x00000003 0x0000000E
0x0F 0x00000003 0x0000000F
Domain Range
-----------------
0x00000000 0x0000000000000000 -> 0x0000000480000000
0x00000001 0x0000000480000000 -> 0x0000000880000000
0x00000002 0x0000000880000000 -> 0x0000000C80000000
0x00000003 0x0000000C80000000 -> 0xFFFFFFFFFFFFFFFF
As you can see from the output above, memory is assigned to the nodes linearly. What kind of problem do you think MmAllocatePagesForMdlEx would cause if the highest acceptable address was fffff000 and it ran on CPU 9? What if a number of such requests ran on all nodes except node 0?
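For concreteness, here is a sketch of the kind of kernel-mode call that would hit this situation. The helper name and surrounding code are hypothetical; only MmAllocatePagesForMdlEx itself comes from the question above:

#include <ntddk.h>

/* Hypothetical helper: ask for pages below ~4 GB, as in the question above.
   On the machine shown, only node 0 has physical memory in that range;
   a caller running on CPU 9 is on node 2, whose memory range does not
   overlap this limit. */
PMDL AllocateLowMemoryMdl(SIZE_T TotalBytes)
{
    PHYSICAL_ADDRESS low, high, skip;

    low.QuadPart  = 0;
    high.QuadPart = 0xFFFFF000;   /* highest acceptable physical address */
    skip.QuadPart = 0;

    return MmAllocatePagesForMdlEx(low, high, skip, TotalBytes,
                                   MmCached, 0 /* Flags */);
}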
Epilog
The scenario in the question above actually happened. The answer is that the machine would 'pause' for a period of time as it futilely searched node 2's memory for pages to satisfy the request before eventually searching the other nodes. That is the issue we worked on that led to this research. I hope this gives you a better understanding of NUMA and memory management.
Bob Golding has been with Microsoft since 1997. He is a Senior Escalation Engineer on the Global Escalation Services team where he supports Microsoft's largest customers with their most critical issues. Bob can be reached at rgolding@microsoft.com. For more information about debugging Windows, visit https://blogs.msdn.com/ntdebugging.