Mainstream NUMA and the TCP/IP stack: Final Thoughts

This is a continuation of Part IV of this article, posted here.

Note that the final version of a white paper tying this series of five blog entries together, along with a PowerPoint presentation on the subject, is attached.

For many years, the effort to improve network performance on Windows and other platforms focused on reducing the host processing associated with servicing frequent interrupts from the NIC. In the many-core era, where processor clock speeds are constrained by power considerations, that strategy alone cannot keep pace with the growing host processing requirements that accompany high-speed networking. Technologies like interrupt moderation and the TCP Offload Engine, which improve the efficiency of network I/O, need to be augmented with an approach that allows TCP/IP Receive packets to be processed in parallel across multiple CPUs. Together, MSI-X and RSS are technologies that enable host processing of TCP/IP packets to scale in the many-core world, albeit not without some compromises to the prevailing model of networking built from isolated, layered components.

Using MSI-X and RSS, for example, the Intel 82598 10 Gigabit Ethernet Controller mentioned earlier can be mapped to a maximum of 16 processor cores, which can then be devoted to network I/O interrupt handling. That is still not enough processing capacity to handle the theoretical maximum load that equation 3 predicts for a 10 Gb Ethernet card, but it does represent a substantial scalability improvement.

With this understanding of what MSI-X and RSS accomplish, let’s return for a moment to our NUMA server machine shown in Figure 6 below.

Figure 6. NUMA server with multiple RSS queues

With MSI-X and Receive-Side Scaling, CPU 0 on node A and CPU 1 on node B are both enabled for processing network interrupts. Since RSS schedules the NDIS DPC to run on the same processor as the ISR, even at moderate networking loads CPUs 0 and 1 become, for all practical purposes, dedicated to processing high-priority networking interrupts.
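To make the ISR-to-DPC affinity idea concrete, here is a minimal WDM-style sketch. It is not how an NDIS 6 miniport is actually structured (NDIS manages the receive DPCs on the driver's behalf), but it illustrates the mechanism RSS relies on: the deferred processing is queued on the same processor that fielded the MSI-X interrupt.

    /* Illustrative sketch only -- not the actual NDIS/RSS implementation.
     * The ISR targets its DPC at the CPU that took the interrupt, so the
     * protocol-stack work stays on the same CPU -- and the same NUMA node
     * and cache hierarchy -- as the interrupt itself. */
    #include <wdm.h>

    typedef struct _ADAPTER_CONTEXT {
        KDPC RxDpc;                 /* DPC that drains the receive queue */
        /* ... receive descriptors, statistics, etc. ... */
    } ADAPTER_CONTEXT, *PADAPTER_CONTEXT;

    BOOLEAN
    AdapterIsr(
        _In_ PKINTERRUPT Interrupt,
        _In_ PVOID ServiceContext
        )
    {
        PADAPTER_CONTEXT adapter = (PADAPTER_CONTEXT)ServiceContext;

        UNREFERENCED_PARAMETER(Interrupt);

        /* Queue the DPC on the CPU that fielded this MSI-X interrupt.
         * With RSS, the NIC's hash and indirection table already chose
         * that CPU, so every packet for a given connection lands here. */
        KeSetTargetProcessorDpc(&adapter->RxDpc,
                                (CCHAR)KeGetCurrentProcessorNumber());
        KeInsertQueueDpc(&adapter->RxDpc, NULL, NULL);

        return TRUE;                /* the interrupt was ours */
    }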

Numerous economies of scale accrue from this approach. The same RSS mechanism that directs all Receive packets from a single TCP connection to a specific CPU also improves the efficiency of that processing. The instruction execution rate of the TCP/IP protocol stack improves significantly because this scheduling enforces locality: recently used data and instructions that networking ISRs and DPCs touch tend to remain in the dedicated cache (or caches) associated with the processor devoted to network I/O, or, at the very least, they migrate to the last-level cache shared by all the processors on the same NUMA node. Ultimately, TCP/IP application data buffers also need to be allocated from local node memory and processed by threads confined to that node.
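For the buffer side of that equation, Windows exposes NUMA-aware allocation directly to applications. The following user-mode sketch (the helper name AllocLocalNodeBuffer is mine, and it assumes Windows Vista/Server 2008 or later) allocates a buffer whose physical pages are preferentially taken from the node the calling thread is currently running on:

    #include <windows.h>
    #include <stdio.h>

    /* Allocate a buffer backed, where possible, by memory on the NUMA
     * node that the calling thread is currently executing on. */
    void *AllocLocalNodeBuffer(SIZE_T bytes)
    {
        UCHAR node = 0;

        /* Which node owns the processor we are running on right now? */
        if (!GetNumaProcessorNode((UCHAR)GetCurrentProcessorNumber(), &node))
            node = 0;               /* fall back to node 0 */

        /* nndPreferred tells the memory manager which node to draw
         * physical pages from when it commits this region. */
        return VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                                  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                  node);
    }

    int main(void)
    {
        void *buf = AllocLocalNodeBuffer(64 * 1024);
        printf("local-node buffer at %p\n", buf);
        if (buf != NULL)
            VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }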

Ultimately, of course, the TCP layer hands data from the network I/O to an application layer that is ready to receive and process it. The implications of RSS for the application threads that process TCP Receive packets and build responses for TCP/IP to send back to network clients ought to be obvious, but I will spell them out anyway. For optimal performance, these application threads also need to be directed to run on the same NUMA node where the TCP Receive packet was processed. This localization of the application’s threads should, of course, be subject to other load-balancing considerations, so that the ideal node does not become severely over-committed while CPUs on other nodes sit idle or under-utilized. The performance penalty for an application thread that must run on a different node from the one that processed the original TCP/IP Receive packet is considerable, because it must access the data payload of the request remotely. Networked applications need to understand these performance and capacity considerations and schedule their threads accordingly, balancing the work across NUMA nodes.

Consider the ASP.NET application threads that process incoming HTTP Requests and generate HTTP Response messages. If the HTTP Request packet is processed by CPU 0 on node A in a NUMA machine, the Request packet payload is allocated in node A local memory. The ASP.NET application thread running in User mode that processes that incoming HTTP Request will run much more efficiently if it is scheduled to run on one of the other processors on node A, where it can access the payload and build the Response message using local node memory.
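In native code, the Win32 NUMA APIs make that kind of placement straightforward. The sketch below (RunOnNode is a hypothetical helper; an ASP.NET application, being managed code, would need the host or a native component to do something like this on its behalf) confines the current thread to the processors of a given node so that it touches the request payload and builds the Response in local memory:

    #include <windows.h>

    /* Restrict the calling thread to the CPUs of one NUMA node. */
    BOOL RunOnNode(UCHAR node)
    {
        ULONGLONG nodeMask = 0;

        /* Translate the node number into its processor affinity mask. */
        if (!GetNumaNodeProcessorMask(node, &nodeMask) || nodeMask == 0)
            return FALSE;

        /* From here on, the thread runs only on that node's processors,
         * so its references to the request payload stay node-local. */
        return SetThreadAffinityMask(GetCurrentThread(),
                                     (DWORD_PTR)nodeMask) != 0;
    }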

There is currently no mechanism in Windows for kernel-mode drivers like ndis.sys and http.sys to communicate to the application layers above them the NUMA node on which a packet was originally processed. Communicating that information to the application layer is another grievous violation of the principle of isolation in the network protocol stack, but it is a necessary step to improve the performance of networking applications in the many-core era, where even moderately sized server machines have NUMA characteristics.

Herb Sutter, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.” Dr. Dobb’s Journal, March 1, 2005. https://www.ddj.com/architect/184405990 

NTttcp performance testing tool: https://www.microsoft.com/whdc/device/network/TCP_tool.mspx

Windows Performance Toolkit (WPT, aka xperf): https://msdn.microsoft.com/en-us/library/cc305218.aspx

David Kanter, “The Common System Interface: Intel's Future Interconnect,” https://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&mode=print  

Windows NUMA support: https://msdn.microsoft.com/en-us/library/aa363804.aspx

Intel white paper: Accelerating High-Speed Networking with Intel® I/O Acceleration Technology

Mark B. Friedman, “An Introduction to SAN Capacity Planning,” Proceedings, Computer Measurement Group, Dec. 2001.

Jeffrey Mogul, “TCP offload is a dumb idea whose time has come,” Proceedings of the 9th Conference on Hot Topics in Operating Systems (HotOS IX), 2003. https://portal.acm.org/citation.cfm?id=1251059&dl=ACM&coll=portal&CFID=71988909&CFTOKEN=98964748

Dell Computer Corporation, “Boosting Data Transfer with TCP Offload Engine Technology.”

Microsoft Corporation, KB 951037, https://support.microsoft.com/kb/951037

Microsoft Corporation, Windows Driver Development Kit (DDK) documentation, https://msdn.microsoft.com/en-us/library/cc264906.aspx

Microsoft Corporation, KB 927168, https://support.microsoft.com/kb/927168

Microsoft Corporation, NDIS 6.0 Receive-Side Scaling documentation, https://msdn.microsoft.com/en-us/library/ms795609.aspx

https://cid-12a53f90793d2c8b.skydrive.live.com/self.aspx/DDPEBlogImages/Presentations%20and%20Papers/Mainstream%20NUMA%20and%20the%20TCP%20|5CMG%20paper%208220%20draft|6.docx

Comments


  • Anonymous
    September 22, 2008
    Where's the white paper link?  I see the presentation link, but not the promised white paper.

  • Anonymous
    September 22, 2008
    Hmmm. For some reason, it looks as if only one attachment is permitted. You should see the white paper now. The powerpoint presentation is available separately at http://cid-12a53f90793d2c8b.skydrive.live.com/self.aspx/DDPEBlogImages/Presentations%20and%20Papers/Mainstream%20NUMA%20and%20the%20TCP.pptx, which you should be able to copy and paste into your browser.

  • Anonymous
    February 17, 2009
    A question arose reading your final thoughts. How about the kernel drivers taking the process/thread affinity into consideration when scheduling packet processing? I think the kernel driver can find the NUMA node a process is attached to: http://code.msdn.microsoft.com/64plusLP. Moreover, a process can even dynamically switch nodes, so should packet processing follow? Great article!!!

  • Anonymous
    February 17, 2009
    I envision something like that, too. One important constraint is that if you change the node affinity of the processing thread, you have to think about relocating the data the thread is processing at the same time. This applies to both User-mode and kernel threads. In SQL Server, for example, if there is a good deal of data associated with a request, there is substantial reluctance to change the node affinity of the thread processing that request. Without relocating the data, scheduling the request to run on a node different from the one where its associated data is located is counter-productive. This suggests that for some primarily read-only data structures it will be worthwhile -- if memory is available -- to create per-node replicas.
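    To make the per-node replica idea a bit more concrete, here is a minimal user-mode sketch (the names ReplicateTable and LocalTable are mine, purely for illustration). Each node gets its own copy of a read-only table, and a thread simply reads from the copy on whatever node it happens to be running on:

    #include <windows.h>
    #include <string.h>

    #define MAX_NODES 64

    static void *g_replica[MAX_NODES];  /* one copy of the table per node */

    /* Build a replica of a read-only table on every NUMA node. */
    BOOL ReplicateTable(const void *table, SIZE_T bytes)
    {
        ULONG highestNode = 0;
        UCHAR node;

        if (!GetNumaHighestNodeNumber(&highestNode) || highestNode >= MAX_NODES)
            return FALSE;

        for (node = 0; node <= (UCHAR)highestNode; node++) {
            void *copy = VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                                            MEM_RESERVE | MEM_COMMIT,
                                            PAGE_READWRITE, node);
            if (copy == NULL)
                return FALSE;
            memcpy(copy, table, bytes); /* populate the node-local copy */
            g_replica[node] = copy;
        }
        return TRUE;
    }

    /* Return the replica that is local to the caller's current node. */
    const void *LocalTable(void)
    {
        UCHAR node = 0;

        GetNumaProcessorNode((UCHAR)GetCurrentProcessorNumber(), &node);
        return g_replica[node];
    }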