Considerations For Improving NDIS Driver Performance (Windows Embedded CE 6.0)
1/6/2010
Optimizing NDIS driver performance varies from device to device. Not all of the optimization techniques presented here will improve the performance of your driver. However, each of these techniques has improved performance in some scenarios.
Whenever you change your driver, re-evaluate driver performance. Although a driver change might not affect performance of the base driver, the change might result in other system changes that impair driver performance by creating new bottlenecks.
General Optimization Techniques
The following list shows general techniques for improving driver performance:
Avoid copying memory.
For incoming packets, use NdisMIndicateReceivePacket to indicate multiple packets received by the miniport.
Note
Do not set the status of the indicated packets to NDIS_STATUS_RESOURCES.
Note
The miniport driver should always pass ownership of the packets to NDIS, which improves performance.
Note
Set the NDIS_ATTRIBUTE_ALWAYS_GIVES_RX_PACKET_OWNERSHIP attribute when calling NdisMSetAttributesEx to inform NDIS that the miniport never indicates packets with a status of NDIS_STATUS_RESOURCES.
Implement MiniportSendPackets instead of MiniportSend.
This allows protocol layers to interact in batch mode.
Register the MiniportSendPackets handler through NdisMRegisterMiniport.
Note
MiniportSendPackets will be called with multiple NDIS_PACKET structures only in the gateway scenario, when the Layer 2 bridge module is involved. The TCPIP stack sends one packet at a time when using this API.
Create a local function to perform cache flushes.
This avoids the overhead of a kernel call for each of the frequent cache flushes.
Microsoft suggests that you review all instances of NdisMSleep or NdisStallExecution.
These functions block processing in the current thread for a specified interval, and using them might have adverse effects on system performance. NdisMSleep has millisecond granularity and allows other threads in the system to run. NdisStallExecution allows microsecond granularity, but spins the CPU while stalling and should therefore be used only where very short delays are required. Do not use NdisStallExecution to stall in excess of 50 microseconds.
Process all available data when your MiniportInterruptHandler is called.
Execute low-priority tasks as work items instead of as timer events.
Consider reducing the maximum number of packets processed for each interrupt, if your system has multiple miniports and seems to be suffering from high packet latency or miniport buffer overflow errors.
High packet latency can be identified by high round-trip time (RTT) in a UDP multi-packet, ping-pong test.
Be aware of protocols that place a miniport in promiscuous mode.
If two protocols are bound to a miniport in promiscuous mode, all packets sent on one protocol must be looped back to the other protocol, which degrades performance. To determine whether a protocol has placed a miniport in promiscuous mode, send OID_GEN_CURRENT_PACKET_FILTER to all miniports. If the returned filter includes NDIS_PACKET_TYPE_PROMISCUOUS, the miniport is in promiscuous mode.
Register a dedicated interrupt, if all adapters have dedicated interrupts.
Doing so will improve performance. To register a dedicated interrupt, remove the implementation of MiniportInterruptEnableHandler, and then call NdisMRegisterInterrupt with both RequestISR and SharedInterrupt set to FALSE. Disable the CPU interrupt in the interrupt service routine (ISR) for the hardware abstraction layer (HAL). Disable the adapter interrupt in the MiniportInterruptDisableHandler, and enable it at the end of MiniportInterruptHandler processing.
Register a ReceivePacketHandler with NDIS by using the NdisRegisterProtocol function.
Set the IMGNOSHAREETH environment variable to remove VMINI support from your image.
VMINI support can improve or degrade performance in actual network driver performance testing. Microsoft recommends that you not run performance tests over a VMINI network.
If you are running a closed system, such as a gateway, consider running your run-time image in kernel mode.
Note
For information about enabling and disabling full-kernel mode programmatically, see Memory Access Permissions.
Use a shared interrupt service thread (IST) when your hardware platform has multiple miniports.
Shared IST is transparent to the miniport driver, and is controlled by the HAL.
Set the following registry entries manually, as applicable:
The following registry entry causes NDIS to skip checking for packet loop back whenever a packet is sent.
[HKEY_LOCAL_MACHINE\Comm\NDIS\Parms]
"NeverLoopbackPackets"=dword:1
The following registry entry should be set to optimize return packet handling, if all miniports calling NdisMIndicateReceivePacket are deserialized.
[HKEY_LOCAL_MACHINE\Comm\NDIS\Parms]
"AllMiniportsDeserialized"=dword:1
The following registry entry controls receive path optimizations. This value is enabled by default, but it can be disabled by setting it to 0. This value is intended for devices whose drivers are not dynamically bound and unbound. If your driver is unbound from your adapter while this optimization is enabled, a system crash might result.
[HKEY_LOCAL_MACHINE\Comm\NDIS\Parms]
"OptimizeReceiveHandling"=dword:1
In order for the OptimizeReceiveHandling key to take effect, the following conditions must be met:
- The miniport adapter medium type must be 802_3.
- The miniport must be bound to a single protocol driver, such as TCP/IP. If it is bound to two or more protocols, such as TCP/IP and TCP/IPv6, then the optimization is disabled.
- The protocol driver must have a ProtocolReceivePacket handler.
- The miniport driver must have the NDIS_ATTRIBUTE_ALWAYS_GIVE_RX_PACKET_OWNERSHIP flag set when it calls NdisMSetAttributesEx. By setting this flag, the miniport driver declares that it will never set the packet status to NDIS_STATUS_RESOURCES when it indicates that packets have been received.
- The miniport must be deserialized.
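The conditions above can be read as a single all-or-nothing check. The following is a minimal sketch in plain C, not NDIS code: the `adapter_config_t` structure, its field names, and `receive_optimization_applies` are hypothetical and exist only to illustrate how the conditions combine. NDIS performs its own evaluation internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical medium classification; not an NDIS type. */
typedef enum { MEDIUM_802_3, MEDIUM_OTHER } medium_t;

/* Hypothetical snapshot of one adapter's configuration.
 * Field names are illustrative only. */
typedef struct {
    medium_t medium;                      /* adapter medium type */
    size_t   bound_protocol_count;        /* protocols bound to the miniport */
    bool     protocol_has_receive_packet; /* ProtocolReceivePacket registered */
    bool     always_gives_rx_ownership;   /* RX ownership attribute flag set */
    bool     deserialized;                /* miniport is deserialized */
    bool     registry_enabled;            /* OptimizeReceiveHandling is nonzero */
} adapter_config_t;

/* Returns true only when every listed condition holds. */
bool receive_optimization_applies(const adapter_config_t *cfg)
{
    assert(cfg != NULL);
    return cfg->registry_enabled
        && cfg->medium == MEDIUM_802_3
        && cfg->bound_protocol_count == 1
        && cfg->protocol_has_receive_packet
        && cfg->always_gives_rx_ownership
        && cfg->deserialized;
}
```

A checklist of this kind can be useful during bring-up: if receive throughput does not change after setting the registry value, walking through each field usually shows which condition is not met, for example a second protocol bound to the adapter.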
Awareness of these optimization concepts can help you improve network driver performance:
- A high hit rate in INTERRUPTS_ENABLE in a Monte Carlo profiling log often indicates that the miniport driver is set to work in a shared interrupt environment. If interrupts are shared, a high hit rate is expected. If interrupts are not shared, then register your interrupt as dedicated.
- In a gateway scenario with the shared interrupt service thread (IST) enabled, all routing work is done by one thread. Kernel Tracker is a graphical tool that provides a visual representation of a remote Windows Embedded CE system on a development workstation. CELog is the Windows Embedded CE event tracking engine. These tools can help you analyze driver performance. Specifically, while you use Kernel Tracker to view CELog data, if excessive thread switches are seen when data is being routed, it is important to understand why these excessive switches are occurring. The reason will vary from implementation to implementation.
- All MiniportInterruptHandlers and NdisMTimer events run on the same thread. MiniportSendPackets might also run on this thread. As a result, if the CPU stalls during any one of these events, this thread could stall, which would prevent other interrupts from being serviced or data from being sent.
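Because these handlers share one thread, the amount of work done per interrupt is a tuning knob: draining everything is efficient for a single adapter, while a per-interrupt cap keeps one busy adapter from starving the others. The following is a minimal sketch of such a cap in plain C. The `rx_ring_t` structure and `service_rx_ring` function are hypothetical stand-ins, not NDIS structures or handlers, and the packet processing is reduced to counter updates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical receive ring. 'pending' counts packets the adapter has
 * written that the driver has not yet indicated. Not an NDIS structure. */
typedef struct {
    size_t pending;
    size_t indicated_total;
} rx_ring_t;

/* Handle at most 'budget' packets in one handler invocation.
 * Returns the number handled. If ring->pending is still nonzero on
 * return, the caller is expected to arrange for another pass. */
size_t service_rx_ring(rx_ring_t *ring, size_t budget)
{
    size_t handled = 0;

    assert(ring != NULL);
    while (ring->pending > 0 && handled < budget) {
        ring->pending--;         /* stand-in for processing one packet */
        ring->indicated_total++;
        handled++;
    }
    return handled;
}
```

With a large budget the function processes all available data in one call. Lowering the budget trades some per-packet efficiency for lower latency on the other miniports, so the value is best chosen by measurement on the target system.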
Advanced Optimization Techniques
The following list shows optimization techniques that you can use in addition to the general optimization techniques listed earlier:
- Group data structures into the same page of memory to reduce translation look-aside buffer (TLB) misses.
- If the number of instructions that are executed in a loop exceeds ICacheSize divided by the number of bytes per instruction, then spread the execution into smaller loops.
For example, if the ICacheSize is 8 KB and the number of bytes per instruction is 4, then the number of instructions executed in a loop should not be more than 2,048.
- NdisMStartBufferPhysicalMapping is fairly time consuming because it uses LockPages. If you have a multiple miniport system, consider making all miniports aware of each other's physical-to-virtual address mappings. These mappings are established when the miniport calls NdisMAllocateSharedMemory. The physical-to-virtual address mappings can be used to indicate whether a cache flush for a given buffer is necessary.
- If your Windows Embedded CE-based device experiences excessive TLB misses and has software TLB miss handling, as on MIPS and SH processors, you might be able to reduce the number of misses by loading OS components into kernel memory. CELog (CELZONE_TLB, 0x8) can present an indication of high TLB misses.
For Windows Embedded CE-based devices that do not have software for handling TLB misses, check with the chip manufacturer about whether the microprocessor supports on-board measurement of TLB misses. If so, the manufacturer might have tools for viewing that data. For more information on how this data is gathered, see Event Tracking. For information about viewing the data, see Remote Kernel Tracker.
- An x86 hardware platform takes care of cache coherence for you. You do not need to explicitly flush cached memory. For potential performance improvement, consider caching the DMA buffer. Measure performance on all scenarios before and after making this change, because it could either improve or degrade performance. The result will depend on the scenario.
- Decide whether it is optimal to flush the cached buffer or copy the cached buffer to an uncached memory region for a DMA transfer. The best approach varies from system to system and depends on the size of the buffer and the cost of accessing uncached memory versus the cost of flushing the cached buffer. Try both methods and measure which is optimal for your system. Independent hardware vendors (IHVs) writing drivers that could be used on different hardware platforms should make this option configurable.
If you decide to implement cached buffer access with flushing, ensure that your Windows Embedded CE-based device has an efficient cache-flushing capability that allows only the specific range to be flushed.
- To examine the amount of CPU time spent in program, driver, OS, or network code, use CELog to stamp entry and exit events.
- When a target device has a small data cache, it might be efficient to use uncached memory addresses for buffers that are attached to a packet descriptor. The method for obtaining an uncached address for a cached address depends on your hardware. If a device is used for packet routing or bridging, code that sends the packet should check whether the address of the buffer is from an uncached memory region. If it is, you do not need to perform a cache flush.
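Two of the checks in the list above reduce to small helpers: the loop-size bound derived from the instruction cache size, and the test for whether a buffer lies in an uncached region so that a flush can be skipped. The following is a minimal sketch in plain C. The function names and the `uncached_region_t` structure are hypothetical, and the base address and length of the uncached region are assumptions that depend entirely on your hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Upper bound on instructions in one loop body so that the loop fits
 * in the instruction cache: ICacheSize / bytes per instruction. */
size_t max_loop_instructions(size_t icache_bytes, size_t bytes_per_instruction)
{
    assert(bytes_per_instruction > 0);
    return icache_bytes / bytes_per_instruction;
}

/* Hypothetical description of one uncached address window. */
typedef struct {
    uintptr_t base;   /* first address of the uncached window */
    size_t    length; /* size of the window in bytes */
} uncached_region_t;

/* Returns false only when the whole buffer [addr, addr + len) lies inside
 * the uncached window, in which case no cache flush is needed. */
bool buffer_needs_cache_flush(const uncached_region_t *region,
                              uintptr_t addr, size_t len)
{
    bool inside;

    assert(region != NULL);
    /* Written to avoid overflow when addr + len would wrap. */
    inside = addr >= region->base
          && len <= region->length
          && (addr - region->base) <= region->length - len;
    return !inside;
}
```

For the figures used earlier, `max_loop_instructions(8192, 4)` yields 2048. In a routing or bridging path, a send routine could call a check like `buffer_needs_cache_flush` before flushing, so that buffers another miniport allocated from uncached memory are passed through without the extra cost.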
See Also
Concepts
Improving Performance of an NDIS Miniport Driver
Performance Improvements for an NDIS Miniport Driver