DeviceEmulator V2 - how did we get a 40% improvement in performance?

The DeviceEmulator V2 is significantly faster than the V1 emulator you're used to. 

Most of the performance wins come from a small set of optimizations in the ARM-to-x86 JIT and the MMU emulator.  These wins improve raw execution of ARM instructions, so all applications and OSes benefit...

1. Faster Translation Lookaside Buffer (TLB) implementation

TLBs in hardware are on-chip caches of the results of page table walks.  When the processor maps a virtual address to a physical one, it must consult the page table, which is maintained by the OS, and is located in RAM.  RAM accesses are expensive, so the Memory Management Unit (MMU) keeps a cache of page table entries (PTEs) on-chip.

The DeviceEmulator emulates the ARM MMU.  We do it not for the performance enhancement of avoiding RAM accesses, but because the Windows CE operating system has a design assumption that mandates the use of TLBs.  This shows up if an LDM or STM instruction crosses a page boundary - the CE kernel must handle up to two page faults, and handling the second may evict the first PTE from the pagetable.  The TLB entry cached from the first page table lookup is what saves the day.

In the emulator, the jitted code for every memory instruction (LDR, LDM, STR, STM, etc.) has an associated "TLB Index Cache".  This cache remembers which TLB slot the instruction last accessed, on the assumption that most instructions have pretty good locality and tend to touch the same page many times in a row.

In V1, if we took a cache miss (either because the emulator evicted a TLB entry from the cache or because the instruction is accessing a new page), the MMU emulator used a simple linear search of our TLB list to look for a match.  If it failed to find a match, it walked the pagetable memory to find the right PTE.  The linear search has pretty good performance most of the time - it searches at most 64 TLB slots, and often finds a hit with only 1 or 2 iterations.  But the worst case is to access all 64 slots.

In V2, we changed from a linear search to a hashtable.  The "TLB Index Cache" is still used, because it is much cheaper than computing the hash function.  But if the cached entry isn't a match, then we hash the virtual address and look in exactly one alternative TLB slot for a match.

So the net effect is that memory loads and stores that do not have great locality of reference are now much faster, and loads and stores with great locality of reference remain equally fast.
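
To make the lookup order concrete, here's a rough C sketch of the flow described above.  The 64-slot array and the one-byte per-instruction index cache come straight from the description; the type names, the hash function, and WalkPageTableAndFill() are stand-ins I've made up for illustration, not the emulator's actual code.

   #include <stdint.h>

   #define TLB_SLOTS  64
   #define PAGE_SHIFT 12

   typedef struct {
      uint32_t VirtualPage;    /* guest virtual page number this slot maps */
      uint8_t *HostPageBase;   /* host pointer to the start of the mapped page */
      int      Valid;
   } TlbEntry;

   static TlbEntry g_Tlb[TLB_SLOTS];

   /* Hypothetical slow path: walk the emulated page table, refill 'slot',
      update the instruction's index cache, and return a host pointer. */
   extern uint8_t *WalkPageTableAndFill(uint32_t va, uint32_t slot, uint8_t *indexCache);

   uint8_t *MapGuestAddress(uint32_t va, uint8_t *indexCache)
   {
      uint32_t page = va >> PAGE_SHIFT;

      /* 1. Per-instruction "TLB Index Cache": the slot this load/store hit last time. */
      TlbEntry *e = &g_Tlb[*indexCache];
      if (e->Valid && e->VirtualPage == page)
         return e->HostPageBase + (va & ((1u << PAGE_SHIFT) - 1));

      /* 2. V2: hash the virtual page and probe exactly one alternative slot.
            (V1 did a linear scan of up to all 64 slots here.) */
      uint32_t slot = page % TLB_SLOTS;    /* illustrative hash function */
      e = &g_Tlb[slot];
      if (e->Valid && e->VirtualPage == page) {
         *indexCache = (uint8_t)slot;      /* remember the slot for next time */
         return e->HostPageBase + (va & ((1u << PAGE_SHIFT) - 1));
      }

      /* 3. Full miss: fall back to the emulated page table walk. */
      return WalkPageTableAndFill(va, slot, indexCache);
   }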

2. Reduce x86 processor stalls due to mixed code and data

Those "TLB Index Cache" values from #1 proved interesting. The cached value is a byte (actually, a 6-bit index into the array of 64 TLB slots). In V1, these data values were intermixed with executable x86 code in the JIT cache. They were located after unconditional JMP instructions, to ensure that the x86 processor never tried to prefetch or speculate into one.

However, we discovered that when we wrote back into those cache bytes after taking a TLB miss, the processor stalled and execution resumed very slowly.

In V2, we relocated those "TLB Index Cache" bytes to a separate memory region, keeping the jit cache pure executable code, and the stall went away. My guess is that when the MMU emulator wrote back to the data value, the cache line containing that data value was present inside the x86's instruction cache. The data write to the cache line forced the processor to sync its I-Cache and D-Cache and evict the cache line from the I-Cache. That's where the expense comes from.
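
To picture the layout change, here's a rough before/after sketch; the constant and the array name are invented for illustration, not taken from the emulator's source.

   #include <stdint.h>

   /* V1 (sketch): each "TLB Index Cache" byte was emitted into the JIT cache
      itself, tucked behind an unconditional JMP so it could never execute:

         jmp  skip               ; jitted x86 code
         db   tlb_index          ; one data byte inside the code stream
      skip:
         ...jitted load/store code...

      V2 (sketch): the bytes live in a separate, data-only region and the
      jitted code addresses them there, so the JIT cache holds nothing but
      instructions and the MMU emulator's writes never hit an I-Cache line. */

   #define MAX_MEMORY_OPS 65536   /* invented bound on jitted load/store sites */
   static uint8_t g_TlbIndexCache[MAX_MEMORY_OPS];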

3. More efficient interrupt polling

While the emulator is executing ARM code jitted to x86, it must periodically stop and check whether an ARM hardware interrupt must be simulated or not. In V1, we used a simple check:

   if (g_fInterruptPending) {
      SimulateHardwareInterrupt();
   }

This check is performed periodically, at boundaries between ARM instructions.

In V2, we changed the check to a much simpler one:

   void SimulateHardwareInterruptIfPending(void)
   { /* note: no code here */
   }

   ...

   SimulateHardwareInterruptIfPending();

The SimulateHardwareInterruptIfPending() function is a no-op by default. However, if the emulator determines that a hardware interrupt is pending, it patches the contents of the SimulateHardwareInterruptIfPending() function, replacing them with the code that implements the simulation of the hardware interrupt.

So in V2, there is no MOV/CMP/Branch overhead for interrupt polling... just a CALL to a RET instruction. Only if an interrupt is pending is the RET instruction patched to become the first byte of the SimulateHardwareInterrupt() code. No muss. No fuss. :-)
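
For the curious, here's one way this kind of patch can be done on Windows/x86. This is my own illustrative sketch, not the emulator's code: it redirects the empty function with a JMP to the real handler rather than copying the handler body in place, but the mechanics (unprotect, write, flush the instruction cache) are the same idea, and PatchInterruptPoll() is an invented name.

   #include <windows.h>
   #include <stdint.h>

   void SimulateHardwareInterrupt(void)            /* the real handler */
   {
      /* ...simulate the ARM hardware interrupt... */
   }

   void SimulateHardwareInterruptIfPending(void)   /* starts life as a no-op */
   {
   }

   static void PatchInterruptPoll(void)
   {
      uint8_t *target = (uint8_t *)SimulateHardwareInterruptIfPending;
      DWORD oldProtect;

      /* Overwrite the first 5 bytes with "JMP rel32" to the real handler. */
      VirtualProtect(target, 5, PAGE_EXECUTE_READWRITE, &oldProtect);
      target[0] = 0xE9;                            /* x86 JMP rel32 opcode */
      *(int32_t *)(target + 1) =
         (int32_t)((uint8_t *)SimulateHardwareInterrupt - (target + 5));
      VirtualProtect(target, 5, oldProtect, &oldProtect);
      FlushInstructionCache(GetCurrentProcess(), target, 5);
   }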

4. Optimized memcpy() and memset()

The V2 emulator's JIT compiler recognizes the inner loops of memcpy() and memset() and replaces them with hand-tuned x86 code. The original ARM code uses STM instructions to write back to memory, about 32 bytes at a time. The hand-tuned code copies a full 4KB page at a time, reducing the number of MMU lookups to 1/128th of the original code.
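
Here's a rough sketch of the idea in C. MmuMapGuestAddress() is a hypothetical helper standing in for the emulator's MMU lookup; the point is simply that the translation happens once per page rather than once per 32-byte STM block.

   #include <stdint.h>
   #include <string.h>

   #define PAGE_SIZE 4096u

   /* Hypothetical MMU helper: map a guest virtual address to a host pointer. */
   extern void *MmuMapGuestAddress(uint32_t guestVa);

   /* Translate each source and destination page once, then copy everything
      that fits within those pages with a host memcpy, instead of taking an
      MMU lookup for every 32-byte STM block. */
   void EmulatedMemcpy(uint32_t dstVa, uint32_t srcVa, uint32_t length)
   {
      while (length > 0) {
         uint32_t dstRoom = PAGE_SIZE - (dstVa & (PAGE_SIZE - 1));
         uint32_t srcRoom = PAGE_SIZE - (srcVa & (PAGE_SIZE - 1));
         uint32_t chunk = length;
         if (chunk > dstRoom) chunk = dstRoom;
         if (chunk > srcRoom) chunk = srcRoom;

         /* One MMU lookup per page, not one per STM instruction. */
         memcpy(MmuMapGuestAddress(dstVa), MmuMapGuestAddress(srcVa), chunk);

         dstVa  += chunk;
         srcVa  += chunk;
         length -= chunk;
      }
   }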

5. Optimizing "/Od" Code-Gen from the ARM C/C++ Compiler

The V1 emulator assumes that the ARM instructions it emulates are optimized code... the output of an optimizing C/C++ compiler. For Visual Studio for Devices, that's an OK assumption. But for booting debug versions of Windows CE and Windows Mobile, the large amount of unoptimized code makes the emulator very slow. The V2 emulator adds a small peephole optimizer to tighten up code sequences like:

   str r0, [sp+10]
   ldr r3, [sp+10]

to:

   str r0, [sp+10]
   mov r3, r0

which saves an expensive MMU operation.
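
Here's a tiny hypothetical sketch of what such a peephole pass might look like, written against a made-up decoded-instruction record (the real JIT's internal representation isn't shown in this post).

   /* Simplified decoded-instruction record, invented for illustration. */
   typedef struct {
      enum { OP_STR, OP_LDR, OP_MOV, OP_OTHER } Op;
      int Rd;       /* source register (STR) or destination register (LDR/MOV) */
      int Rn;       /* base register (STR/LDR) or source register (MOV) */
      int Offset;   /* immediate offset */
   } ArmInstr;

   /* If an LDR reloads exactly what the previous STR just stored to the same
      [base + offset] address, rewrite it as a register-to-register MOV and
      avoid the second MMU lookup entirely. */
   void PeepholeStrLdr(const ArmInstr *prev, ArmInstr *curr)
   {
      if (prev->Op == OP_STR && curr->Op == OP_LDR &&
          prev->Rn == curr->Rn && prev->Offset == curr->Offset) {
         curr->Op = OP_MOV;
         curr->Rn = prev->Rd;    /* mov rDest, rStored */
         curr->Offset = 0;
      }
   }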

The emulator also shares TLB Index Cache values across short runs of ARM code. So even if the peephole optimizer weren't present, the str and ldr would share one TLB Index Cache value.

6. Faster Disassembly of ARM Instructions

Like the memcpy()/memset() optimization, the jit's disassembler now calls the MMU once to map the simulated instruction pointer to a physical address, then fetches as many instructions as possible in one batch, up to the edge of the memory page.
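
The shape of that fast path, again as a hypothetical sketch reusing the made-up MmuMapGuestAddress() helper from the memcpy example:

   #include <stdint.h>

   #define PAGE_SIZE 4096u

   /* Hypothetical MMU helper: map a guest virtual address to a host pointer. */
   extern void *MmuMapGuestAddress(uint32_t guestVa);

   /* One MMU lookup yields a run of instructions that extends to the end of
      the 4KB page - up to 1024 ARM instructions per translation. */
   void FetchInstructionRun(uint32_t guestPc, const uint32_t **run, uint32_t *count)
   {
      uint32_t bytesToPageEnd = PAGE_SIZE - (guestPc & (PAGE_SIZE - 1));

      *run   = (const uint32_t *)MmuMapGuestAddress(guestPc);
      *count = bytesToPageEnd / 4;    /* ARM instructions are 4 bytes each */
   }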

So that's it. Six simple optimizations gave us a substantial performance win on virtually every benchmark. Keep an eye out for more optimization work in future emulator releases!

Barry
