Common Sources of Processor Performance Penalties: Five Issues

Modern processors run at incredibly high clock rates and can often execute multiple instructions concurrently.  When things are going perfectly these machines plow through instructions at multiple billions per second, yet these maximums are rarely achieved even by the very best programmers.

Continuing in the spirit of my approximately correct articles, I thought it would be useful to talk about some of the things that can cause performance problems at a low level.  These issues tend to vary from processor to processor; even a single brand of processor might have different characteristics from version to version across otherwise highly compatible processor types.  Different word sizes and memory architectures also change the relative importance of these issues, but the ones below still tend to be the most important.

Naturally a complete discussion of these issues would fill a massive tome so this is necessarily incomplete and certainly only approximately correct for any given processor.

Issue #1: Cache Misses

Nothing kills your throughput faster than a ton of cache misses.  The bulk of your code and data accesses simply must come out of the cache or you’ll likely be giving up something like a factor of 1000 in execution speed for hitting main memory all the time.  If your processor has multiple levels of cache then the hottest code and data must fit in the “nearest” cache to the processor, which is naturally the smallest.  Let me break the root cause of the misses into three big buckets:

Raw Size

If you have too much raw code or raw data it simply isn’t going to fit.  10M of code can’t fit into 512k of cache no matter how you slice it.

Managed Code Considerations:  The JIT compiler tends to produce code that’s bigger than a raw native compiler would (see Qualitative Code Differences in Managed Code for some reasons why) and managed objects have some per-object overhead associated with them.  These can inflate the raw size of your code or data somewhat. 

Density

The processor does not fill its caches with memory on a byte-by-byte basis but rather loads whole cache lines which might be more like say 64 bytes.  Sometimes the code and/or data could fit but it’s badly organized so that it doesn’t fit.  For example you might need to access only 64k of data for some algorithm but the 64k you need is scattered within a 512k data structure.  That sounds like an implausible situation but let me make it more concrete.

Example 1: low density due to large objects (internal waste): Suppose you have a linked list of 64 byte objects perfectly cache aligned and you need to walk the list and perform a sum of one of the fields.  You will access two dwords (8 bytes) of the 64 to do the job.  You’ll be using only 1/8th of the memory that was fetched – the rest of the memory was useless to you.  That’s precisely the ratio that leads to 64k of needed data requiring 512k of cache.
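
To make that concrete, here is a minimal C# sketch of the pattern; the Node type and its field names are purely illustrative rather than taken from any real code:

```csharp
// Hypothetical Node type, padded out to roughly a full 64-byte cache line; the walk
// only ever touches the Next reference and the Value field.
class Node
{
    public Node Next;                  // needed to walk the list
    public int Value;                  // needed for the sum
    public long A, B, C, D, E, F;      // other per-object data the sum never reads
}

static int SumValues(Node head)
{
    int sum = 0;
    // Every node pulls a whole cache line into the cache, but only a few of those
    // bytes are actually used -- roughly the 1/8 density described above.
    for (Node n = head; n != null; n = n.Next)
        sum += n.Value;
    return sum;
}
```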

Example 2: low density due to mixing (external waste): Suppose you have that same linked list of objects except that they are now exactly 8 bytes each; however, during the allocation process other related and unrelated objects were allocated at the same time and they are side-by-side with the list you need to walk.  You have the same problem now because each cache line of data you read has the objects you need plus some other objects not needed for your calculation.  You can easily get a density of 1/8 as in example #1.
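
A hypothetical allocation pattern that produces this kind of mixing might look like the following sketch; the Small type and the particular "noise" allocations are just stand-ins for whatever else happens to be getting allocated at the same time:

```csharp
// Hypothetical illustration of Example 2: the small nodes we will later walk are
// allocated interleaved with other long-lived objects, so each cache line read
// during the walk also carries neighbors the calculation never needs.
class Small
{
    public Small Next;
    public int Value;
}

static Small BuildInterleavedList(int count, out object[] noise)
{
    noise = new object[count];            // keeps the neighboring allocations alive
    Small head = null;
    for (int i = 0; i < count; i++)
    {
        Small node = new Small();
        node.Value = i;
        node.Next = head;
        head = node;                      // the tiny node we actually care about
        noise[i] = new byte[24];          // an unrelated allocation right next to it
    }
    return head;
}
```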

Poor density is the leading cause of cache misses.  Sometimes we talk about this in terms of “locality” but good “locality” merely refers to the tendency of the code to access memory in contiguous chunks rather than haphazardly.  That is of course just another way of saying that there is a high density of “needed” bytes in any cache line.

Managed Code Considerations:  Per object overhead in managed objects makes the object slightly bigger and therefore can cause lower density of data that you need (like in Example 1) but in practice this isn’t the dominant effect density-wise.  For data, that overhead tends to be dwarfed by locality benefits that come from using a garbage collected memory system.  Allocations together in time tend to go together in space and temporary allocations tend to get squeezed out over time.  Those are both very good for locality.  Said another way, benefits from “Example 2” patterns more than erase losses in “Example 1” patterns. 

See Locality of reference and performance for more details.

Cache Misses Due To Conflict

This is both the most exotic situation and one where managed code doesn’t have any special issues, so I’ll treat it the most briefly. 

While any memory can be stored in the cache, normal caches are not able to store any piece of main memory at any place in the cache.  Generally there are only a small number of places (like say 4) into which any given piece of main memory can be copied.  Because of this you might be so unfortunate that, even though you only need a small amount of memory, all the pieces you need happen to come from areas of main memory that compete for the same cache locations.  In that case you couldn’t store it all in cache despite its small size.

In practice you don’t see a lot of this in real systems because there is such a variety of data being accessed that there is no pattern dominant enough to actually affect the performance.  Where you do see this is in small benchmarks which try to magnify a particular code path or access pattern.  This is one of the more painful cases to try to diagnose – a fluke conflict in a benchmark that looks like a real performance problem but is really just bad luck in a test case.
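
For the curious, a micro-benchmark that runs into this usually has an access pattern something like the following sketch; the stride and the cache geometry it assumes are made up for illustration:

```csharp
// Hypothetical conflict-miss pattern: only a handful of ints are ever touched, but
// they are spaced 64k apart, so on many caches they compete for the same few cache
// locations even though the total working set is tiny.
static int SumStrided(int[] data)
{
    const int StrideInInts = 16 * 1024;   // 64k bytes between touched elements
    int sum = 0;
    for (int pass = 0; pass < 1000; pass++)
    {
        for (int index = 0; index < data.Length; index += StrideInInts)
            sum += data[index];           // tiny working set, yet it may not stay cached
    }
    return sum;
}
```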

Issue #2: Bad Predictions

In order to keep all the components of the CPU “well fed” processors like to try to predict the future at least a little bit.  A classic example of this is guessing about the outcomes of branch instructions.  By taking a guess the processor can begin (partly) decoding instructions on the more likely path so that execution of those instructions is more economical when their turn comes.  Similarly processors might like to predict the outcome of indirect “calls”, “jumps”, and even “returns.”  Where prediction is either impossible or incorrect there are delays associated with restarting the processor’s pipeline.  Processors have limited internal memory devoted to prediction hints; things like the previous outcomes of branches, recent call instructions, or even more exotic criteria might be used.
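
To make the branch case concrete, here is a hypothetical loop whose cost depends almost entirely on how predictable its data makes the branch:

```csharp
// Hypothetical example: the loop is the same either way, but if the data is random
// the branch is close to a coin flip and mispredicts constantly, while sorted or
// mostly-uniform data lets the predictor settle into long, correct runs.
static int CountAboveThreshold(int[] data, int threshold)
{
    int count = 0;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] > threshold)          // prediction quality depends on the data
            count++;
    }
    return count;
}
```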

Managed Code Considerations:  It’s hard to say if managed code is better or worse than unmanaged code in terms of bad predictions.  In favor of managed code is consistent use of exception handling for error cases – this effectively removes a lot of branches from mainstream code, and sometimes/often in code that is very “hot.” Working against managed code is the large number of virtual functions and the associated indirect calls that are made to them, compounded by the fact that managed code writers tend to write smaller functions than their unmanaged counterparts.
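
Here is a minimal sketch of that exception-handling point, using int.Parse and int.TryParse merely as stand-ins for whatever rare failure you might encounter:

```csharp
// The checked form pays a failure test on every call, while the exception form keeps
// the mainline path free of failure branches and pays (heavily) only when the rare
// failure actually happens.
static int ParseWithCheck(string text, int fallback)
{
    int value;
    return int.TryParse(text, out value) ? value : fallback;   // branch on every call
}

static int ParseWithException(string text, int fallback)
{
    try
    {
        return int.Parse(text);           // no failure branch in the common path
    }
    catch (System.FormatException)
    {
        return fallback;                  // very costly, so this had better be rare
    }
}
```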

Issue #3: Virtual Memory

Of course using more memory is generally slower than using less memory and I’ve already covered those sorts of issues under “Raw Size” above.  But let’s talk about the mechanisms of virtual memory specifically as they apply to the processor.

Virtual memory addresses have to be mapped into physical memory addresses.  The operating system keeps certain tables up to date to indicate to the processor how that is to be done.  However the processor doesn’t want to be looking up those tables on every memory access.  To avoid this lookup processors use a “Translation Lookaside Buffer” (TLB) – which is basically yet another cache.  The TLB keeps track of recent virtual addresses and their physical mappings and, like everything else on the processor, is a limited resource.  If the processor is unable to find the virtual address mapping it needs in the TLB then it will have to go to memory (hopefully via the caches) to get those mappings, but this converts a normally free operation into a potentially costly one.

The reasons for TLB misses tend to mirror the ones for cache misses with density being the key problem, only now we’re talking about density of virtual memory pages and not density of cache lines.  The “internal” (example 1 above) density problems become much more significant and “external” (example 2 above) density issues can be disastrous.
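
A hypothetical illustration of page-level data density (assuming 4k pages) is a simple matrix walk; both loops below touch exactly the same bytes, but one of them needs a fresh page on nearly every access:

```csharp
// Each row of the matrix is 4096 bytes, so the row-major walk makes ~1024 accesses
// per page while the column-major walk lands on a different page -- and may need a
// different TLB entry -- on almost every access.
static long SumBothWays()
{
    const int Rows = 1024, Cols = 1024;   // 1024 ints per row = 4096 bytes
    int[,] matrix = new int[Rows, Cols];
    long sum = 0;

    for (int r = 0; r < Rows; r++)        // row-major: dense use of each page
        for (int c = 0; c < Cols; c++)
            sum += matrix[r, c];

    for (int c = 0; c < Cols; c++)        // column-major: a 4k jump on every access
        for (int r = 0; r < Rows; r++)
            sum += matrix[r, c];

    return sum;
}
```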

When considering TLB misses, good code density takes on an importance equal to, if not greater than, that of data density.  Recall that, conversely, data density is more likely to dominate cache problems than code density.

Managed Code Considerations:  Jitting code can be a great benefit to your TLBs because related code from different functions is jitted together: code that is called together is jitted together.  But conversely that same tendency to write smaller functions, and small function wrappers, means that sometimes the instruction pointer will dance through various different assemblies visiting tiny helpers and moving on to something closer to where the work will be done.  A dancing instruction pointer by definition makes only passing use of the code on any given virtual memory page – more pages are visited, requiring more TLB entries for good performance.  Large precompiled assemblies need to be well tuned to get good density or they will suffer from TLB misses (not to mention extra loading time).

Issue #4: Memory Traffic

If you think about the processor’s main memory in pretty much the same way as you think about a disk drive, you are likely to be well served.

When the processor is going to read memory it is very useful if it can begin fetching that memory long before it is needed – prediction is a great thing.  However processors have limits as to how many fetches they can have outstanding at any given moment. The operating system uses similar strategies to try to pre-fetch disk sectors it thinks you are highly likely to need soon.

When the processor is going to write to memory it is useful to defer those stores to a time when the memory system is less busy, maybe even idle.  However processors have limited internal memory to keep track of these pending stores.  The operating system uses a similar approach by deferring writes to your files so that they can be done in the most economical order – or perhaps not done at all (e.g. temporary files might be deleted before they are ever fully written to disk).

Code patterns that do a lot of consecutive (especially dependent) reads perform much worse than code patterns which read a little, think a little, read a little, etc.  Likewise for writes – scheduling processor work between writes gives it time to “catch up” effectively making the writes much cheaper.
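
A sketch of that difference in C# might look like this; the Item type is hypothetical:

```csharp
// Back-to-back dependent reads versus reads the processor can overlap.
class Item
{
    public Item Next;
    public int Value;
}

static long SumByPointerChasing(Item head)
{
    long sum = 0;
    // Each fetch depends on the previous one (we need Next before we can read the
    // next node), so the reads line up single file and the memory system idles.
    for (Item n = head; n != null; n = n.Next)
        sum += n.Value;
    return sum;
}

static long SumFromArray(int[] values)
{
    long sum = 0;
    // Independent, predictable addresses: the processor can keep several fetches
    // outstanding and do the arithmetic while it waits.
    for (int i = 0; i < values.Length; i++)
        sum += values[i];
    return sum;
}
```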

Managed Code Considerations:  Managed code is object-oriented and object-oriented code generally tends to have more indirections than “functional” code.  So there’s a penalty to be expected from the outset.  Looking past that basic penalty, managed code has things like Write Barriers (see Garbage Collector Basics and Performance Hints).  Extra operations of any kind use those limited resources.  Competing with this is the presence of good primitive data types that are readily useable.  These data types encourage developers to make use of better data structures, so you see fewer linked lists and more ArrayLists – that’s good for traffic.

Issue #5: Restrictions Due to Concurrency

If the other sections were brief this section is ultra brief but I want to at least mention this topic to foster some discussion and because it is important.

Anytime you are creating data structures that will be used by multiple threads things get complicated.  Considering only the most basic things that must happen from the processor’s perspective, concurrency support in data structures turns into operations that require the memory subsystem to interact with the caches to stay “current.”  These take the form of reads that must observe a write that potentially happened on another processor and writes that must be flushed to all processors immediately.  Both of those operations are giant monkey-wrenches in the processor’s normal order of business.

Managed Code Considerations:  Again, staying ultra brief in this discussion, many if not all of the issues managed code faces are also faced by unmanaged code.  If there is any advantage to be had it is with the widespread availability of synchronization objects that are well implemented and easy to use (like Monitor.Enter via the lock construct), a highly scalable memory allocator (the GC), and less reliance on concurrency death knells like Apartment-based threading models.  All of which can be totally destroyed by one piece of code that uses InterlockedIncrement with wild abandon…
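
To make the “easy to use” part concrete, here is a minimal sketch of the lock construct mentioned above; the SharedCounter type is hypothetical:

```csharp
// The lock construct expands to Monitor.Enter/Monitor.Exit, and the counter is only
// ever touched under the lock, so the cross-processor flushing described above is
// confined to one well-understood place instead of ad hoc interlocked operations
// sprinkled through the code.
class SharedCounter
{
    private readonly object _sync = new object();
    private int _count;

    public void Increment()
    {
        lock (_sync)        // Monitor.Enter(_sync) under the covers
        {
            _count++;
        }                    // Monitor.Exit(_sync)
    }

    public int Read()
    {
        lock (_sync)
        {
            return _count;
        }
    }
}
```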

And I think that’s all I have to say on this topic at least for now.  I think I’d like to end again by apologizing for the incompleteness of this all.  There is much more to say in all of these areas so please don’t make giant conclusions on the basis of this limited primer but rather use it as food for thought.

Comments

  • Anonymous
    March 01, 2005
    Rico,
    Gotta say - I'm loving all these performance articles. They're interesting for themselves, and also for what they incidentally tell us about the internals.

    Particularly loved the previous ones where you did the GC tests.
  • Anonymous
    March 01, 2005
    Thanks for the kind words Darren.

    I had some thoughts as to what topics might have made a good #6, #7 etc. but maybe it's more useful to see what my readers think should have been #6.
  • Anonymous
    March 01, 2005
    Yep, excellent article as per usual Rico.

    Cache misses due to internal waste must be very common indeed. How often do you have a collection of objects and want to perform an operation or access some data of each of those objects?

    I don't really see any way to combat this problem, besides keeping types as small as possible and limiting loops over objects to where they're absolutely needed (perhaps even combining multiple loops into one). Would you agree?
  • Anonymous
    March 01, 2005
    Great article. I love it. But I don't really know how to translate it to 'best practices'. Are there practical techniques (programming in .NET) to avoid these issues as much as possible?
  • Anonymous
    March 01, 2005
    "For data, that overhead tends to be dwarfed by locality benefits that come from using a garbage collected memory system."

    This should say "a compacting garbage collected memory system". Emphasis on "compacting". A garbage collector that doesn't compact its heap does not get this benefit.

    Great article :)

    -Ryan
  • Anonymous
    March 02, 2005
    That's a good point about "compacting" -- I'm too biased by our own but there are of course other ways to do garbage collection.

    Several of the comments directly or indirectly ask how these thoughts might translate into best practices. I think that might make a good companion article. I'll try to get to that soon.
  • Anonymous
    March 02, 2005
    The comment has been removed
  • Anonymous
    March 03, 2005
    The comment has been removed
  • Anonymous
    March 03, 2005
    Now I'm going to disagree with you but before I do that I want to thank you for that lucid comment because there's gems in there and this is a very slippery problem.

    Eric writes: "[I]n the real world [locality] is GC's effective Achille [sic] heel"

    That hasn't been my observation. If I had to pick the GC's weakness it's in patterns that put it in crisis (see "Mid-life crisis" http://weblogs.asp.net/ricom/archive/2003/12/04/41281.aspx). Customers can and do run into those -- it's something I see when people come to our labs.

    In contrast I've yet to observe locality problems attributable to the GC.

    On the other hand I have seen plenty of locality issues attributable to pointer spaghetti. That problem is universal though -- no allocator can save you from pointer spaghetti.

    But I suspect there have been lots of papers written on this topic. I don't think there is a clear answer that's universally true. I can only speak to problems my customers tend to have or not have.

    Finally, you should assume that what I write is biased. Not because I'm being nefarious but because I base my writings on the data that I have and I don't have all the data. It's always good to ask yourself "Are my scenarios similar to the ones that Rico is likely to have seen? Are my scenarios like those of large Microsoft customers?" If they are then it's more likely that my opinion will be relevant to you.
  • Anonymous
    March 03, 2005
    The comment has been removed
  • Anonymous
    March 03, 2005
    > It's always good to ask yourself "Are my scenarios similar to the ones that Rico is likely to have seen?
    > Are my scenarios like those of large Microsoft customers?"
    >If they are then it's more likely that my opinion will be relevant to you.

    Well, a better question would be: in which scenario does my theory prove correct, ie. the GC does effectively deliver higher locality, as measured with CPU-level cache miss tools?

    Don't take it personally, you wouldn't be the first person believing in GC locality and then discovering how poor it can actually be (it may be higher than that of the Windows Heap Manager, but then, that's no great challenge to begin with).

  • Anonymous
    March 04, 2005
    Related to #2: I could have sworn I read somewhere (maybe BradA...I'm too tired to look it up tonight) that due to the cost (assuming stack-related) of throwing an exception, it is better to prevent one via a conditional than let an exception get thrown. Or maybe I'm misreading it, and you mean that you should use exceptions instead of return values to indicate error. So basically, I'm asking whether the savings of exceptions is in the throwing or in the catching.
  • Anonymous
    March 04, 2005
    Be careful now. There is no savings in either the throwing OR the catching. Exceptions always cost you tons.

    The savings you get are as follows and they're somewhat indirect.

    Suppose you have exotic failure conditions. In regular code you have to check each one in each place it might happen: if (blah) if (blahblah) if (blahblahblah). Those tests are present even though the failure may never actually happen. You always pay the cost of the test. If you use exceptions you don't have those tests. So you win if the exception doesn't happen or happens very rarely.

    If the exception happens fairly frequently you LOSE big time. Hence my general rule is, don't use an exception unless the failure happens less than about 1 time in 1000. And more exotic is better still. If the exception is too common then you will lose.

    See http://weblogs.asp.net/ricom/archive/2003/12/19/44697.aspx
  • Anonymous
    March 08, 2005
    "In constrast I've yet to observe locality problems attributable to the GC."

    This is less true if your heap is not in a contiguous space.