Background and Foreground GC in .NET 4
Another interesting new feature of CLR 4 comes from the garbage collection team. In this version, they are adding some performance enhancements to the memory allocation process. The feature is commonly called “Background GC”. But what does it actually mean?
As applications start to consume more memory, and some of them move to the wider address space of 64-bit processes, we have started to see latency issues when allocating memory while a full GC is running. As you may remember, the workstation version of the CLR uses a concurrent GC. This means that the GC thread runs in parallel without blocking the application (well, we try to minimize the blocking time). This thread scans Gen 2 in order to mark dead objects. This operation can take some time if the amount of allocated memory is quite large, and it prevents ephemeral collections while it is running. OK, now you may be asking what that means in practice, so let me walk through it.
Let’s analyze how the current concurrent GC works and the scenario that we are improving:
Now, our application needs to perform a full GC. For this it will scan generation 2 and try to mark the dead objects as free objects. This is executed by the GC thread; the simplified steps that it takes are the following:
1) It starts marking the objects, checking the stacks and the GC roots. Allocations are still allowed during this operation, which means that your application may create a new object and it will be allocated in generation 0.
2) At this point the GC needs to suspend the EE (execution engine), which stops all the threads in your application. At this stage no allocation is allowed and your application may suffer some latency.
3) The EE is resumed so that the GC can continue working on the heap and the other bits and pieces it needs to handle; at this stage allocation is allowed again. But what happens if our ephemeral segment fills up while this collection is in progress?
4) At this stage the ephemeral collection cannot swap segments, so the allocation is delayed, adding latency to your application.
As you can see, the problem is that a single GC thread cannot cope with those two operations at the same time. The current ephemeral segment is 16 MB (note that this may change in the future, so don’t rely on this value!). This means that you can only allocate up to 16 MB, or whatever is available at the time of allocation, and as you have seen in the example, this space may run out before the GC finishes! I hope you now understand why we don’t recommend calling GC.Collect() without a good reason :).
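To make the cost of an explicit GC.Collect() visible, here is a minimal sketch, assuming a plain console application (the class and method names are hypothetical, chosen only for illustration). It compares the generation 2 collection counter before and after the call; GC.Collect() with no arguments forces a collection of all generations, so the counter is expected to go up by at least one.

```csharp
using System;

public static class ForcedCollectionDemo
{
    // Returns how many generation 2 (full) collections were recorded
    // across a single explicit GC.Collect() call.
    public static int FullCollectionsCausedByCollect()
    {
        int before = GC.CollectionCount(2);
        GC.Collect(); // forces a collection of all generations
        int after = GC.CollectionCount(2);
        return after - before;
    }

    public static void Main()
    {
        Console.WriteLine("Gen 2 collections recorded across GC.Collect(): {0}",
            FullCollectionsCausedByCollect());
    }
}
```

Every one of those forced collections is a full collection of the kind described in the steps above, which is why sprinkling GC.Collect() through an application tends to add latency rather than remove it.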
OK, now let me introduce you to the background GC. This model has been optimized to reduce the latency introduced by the scenario described above. The solution came from the idea of having a background GC that works as described above, plus a foreground GC that is only triggered when the ephemeral segment needs to be collected while a generation 2 collection is in progress.
Now, if we repeat the scenario above and try to allocate memory in the ephemeral segment while the background GC is marking, the foreground GC will execute an ephemeral collection:
The ephemeral foreground thread will mark the dead objects and swap segments (as this is more efficient than copying the objects to generation 2). The ephemeral segment with the free and allocated objects becomes a generation 2 segment.
As you can see, the allocation is now allowed, and your application does not need to wait for the full GC to finish before it can allocate.
Note that this enhancement is only available on workstation, as the server version is a blocking GC per core and we didn’t have enough time to port these enhancements to it, but it is definitely in our plans for the near future. We have tested this solution on 64 cores, but our future objective is to hit the 128-core mark, as the SQL team has proven :).
I hope this blog post makes it a bit clearer how the GC works and what kind of enhancements we are including in .NET 4.
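Since the enhancement depends on which GC flavor the process is running, it can help to check the mode at run time. The following is a small sketch, assuming a console application (the GcModeInfo class is a hypothetical name); it only reads the standard System.Runtime.GCSettings properties. A latency mode of Interactive indicates that concurrent collection is enabled, while Batch indicates that it is disabled.

```csharp
using System;
using System.Runtime;

public static class GcModeInfo
{
    // Builds a short description of the GC configuration of the current process.
    public static string Describe()
    {
        string flavor = GCSettings.IsServerGC ? "server" : "workstation";
        return string.Format("{0} GC, latency mode {1}", flavor, GCSettings.LatencyMode);
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

The exact output depends on the runtime version and on the application configuration, so treat it as a diagnostic aid rather than a guarantee of which collector will run.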
Comments
Anonymous
June 22, 2009
Are there any plans to make the ephemeral heap size configurable?
Anonymous
June 22, 2009
Hi John, no, there are no immediate plans for it. Let me explain why. In each version we review how we can optimize the ephemeral segment size; at the moment it is 16 MB, but this can change depending on whether you are running 32-bit or 64-bit. As you can see from the post, we swap segments on collection (rather than copying the objects), making the Gen2 n(16) bigger. If you just adjust this size randomly it may make allocation slower, as the segment size is essential for us in order to keep fragmentation to a minimum. Maybe if you can explain in which scenario this is necessary, we can work together on a solution or propose it for future versions. Best regards, Salva
Anonymous
June 23, 2009
I'm just trying to come up with the best performance scenario for an application that has a very large memory footprint. I really almost need the GC to run in server mode to take advantage of my 8 cores to sweep through my Gen 0/1 GCs really quickly. But the Gen 2 garbage collection times I am seeing are unacceptable for my application, which has me pressing for background GC. I'm trying to find the right balance to eliminate latency and take advantage of my server configuration. The single 16 MB address space has my Gen 0/1 GCs kicking off so frequently that it's not letting my application keep up with responses. I am fine in server mode with keeping up (8 16 MB address spaces), but the occasional Gen 2 GC is way too long.
Anonymous
June 24, 2009
Are you running your application on a client or a server (you said "I really almost need the GC to run in server mode")? If it is a client, you can change the configuration to use server mode in client applications: <runtime> <gcServer enabled="true" /> </runtime> Now, if you are in server mode and your problem is the generation transition, you can allocate a large array instead of allocating mini-objects. Remember that large objects go straight to the Large Object Heap (LOH) and won't be swept by ephemeral collections. There is also the question of why you are doing full collections: ephemeral collections are usually really fast (due to the segment switch); only Gen2 collections may be slow if the generation is too big, but this should only happen under memory pressure (or are you calling GC.Collect()?). Thanks, Salva
Anonymous
June 24, 2009
The comment has been removed
Anonymous
June 24, 2009
That is really interesting; we are glad that the new background GC is making some improvements to your application. Unfortunately we couldn't release the background GC for the server edition due to lack of time, but it is definitely on the roadmap, maybe in a service pack, but we don't know yet. Server mode would use a foreground and a background thread per core, really speeding up the operations (at least in our initial tests). You can get a notification when a Gen2 collection is about to happen; it would be interesting to see which scenario is triggering the Gen2 collection. How much memory is your application using? Also, is it running in 64 or 32 bits?
Anonymous
June 29, 2009
Hi, Is there any plan to support x32 NUMA architectures with more than 8 CPUs? I've heard there is a limit on the GC above this number. Can you confirm? Regards
Anonymous
July 01, 2009
Hi Otavio, The GC is not aware of the NUMA architecture at all. Regarding the 8-core limit, this applies to all architectures, as we don't balance the heaps when we have more than 8 cores. That is the current limit as shown in performance tests. We are working on this for future versions. Hope this helps
Anonymous
August 23, 2009
Salva, I read this article with great interest. Can you confirm my understanding: In CLR 2.0, when an ephemeral segment runs out of memory during a Gen 2 GC, that introduces additional latency because the ephemeral segments cannot be swapped during a Gen 2 GC. The 4.0 CLR removed this limitation. I work on an application where the lowest latency is of utmost importance. I understand that ephemeral GCs are always blocking. Can you tell me what factors affect the extent of the blocking? In other words, if I feel like I have too much blocking, what should I look for in my code? Thanks in advance -Sev
Anonymous
August 24, 2009
Hi Sev Z. You are right, CLR 4.0 removes this limitation, but only on workstation; this is not the case for the server version of the CLR (due to lack of time), where it will be introduced in our next release. Regarding your query, the first thing you need to look for in your code is whether you explicitly trigger any collection (i.e. GC.Collect()). The latency happens when the ephemeral segment runs out of memory, triggering a Gen 0 / Gen 1 collection. Gen 2 collections are not very common unless you have memory pressure. One way to overcome this when you use many small objects is to work with an object container (like an array) big enough to go to the Large Object Heap, as it will sit alongside Gen 2 and won't suffer from ephemeral collections. Applications like Process Explorer from Sysinternals can tell you the number of collections in each generation; if you have too many Gen 2 (full) collections, it means that you may be running out of memory. Ephemeral collections just do a segment swap, and their latency is minimal. Hope this helps
Anonymous
August 25, 2009
Salva, Thanks for the timely response! We do not explicitly call GC.Collect(). We're getting a full GC every 2-3 minutes. We're using workstation mode. We're allocating 2-3 MB per second, but it's all short-lived data. The processes are long-running and use under 100 MB of memory. We're processing a lot of events. Processing an event requires a bit of memory to be allocated, but after the event is processed, the memory is completely reclaimed. Hardware is quad core with 8 GB RAM. There is plenty of free memory. How can I tell with certainty if my ephemeral segment is running out of memory? Based on what you are saying, it sounds like under ordinary circumstances the ephemeral collection blocks the application for a small amount of time that is fairly constant. -Sev
Anonymous
August 25, 2009
The ephemeral segment is 16 MB at the moment, so it is easy to run out of space in generation 0 and generation 1. But this should be fast enough. The only delay occurs when there is a Gen 2 collection, but this is triggered by memory pressure (90% of the time, unless you are collecting explicitly). Are you using the 64-bit version of the CLR? If so, the GC heap should get plenty of memory from the virtual memory manager and no collections should be necessary. The blocking time of a Gen 2 collection will increase with the Gen 2 size and granularity. Some workarounds:
- If you are using blittable objects as fast allocators, you can get around this by using an array allocated on the stack with stackalloc; this usually works fine, but only if the object is blittable.
- If you are using custom objects, my recommendation would be to create a large object such as an array, so that it is allocated in the Large Object Heap, and then reuse the same memory; this is very common in native applications and works well in the managed world.
- Avoid promotion of short-lived objects. A very common scenario is a long-lived object referencing a short-lived one; this promotes the short-lived object to the long-lived generation, triggering a collection (I believe that this is what is happening in your scenario).
- If you need to know when a full collection will occur, the CLR can notify you when it is about to happen; this is a good moment to take the necessary measures to prevent your application from failing. Use RegisterForFullGCNotification.
- If you are using a multicore machine, you can try disabling the concurrent GC (it is enabled by default in workstation mode), as each core will have its own GC and WILL block all the allocations on that GC. If this is your case, you can enable the concurrent GC (http://msdn.microsoft.com/en-us/library/yhwwzef8.aspx) and see if it works better. I hope this helps, Salva
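The second workaround in the list above (one large array that is reused instead of allocating an object per event) could look roughly like the sketch below. Everything named here (EventRecord, EventBuffer, the field layout, the capacity) is hypothetical and chosen only for illustration. The sketch assumes that arrays of about 85,000 bytes or more are placed in the Large Object Heap and that GC.GetGeneration reports such objects as belonging to the maximum generation; note also that the LOH is collected together with generation 2, so the benefit comes from not allocating per event, not from the array being exempt from full collections.

```csharp
using System;

// Hypothetical fixed-size event record; a struct, so the array stores values inline.
public struct EventRecord
{
    public long Timestamp;
    public double Value;
}

public sealed class EventBuffer
{
    private readonly EventRecord[] slots;
    private int next;

    public EventBuffer(int capacity)
    {
        slots = new EventRecord[capacity];
    }

    // Generation that the runtime reports for the backing array.
    public int BackingGeneration
    {
        get { return GC.GetGeneration(slots); }
    }

    // Overwrites the next slot in ring order instead of allocating a new object.
    // Returns the index that was written.
    public int Write(long timestamp, double value)
    {
        int index = next;
        slots[index].Timestamp = timestamp;
        slots[index].Value = value;
        next = (next + 1) % slots.Length;
        return index;
    }

    public EventRecord Read(int index)
    {
        return slots[index];
    }
}

public static class EventBufferDemo
{
    public static void Main()
    {
        // 16 bytes per record * 10,000 records = 160,000 bytes,
        // comfortably above the large object threshold assumed above.
        var buffer = new EventBuffer(10000);
        for (int i = 0; i < 25000; i++)
        {
            buffer.Write(i, i * 0.5);
        }
        Console.WriteLine("Backing array generation: {0} (max generation: {1})",
            buffer.BackingGeneration, GC.MaxGeneration);
    }
}
```

Because the records are structs stored inline, writing an event touches only existing memory and creates no new generation 0 objects, which is the point of the workaround. Whether this actually reduces the full collections seen in a given application would need to be measured, for example with GC.CollectionCount.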