Auditing system impact on performance
UPDATE 2010-06-06 (EricF) - Fixed Vista+ architecture image; link was broken on migration to new blog platform
I get questions from time to time, such as my recent offline question from Steve, about what performance impact auditing has on the system as a whole.
To answer this you need to understand a couple of things:
- Auditable activity is implemented as instrumentation (e.g. a function call to the auditing system) inside the code that does something auditable.
- The auditing system in Windows has two sets of programmatic interfaces for introducing an event, one in kernel mode and one in user mode - so the component generating audit does not need to switch between kernel and user modes.
- Although audit policy is stored in user mode, we cache a copy of the relevant policy for kernel-mode components, in kernel mode. This means that no mode switch is necessary to check audit policy to decide whether to generate an event.
- There are user-mode and kernel-mode queues for audit events. The call to generate an audit event actually just queues the event, assuming the queue is not full. So from the perspective of the component generating audit, audit has an "asynchronous" flavor under light-medium loads. Under heavy loads when the queues fill, audit blocks the component raising the audit until the event can be queued, showing its true synchronous behavior.
- Dequeuing audit events always occurs on a separate thread than enqueueing so that raising audit events and writing them to the log don't affect each other's perf under light to moderate load.
- The pre-Vista auditing system in the kernel delivers events to LSA. The Vista+ auditing system in the kernel delivers most events directly to ETW, the kernel mode event trace engine, which means that most of the kernel audit (including the potentially perf impacting object access events) doesn't require a mode switch at all.
- The LSA formats events and then delivers them to the event log. In WS03, events are batched in the RPC call to eventlog. In Vista+, delivery is done by means of ETW in almost all cases.
- ETW queues events and spools them to the Windows eventlog service as fast as the service will accept them.
- The eventlog service writes the events to the log file as they arrive.
I have uploaded graphics of the Windows XP/Windows Server 2003 auditing architecture, and the Windows Vista/WS08/Windows 7 architecture, to make this process more clear:
Windows Vista+ Auditing Architecture
So now back to the original question- what is the impact of auditing on performance?
At low auditing loads, auditing generally has no discernable impact on perf. If you were hardcode with a profiler and iterated an auditable activity a million times I am sure you'd be able to measure it, but for reasonable audit policies you won't notice a significant difference.
At high auditing loads, auditing has a significant performance impact. This is more true of pre-Vista multiprocessor systems than of systems with the new eventlog system.
For example, a multi-processor domain controller (say a 32-processor box) running Windows Server 2003, might run into problems under extreme load. Why is that? Because ultimately the limiting factor on event rate is how fast you can write the events to disk. Pre-Vista eventlog has a single thread writing events to disk. So even though you might have 32 threads servicing authentication requests (an auditable activity), each of them is queueing to a single audit queue which is ultimately despooling to eventlog via RPC on a single thread, and eventlog is only writing to the security log with a single thread. What we observe in practice in this case is that a single processor on the system goes to 85-100% utilization, and the other processors drop to a very low utilization as the authentication threads are blocked waiting for the audit function call to return. This call won't return until the queue is not full, and the queue is waiting on RPC which is waiting on eventlog... so eventlog governs the rate.
In Windows Server 2003, we added a particular optimization only for the security event log, which batches events in the RPC call to eventlog. This means that you can get more event throughput in the security log than in other logs on the system. It didn't eliminate the bottleneck, but it pushed back the limit, so WS03 on typical hardware should be able to log several thousand events per second to the security event log. Previous versions were only able to log about 1000 events per second.
Note that the change in performance characteristics occurs all at once. So the impact tends to be trivial until the queues fill, at which point the impact is severe. It does not scale linearly, there's a discrete behavior change. What this means realistically is that if you ever encounter a performance problem with auditing, then you probably just need to turn it down a little and you won't have a problem any more.
In Vista and subsequent releases, audit queues events via ETW. ETW was designed for high-performance kernel tracing, and in the auditing team we tested it to over 10,000 (10.000 for you folks in Germany :-) events per second before we decided that we had hit our scale targets. We never tested exactly how high it would go, but we were satisfied that the eventlog service was no longer a bottleneck in realistic scenarios.
There are some edge cases where you might run into performance problems by trying to audit too much in a critical path. For instance, it is a really really bad idea to put SACLs on your entire registry. If you monitor registry activity with a tool like Process Monitor, you will notice that when a system is not idle, there are often hundreds or thousands of registry accesses per second. If you impose an auditing tax on each of those activities you will notice a degradation in performance. Not to mention that the resultant mountain of events is probably not very valuable. Of course you can tune SACLs as I have mentioned before, but I doubt that it's useful to take the time to tune SACLs for the entire registry.
One last point is that the eventlog is writing the events somewhere. Wherever it is writing events, it is consuming disk I/Os and competing with anything else writing to the same volume. If you have a disk performance problem on that disk, it can result in an auditing performance problem, as everything else will back up if the eventlog can't write events to disk fast enough. So one thing you can do is ensure that the disk where your log is placed has enough I/Os.
In summary audit has very minimal impact unless you do a whole lot of it, in which case it can have severe impact on your system. The change happens suddenly, not gradually, so you can do a lot of auditing with no problem. If you run into a problem, turn it down just a little (or little by little) and at some point the behavior will change such that you won't have any significant perf impact anymore.