Share via


ACS' first bug from being too performant

We got several reports recently of a bug in ACS that certain DS Access events, primarily for dnsNode and dnsZone objects, don't properly get looked up.

Some background: the event log in Windows prefers to log invariants such as message IDs, parameter message IDs, SIDs (security IDs which represent users and groups, etc.), and GUIDs (globally unique IDs which represent objects in Active Directory), rather than the actual names of the objects.  At view time the viewing application is expected to look up the name associated with the invariant and display it to the user.

The reasons that Windows does this are (1) that it enables localization, so that English speakers can see "Administrator" and French speakers can see "Administrateur", and (2) that it provides rename safefy- many objects are rename-able, such as domain accounts and other AD objects.

Anyway in ACS we had to solve the problem of how to store mountains of log data in a database, make it queryable in meaningful ways, preserve original format, present to users in a recognizable/understandable format, etc.

The way we chose to solve our several problems was to take strings that contained an invariant and append the translated name or string.

For example:
%{e0fa1e8c-9b45-11d0-afdd-00c04fd930c9}
would be translated to:
%{e0fa1e8c-9b45-11d0-afdd-00c04fd930c9}=”dnsNode”

and
%%7685
becomes:
%%7685=”Write Property”

As I mentioned, though, we ran into a problem recently.  Some of our customers were monitoring AD objects with ACS and noticed that ACS was not translating the GUIDs for certain objects.  When they manually looked up the GUIDs they noticed that they were for AD-integrated DNS objects.

After investigation, we found that AD was logging certain audit events for the objects, before all the attributes of the objects had been populated- DNS was populating the objects in multiple operations per object and each operation causes a separate event.  So ACS, which operates as close to real-time as we could get, was actually noticing the first event and asking AD "what's this?" before DNS had finished updating AD with things like the object's name.  The difference in time was literally only milliseconds.

Anyway I didn't really feel it was an ACS bug and wanted to file a bug against Windows DNS Server.  However the Operations Manager team has prototyped a configurable behavior for the ACS agent that lets it wait a very short time (configurable number of ms) and retry, when it fails to look up an AD object because the object doesn't exist.  This might be released as a public patch and/or in a future Service Pack.

I thought you might appreciate stories of the kinds of weirdness we run into.