Larry goes to Layer Court
Two weeks ago, my boss, another developer in my group, and I had the opportunity to attend "Layer Court".
Layer Court is the end product of a really cool part of the quality gate process we've introduced for Windows Vista. This is a purely internal process, but the potential end-user benefits are quite cool.
As systems get older, and as features get added, systems grow more complex. The operating system (or database, or whatever) that started out as a 100,000-line paragon of elegant design slowly turns into fifty million lines of code that bear a distinct resemblance to a really big plate of spaghetti.
This isn't something specific to Windows or Microsoft; it's a fundamental principle of software engineering. The only way to avoid it is extreme diligence - you have to be 100% committed to ensuring that your architecture remains pure forever.
It's no secret that regardless of how architecturally pure the Windows codebase was originally, lots of spaghetti-like issues have crept into the product over time.
One of the major initiatives that was ramped up with the Longhorn (now Windows Vista) reset was the architectural layering initiative. The project had existed for quite some time, but with the reset, the layering team got serious.
What they've done is really quite remarkable. They wrote tools that perform static analysis of the Windows binaries and work out the architectural and engineering dependencies between various system components.
These can be as simple as DLL dependencies (program A references DLLs B and C, DLL B references DLL D, DLL D in turn references DLL C), or as complicated as RPC dependencies (DLL A has a dependency on process B because DLL A contacts an RPC server that is hosted in process B).
The architectural layering team then went out and assigned a number to every single part of the system starting at ntoskrnl.exe (which is the bottom, at layer 0).
Everything that depended only on ntoskrnl.exe (things like win32k.sys or kernel32.dll) was assigned layer 1, the pieces that depend on those (for example, user32.dll) got layer 2, and so forth (btw, I'm making these numbers up - the actual layering is somewhat more complicated, but this is enough to show what's going on).
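To make the numbering concrete, here is a minimal sketch of how layers could be derived from a dependency map. It is not the layering team's tool; the function name `assign_layers` and the toy graph are made up for illustration, and the graph is as simplified as the numbers above.

```python
# Toy dependency map: component -> components it calls into.
# Deliberately simplified, like the numbers in the post; not the real Windows graph.
DEPENDS_ON = {
    "ntoskrnl.exe": [],
    "win32k.sys": ["ntoskrnl.exe"],
    "kernel32.dll": ["ntoskrnl.exe"],
    "user32.dll": ["kernel32.dll", "win32k.sys"],
}


def assign_layers(depends_on):
    """Return {component: layer}; a layer is 1 + the highest layer depended on."""
    layers = {}

    def layer_of(name, stack=()):
        # A cycle means no consistent numbering exists, so report it.
        if name in stack:
            raise ValueError("dependency cycle: " + " -> ".join(stack + (name,)))
        if name not in layers:
            deps = depends_on[name]
            if not deps:
                layers[name] = 0  # nothing below it: the bottom layer
            else:
                layers[name] = 1 + max(layer_of(d, stack + (name,)) for d in deps)
        return layers[name]

    for name in depends_on:
        layer_of(name)
    return layers


LAYERS = assign_layers(DEPENDS_ON)
```

With this toy graph, ntoskrnl.exe lands at layer 0, win32k.sys and kernel32.dll at layer 1, and user32.dll at layer 2. A cycle in the map raises an error instead of producing a number, which is exactly the situation the layering rules are meant to prevent.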
As long as the layering is simple, this is pretty straightforward. But then the spaghetti problem starts to show up. Raymond may get mad, but I'm going to pick on the shell team as an example of how a layering violation can appear. Consider a DLL like SHELL32.DLL. SHELL32 contains a host of really useful low level functions that are used by lots of applications (like PathIsExe, for example). These functions do nothing but string manipulation of their inputs, so they have virtually no lower level dependencies. But other functions in SHELL32 (like DefScreenSaverProc or DragAcceptFiles) manipulate windows and interact with a large number of lower level components. As a result of these high level functions, SHELL32 sits relatively high in the architectural layering map (since some of its functions require high level functionality).
So if a relatively low level component (say the Windows Audio service) calls into SHELL32, that's what is called a layering violation - the low level component has taken an architectural dependency on a high level component, even if it's only using the low level functions (like PathIsExe).
They also looked for engineering dependencies - cases where low level component A gets code that's delivered from high level component B. The DLLs and other interfaces might be just fine, but if a low level component gets code from a higher level component, it still has a dependency on that higher level component - it's a build-time dependency instead of a runtime dependency, but it's STILL a dependency.
Now there are times when low level components have to call into higher level components - it happens all the time (Windows Media Player calls into skins which in turn depend on functionality hosted within Windows Media Player). Part of the layering work was to ensure that when this type of violation occurred, it fit into one of a series of recognized "plug-in" patterns - the layering team defined what were "recognized" plug-in design patterns and factored this into their analysis.
The architectural layering team went through the entire Windows product and identified every single instance of a layering violation. They then went to each of the teams in turn, and asked them to resolve their dependencies: either by changing their code (good), or by explaining why their code matches the plug-in pattern (also good), or by explaining the process by which their component will change to remove the dependency (not good, because it means that the dependency is still present). For this release, they weren't able to deal with all the existing problems, but at least they are preventing new ones from being introduced. And, since there's a roadmap for the future, we can rely on the fact that things will get better.
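A rough sketch of that check, assuming every component already has a layer number: a call is a violation when the caller sits below the callee. The layer numbers and the name `find_violations` are invented for this example and are not the real Windows values.

```python
# Hypothetical layer numbers, invented for illustration only.
LAYER = {
    "ntoskrnl.exe": 0,
    "kernel32.dll": 1,
    "user32.dll": 2,
    "Windows Audio service": 3,
    "shell32.dll": 6,  # high up because of its window-manipulating functions
}

# (caller, callee) pairs, as a static analysis of the binaries might report them.
CALLS = [
    ("Windows Audio service", "kernel32.dll"),  # calls down: fine
    ("Windows Audio service", "shell32.dll"),   # calls up: violation
    ("shell32.dll", "user32.dll"),              # calls down: fine
]


def find_violations(layer, calls):
    """Return the calls where a lower layer depends on a higher one."""
    return [(caller, callee) for caller, callee in calls
            if layer[caller] < layer[callee]]


violations = find_violations(LAYER, CALLS)
```

Note that the check works on whole binaries, so it flags the audio service even if the only function it uses is a harmless string helper.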
This was an extraordinarily painful process for most of the teams involved, but it was totally worth the effort. We now have a clear map of which Windows components call into which other Windows components. So if a low level component changes, we can now clearly identify which higher level components might be affected by that change. We finally have the ability to understand how changes ripple throughout the system, and more importantly, we now have mechanisms in place to ensure that no lower level components ever take new dependencies on higher level components (which is how spaghetti software gets introduced).
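As an illustration of the "how do changes ripple" question, here is a small sketch that walks a dependency map backwards to find everything that directly or indirectly depends on a changed component. The function name `affected_by` is hypothetical, and the sample graph is the A/B/C/D example from earlier in the post.

```python
from collections import deque

# The DLL example from above: A references B and C, B references D, D references C.
DEPENDS_ON = {"A": ["B", "C"], "B": ["D"], "D": ["C"], "C": []}


def affected_by(changed, depends_on):
    """Return every component that directly or indirectly depends on `changed`."""
    # Invert the map: component -> components that call into it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk up the inverted map.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for upper in dependents.get(current, ()):
            if upper not in seen:
                seen.add(upper)
                queue.append(upper)
    return seen


impact_of_c = affected_by("C", DEPENDS_ON)
```

In this toy graph a change to C can reach A, B and D, while a change to A reaches nothing else.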
In order to ensure that we never introduce a layering violation that isn't understood, the architectural layering team has defined a "quality gate" that ensures that no new layering violations are introduced into the system (there is a finite set of known layering violations that are allowed for a number of reasons). Chris Jones mentioned "quality gates" in his Channel9 video; essentially they are a series of hurdles that are placed in front of a development team - the team is not allowed to check code into the main Windows branches unless they have met all the quality gates. So by adding the architectural layering quality gate, the architectural layering team is drawing a line in the sand to ensure that no new layering violations ever get added to the system.
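The gate itself can be pictured as a comparison against the list of known, accepted violations. This is only a sketch of the idea, not the real check-in tooling; `layering_gate`, `KNOWN_VIOLATIONS` and the component names are all hypothetical.

```python
# Hypothetical list of (caller, callee) violations that are already understood
# and allowed for this release.
KNOWN_VIOLATIONS = {("legacy_low.dll", "legacy_high.dll")}


def layering_gate(found, known=KNOWN_VIOLATIONS):
    """Return (passed, new_violations) for the violations found in a build."""
    new = sorted(set(found) - set(known))
    return (not new, new)


# A build that still has the known violation and also introduces a new one.
passed, new = layering_gate([
    ("legacy_low.dll", "legacy_high.dll"),
    ("audio.dll", "shell32.dll"),
])
```

Here `passed` is False and `new` holds only the audio.dll entry: the old violation is tolerated, but the new one blocks the check-in until it is fixed or explained.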
So what's this "layer court" thingy I talked about in the title? Well, most of the layering issues can be resolved via email, but for some set of issues, email just doesn't work - you need to get in front of people with a whiteboard so you can draw pretty pictures and explain what's going on. And that's where we were two weeks ago - one of the features I added for Beta2 restored some functionality that was removed in Beta1, but restoring the functionality was flagged as a layering violation. We tried, but were unable to resolve it via email, so we had to go to explain what we were doing and to discuss how we were going to resolve the dependency.
The "good" news (from our point of view) is that we were able to successfully resolve the issue - while we are still in violation, we have a roadmap to ensure that our layering violation will be fixed in the next release of Windows. And we will be fixing it :)
Comments
- Anonymous
August 23, 2005
Larry, thank you for such an in-depth and informative post. It's intriguing to see not just what new features are coming in Windows, but to also see the new processes in place and how they will improve quality.
A question: Is it a layering (or other) violation for a DLL marked as Win32 Console to call into a Win32 GUI DLL? - Anonymous
August 23, 2005
By "the next release of Windows" do you mean "Vista RC1" or "Vista RTM"? - Anonymous
August 23, 2005
Awesome! - Anonymous
August 23, 2005
Personally, I found this post by Larry Osterman on 'layering violations'
to be incredibly interesting. ... - Anonymous
August 23, 2005
Thanks for this post, very interesting.
Just a simple question:
why isn't hal.dll layer 0? Which depends on the other: ntoskrnl.exe or hal.dll? - Anonymous
August 23, 2005
Awesome. I'd like to someday see that process firsthand. - Anonymous
August 23, 2005
Kazi, good question - the simple answer (as I understand it) is that the layering below ntoskrnl.exe isn't NEARLY as interesting as the layering above ntoskrnl.exe. - Anonymous
August 23, 2005
kernel32.dll depends on ntdll.dll, that depends on ntoskrnl.exe, so "Everything that depended only on ntoskrnl.exe (things like win32k.sys or kernel32.dll) was assigned layer 1" is wrong, but still it was a very interesting post.
Kazi: They depend on each other
Ivan. - Anonymous
August 23, 2005
I'm a little slow, could you please give another example of where a layer violation would be allowed?
I get why this is a bad thing... I mean cycles are just messy so splitting everything off into a layer and saying "Only communicate down the tree!" seems perfectly logical... But what I don't get is why this would be allowed, ever...
Of course peers should be allowed to communicate, so driver A can use a function in driver B, ditto for libs and or applications. But I can think of no example where a low level lib would be allowed to call something like the graphic lib... - Anonymous
August 23, 2005
Oh, Kazi almost beat me to it, but not quite.
Since hal.dll is layer -1, that puts ntoskrnl.exe at layer -2, which puts hal.dll at layer -3, etc. The layering below ntoskrnl.exe is INFINITELY interesting. - Anonymous
August 23, 2005
Holy ****. I would love to have that code. There is a project I am working on that I bet has serious layering issues. Even on a small project, it can be a very good thing to keep layering in mind. If your classes aren't layered properly, then there probably are a few non-trivial relationships that are just going to cause endless problems. - Anonymous
August 23, 2005
Hi Larry,
just a dumb question. How does Microsoft "remember" these architectural flaws (or left-out features) that are not being fixed/implemented within the "current" release but are "moved" to the next one? - Anonymous
August 23, 2005
Larry,
Great blog, very informative reading. Larry, are you planning to post more details on the build process for Windows Vista?
I find it very interesting how you guys manage to pull so many pieces together.
Do you have any figures on the compile time for the builds for Windows Vista?
What are the specs for the systems that you are compiling Windows Vista on? - I would expect these machines would have a fair amount of "horsepower" behind them.
- Anonymous
August 23, 2005
thanks for that extremely interesting post!
is there a plan to release these tools to the public at some stage (e.g. as part of the visual studio package)? - Anonymous
August 24, 2005
Larry, when you say next release of Windows, is it Vista or post-Vista? - Anonymous
August 24, 2005
… for Architects
Nick Malik – Enterprise Architecture Agility
Roy Osherove – [Audio Interview] Ingo... - Anonymous
August 24, 2005
The build machines are fairly beefy; the ones we have in my group (we have 2) cost about $2500. - Anonymous
August 24, 2005
Any chance of those tools someday finding their way into fxCop??? :D - Anonymous
August 25, 2005
Kazi: The simple answer is that it doesn't really matter. From the standpoint of anyone else's code in the system, the kernel and HAL are binary stars that power the solar system. There's almost no reason for a driver to take a direct dependency on the HAL, since HAL mechanisms that they might need are exposed or dynamically assigned by kernel routines. Just think of the kernel and HAL as the same thing, from the "layering" perspective. - Anonymous
September 24, 2005
Thank you for the insight, very interesting. - Anonymous
March 15, 2006
Well, this year I didn't miss the anniversary of my first blog post.
I still can't quite believe it's... - Anonymous
November 05, 2007
After I posted my article on the SHAutoComplete , I mentioned it to one of my co-workers. His response