Anatomy of a software bug, part 1 - the NT browser

No, I don't mean that the NT browser's a software bug...

Actually Raymond's post this morning about the network neighborhood got me thinking about the NT browser and its design.  I've written about the NT browser before here, but never wrote up how the silly thing worked.  While reminiscing, I remembered a memorable bug I fixed back in the early 1990's that's worth writing up because it's a great example of how strange behaviors and subtle issues can appear in peer-to-peer distributed systems (and why they're so hard to get right).

Btw, the current design of the network neighborhood is rather different from this one - I'm describing code and architecture designed for systems 12 years ago.  There have been a huge number of improvements to the system since then, and some massive architectural redesigns.  In particular, the "computer browser" service upon which all this depends is disabled in Windows XP SP2 due to attack surface reduction.  In current versions of Windows, Explorer uses a different mechanism to view the network neighborhood (at least on my machine at work).

 

The actual original design of the NT browser came from Windows for Workgroups.  Windows for Workgroups was a peer-to-peer networking solution for Windows 3.1 (and continued to be the basis of the networking code in Windows 95).  As such, all machines in a workgroup needed to be visible to all the other machines in the workgroup.  In addition, since you might have different workgroups on your LAN, it needed to be able to enumerate all the workgroups on the LAN.

One critical aspect of WfW is that it was designed for LAN environments - it was primarily based on NetBEUI, which was a LAN protocol designed by IBM back in the 1980's.  LAN protocols typically scale quite nicely to several hundred computers, after which they start to fall apart (due to collisions, etc).  For larger networks, you need a routable protocol like IPX or TCP, but at the time, it wasn't that big a deal (we're talking about 1991 here - way before the WWW existed).

As I mentioned, WfW was a peer-to-peer product.  As such, everything about WfW had to be auto-configuring.  For Lan Manager, it was ok to designate a single machine in your domain to be the "domain controller" and others as "backup domain controllers", but for WfW, all that had to be automatic.

To achieve this, the guys who designed the protocol for the WfW browser decided on a three tier design.  Most of the machines in the workgroup would be "potential browser servers".  Some of the machines in the workgroup would be declared as "browser servers", and one of the machines in the workgroup was the "master browser server".

Clients periodically (every three minutes) sent a datagram to the master browser server, and the master browser would record this in its server list.  If the master browser hadn't heard from a client for three announcements, it assumed that the client had been turned off and removed it from the list.  Backup browser servers would periodically (every 15 minutes) retrieve the browser list from the master browser.
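The aging rule above can be sketched in a few lines of Python. This is only an illustration: the names (ServerList, announce, expire) are invented for the example, timestamps are plain seconds, and the real browser was nothing like this internally. The only things taken from the description are the three minute announcement interval and the "three missed announcements" rule.

```python
ANNOUNCE_INTERVAL = 3 * 60   # seconds between client announcements (from the text)
MISSED_LIMIT = 3             # missed announcements before a client is dropped


class ServerList:
    """Hypothetical master-browser list, keyed by machine name."""

    def __init__(self):
        self._last_seen = {}

    def announce(self, name, now):
        # Record (or refresh) the time we last heard from this machine.
        self._last_seen[name] = now

    def expire(self, now):
        # Drop every machine that has been silent for more than
        # MISSED_LIMIT announcement intervals; return the dropped names.
        cutoff = MISSED_LIMIT * ANNOUNCE_INTERVAL
        stale = [n for n, t in self._last_seen.items() if now - t > cutoff]
        for n in stale:
            del self._last_seen[n]
        return stale

    def names(self):
        return sorted(self._last_seen)
```

A backup browser in this sketch would simply copy names() from the master every 15 minutes.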

When a client wanted to browse the network, the client sent a broadcast datagram to the workgroup asking who the browser servers were on the workgroup.  One of the backup or master browser servers would respond within several seconds (randomly).  The client would then ask that browser server for its list of machines, and would display that to the user.

If none of the browser servers responded, then the client would force an "election".  When the potential browser servers received the election datagram, they each broadcast a "vote" datagram that described their "worth".  If they saw a datagram from another server that had more "worth" than they did, they silently dropped out of the election.

A server's "worth" was based on a lot of factors - the system's uptime, the version of the software running, and its current role as a browser (backup browsers were better than potential browsers, master browsers were better than backup browsers).
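Here's a rough Python sketch of that election, under some loud assumptions: the post doesn't say how the factors were weighted, so the ordering used here (role first, then software version, then uptime, with the machine name as a tie breaker) is a guess made purely for illustration, and Candidate, Role and run_election are made-up names. Real votes were broadcast datagrams; here the "broadcast" is just a loop over an in-memory list.

```python
from dataclasses import dataclass
from enum import IntEnum


class Role(IntEnum):
    POTENTIAL = 0
    BACKUP = 1
    MASTER = 2


@dataclass(frozen=True)
class Candidate:
    name: str
    role: Role
    version: int
    uptime_s: int

    def worth(self):
        # Assumed ordering of factors; the name only breaks exact ties.
        return (self.role, self.version, self.uptime_s, self.name)


def run_election(candidates):
    """Return the winner of an election among a non-empty list of candidates.

    Every candidate "sees" every other vote; a candidate that sees a vote
    with more worth than its own silently drops out.
    """
    still_running = set(candidates)
    for voter in candidates:
        for other in candidates:
            if other.worth() > voter.worth():
                still_running.discard(voter)
                break
    (winner,) = still_running   # exactly one candidate is never outvoted
    return winner
```

For example, with a potential browser running newer software and two backup browsers, the backup with the longer uptime wins, because role is compared before version in this sketch.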

Once the master browser was elected, it nominated some number of potential browser servers to be backup browsers.

This scheme worked pretty well - browsers tended to be stable, and the system was self-healing.

Now once we started deploying the browser in NT, we started running into problems that caused us to make some important design changes.  The biggest one related to performance.  It turns out that in a corporate environment, peer-to-peer browsing is a REALLY bad idea.  There's no way of knowing what's going on on another person's machine, and if the machine is really busy (like if it's running NT stress tests), it impacts the browsing behavior for everyone in the domain.  Since NT had the concept of domains (and designated domain controllers), we modified the election algorithm to ensure that NT server machines were "more worthy" than NT workstation machines, which solved that particular problem neatly.  We also biased the election algorithm towards NT machines in general, on the theory that NT machines were more likely to be reliable than WfW machines.
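One way to picture the bias is to put the kind of machine in front of everything else when comparing worth. The Python sketch below does exactly that. It is an assumption for illustration only: OsKind, biased_worth and pick_master are invented names, base_worth stands in for whatever the other factors added up to, and the real algorithm may well have mixed the factors differently.

```python
from enum import IntEnum


class OsKind(IntEnum):
    WFW = 0              # Windows for Workgroups
    NT_WORKSTATION = 1
    NT_SERVER = 2


def biased_worth(os_kind, base_worth):
    # Assumed rule: the OS kind dominates; base_worth (uptime, role, and so
    # on) only matters between machines of the same kind.
    return (os_kind, base_worth)


def pick_master(machines):
    """machines: iterable of (name, os_kind, base_worth) tuples."""
    return max(machines, key=lambda m: biased_worth(m[1], m[2]))[0]
```

With this ordering, a freshly booted NT server beats a WfW machine that has been up for weeks, which is the effect described above.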

There were a LOT of other details about the NT browser that I've forgotten, but that's a really brief overview, and it's enough to understand the bug.  Btw, I'm the person who coined the term "Bowser" (as in "bowser.sys") during a design review meeting with my boss (who described it as a dog) :)

Btw, Anonymous Coward's comment on Raymond's blog is remarkably accurate, and states many of the design criteria and benefits of the architecture quite nicely.  I don't know who AC is (my first guess didn't pan out), but I suspect that person has worked with this particular piece of code :)

Comments

  • Anonymous
    January 11, 2005
    Silly just-barely-relevant anecdote:

    In the early 90's, while I was in the NetUI group, I wrote the original version of the "Services" applet. One day, I had a low-priority bug filed against me for a "typo" in the UI. Someone was confused about the name "bowser".

    I resolved the bug as "by design". The subsequent conversation went something like this:

    Tester: How can a typo be "by design"?
    Me: I just get the list of services from the Service Controller and display 'em.
    Tester: But, why "bowser"?
    Me: Go ask LarryO.
  • Anonymous
    January 11, 2005
    Keith, what part of the services applet displayed device driver names? The browser service was always "Computer Browser"

    And they must have had a field day with the "Ancillary Function Driver" driver.
  • Anonymous
    January 11, 2005
    What I would really like to know is what the XP SP2-firewall does to the browser and the networking:

    Will it let broadcasts in? Will it get any answer from the master browser?

    And how does this work on SP2? ("In current versions of Windows, Explorer uses a different mechanism to view the network neighborhood (at least on my machine at work) ")


    Is it true that an XP computer in an Active Directory domain will show its network neighbourhood with LDAP queries?
  • Anonymous
    January 11, 2005
    Christian:
    XP SP2 disables the browser. Nobody's listening, so no broadcasts get let in.

    For XP SP2, I don't know how it works. Maybe someone who reads this does, but I don't (I haven't worked on the browser for over ten years now, and I know that lots of stuff has changed). For the domain case, it may very well ask the DC via ADSI, I just don't know.

  • Anonymous
    January 11, 2005
    Whoops, my bad. There was also the "Drivers" applet. The "Bowser" (driver) vs "Browser" (service) confused the tester. And me too, apparently :)
  • Anonymous
    January 11, 2005
    I have Service Pack 2 installed, and the "Computer Browser" service is started and set to automatic.
  • Anonymous
    January 11, 2005
    G. Man: Don't know - on two of my three machines, it's disabled on the 3rd it's not.

    It may have to do with domain policies here at MS.
  • Anonymous
    January 11, 2005
    'I have Service Pack 2 installed, and the "Computer Browser" service is started and set to automatic.'

    Ditto, on 3 machines. I have no DC at home (and thus no policies that may have turned it back on) and all three installs were done clean from a slipstreamed XP2 disk.
  • Anonymous
    January 11, 2005
    "...it's disabled on the 3rd it's not."

    Larry: I thought I read somewhere that it depends on whether or not shares exist when sp2 is installed.
  • Anonymous
    January 11, 2005
    Baby Boy: Could be. As I said, I'm not sure what's happened with the browser these days.
  • Anonymous
    January 11, 2005
    I think that in corporate environments it is common to disable the computer browser service on clients by policy, in order to avoid broadcast storms.

    On my home machines, I've occasionally but rarely experimented with having a domain controller, and occasionally and less rarely depend on the browser service working. I've seen 3-minute delays more frequently than 15-second delays.

    Last weekend I experimented with a slightly different workgroup, one between a physical PC running VPC 2004 and a virtual host using a loopback (virtual) adapter. They can mount each other's shares by assigning a drive letter but don't see each other in the network neighbourhood. Haven't experimented enough yet to know if the bug is in me or somewhere else.
  • Anonymous
    January 11, 2005
    Gawd, that's a blast from the past. I remember manually configuring NT workstations to NOT even participate in elections.
  • Anonymous
    January 12, 2005
    If my memory serves me, when we were writing MS-NET 2.0 in 1985, long before NetBeui and a little before Windows, we wrote a multi-tasking MS-DOS server. Yes, there was a multi-tasking MS-DOS release. Rather than the WfW model of tiered servers, we obtained network directory information by datagram query. I don't think we cached that information anywhere. You ask who's around, everyone hollers back, you throw away the results. It was cheap and moderately fast on small LANs. I think you took over that code, didn't you, Larry?
  • Anonymous
    January 12, 2005
    Yup, 100% right David. Actually I didn't take over the server, it morphed into the Lan Manager server, but...

    Oh, and MS-NET and friends were all based on NetBEUI (and used NetBIOS as their API level).

    Actually in Lanman, every server broadcast their data, and every client listened. It only scaled for domains of about 30-40 servers, which is why WfW went to the client/server model.
  • Anonymous
    January 12, 2005
    Btw, David - How're you doing these days? Haven't heard from you in a while :)
  • Anonymous
    October 16, 2007
    Also known as "Larry mounts a DDOS attack against every single machine running Windows NT" Or: No stupid