Architectural thinking

In my last blog post I hypothesised that architectural analysis is slightly different from developer analysis and so needs a subtly different skill set and way of thinking. To demonstrate what I mean, let me describe a real-life example of an architectural problem and the different solutions to it.

I was called up late one Friday afternoon (why is it always on a Friday that these things happen?) by a distraught business manager whose biggest customer had a problem (I won’t say who it was, but they are a household name in the UK). They had decided to provide a new product over the telephone and so had built a customer / order processing system for a maximum of 200 telesales operatives using Microsoft products. They were going live on the Monday when the new product launched (and that wasn’t going to be easy to stop!). They had been in stress test for 3 weeks, and when they took the load up to 50 users the system crashed. They had tried to fix the problem and couldn’t, so it must be Microsoft’s fault; after all, they had read in the press that Windows didn’t scale and here was proof! The business manager wanted me at their office (a 3 hour drive) asap, not so much to fix the problem as to show that we were doing something. It seems to be a common belief that putting technical people in cars or on trains is a valuable use of their time, which I vigorously dispute; I feel that most problems can be solved more quickly over the phone. There was a short discussion about efficient problem solving techniques, he spoke to my manager and I was in the car. Why is it my life is so like a Dilbert cartoon?

When I got to the customer a disaster scene met my eyes: paperwork everywhere, empty coffee cups, red-eyed technical people, irascible managers, phones ringing, you know the sort of thing. The technical people just wanted me to say it was our technology’s fault so they could go home. The managers wanted to nail me to a whiteboard and take turns with the whip. Fun all around; however, as I am not into S&M, I insisted on looking at the application first (so maybe I am, just not that sort!).

It was a simple 3-tier app: a smart client, a business tier doing some business processing and a database with some simple stored procedures isolating the data access; nice and simple. There was, however, one strange thing: they had a second server running a piece of the business logic alongside the main business server. I asked why this was, and it transpired that they had profiled the application (nice but unusual in my experience) and found that one piece of code, which did some simple customer validation and generated a GUID, was taking about 30% of the CPU. They were concerned that it would become a bottleneck, so they had come up with the idea that, as the application was very well modularised, they could put that code on a second server and so distribute the load. They knew all about scale out.

 

The problem was that when the load got to 50 users the network stack on the server overflowed and so the system crashed. They had been on to product support and got patches to increase the network stack size (something I didn’t even know you could do!) but of course that didn’t fix the problem. Because it seemed to be something in the network layer they had spent ages on network tuning, putting in faster Ethernet, hubs and so on. They were now convinced that it was an OS problem and that Windows wasn’t scalable, so why didn’t I admit it and let the blame fall on MS?

This is not a great career move at Microsoft, and anyway I thought I knew what the problem was. I suggested a quick rebuild of the application with a simple change and then a retest whilst I went and moved the car (I had left it on double yellows). By the time I got back they had made the modifications, rerun the stress test and were able to meet the 200 user criterion easily (which either shows how productive our platform is or how difficult it is to find a parking place in the UK!). Congratulations all round: techies treating me like a guru, senior managers fetching me coffee and a much relieved business manager who carried my bag to the car. Sometimes I love this job!

So four questions:

1 What was causing the problem?

2 What was the fix?

3 How should it have been architected for scalability in the first place?

4 Why do I hate marketing messages?

Answers in the feedback

Comments

  • Anonymous
    February 06, 2004
    1 The GUID code on the separate server was hardly doing anything, yet it was being called every time any customer access happened. It took a lot longer to get through the network stack than it did to run the code, so the network became the bottleneck.
    2 Put the GUID code back on the business server.
    3 The GUID code should have been put in the client.
    4 The message "Just distribute the code and that gives you scalability using scale out" is not very smart.
  • Anonymous
    February 06, 2004
    I don't envy your job Michael.. ;)
  • Anonymous
    February 06, 2004
    One thing I don't understand: in your post you said "I suggested a quick rebuild of the application with a simple change and then a retest...", but then you said that the solution was to move the GUID code back to the business server. I guess the GUID code was in a DLL, and that DLL could have been moved to the business server without recompiling, couldn't it?

    Why the need to recompile?

  • Anonymous
    February 06, 2004
    Nice post. I've been in similar situations myself, though perhaps not as much of a do-or-die situation.
  • Anonymous
    February 06, 2004
    Do you think that their profiling was "wrong", since you said the "GUID code on the separate server was hardly doing anything"?
    This is a very good post Michael. We can learn a lot from experiences like this. Keep blogging...
  • Anonymous
    February 06, 2004
    The profiling was correct; their solution was wrong.

    The GUID was a UUID, so unique. The clients were all terminals for internal telesales reps, so inside the firewall and locked down.
  • Anonymous
    February 06, 2004
    A little knowledge is a dangerous thing.

    The problem is that the world is littered with a little knowledge all over the place.
  • Anonymous
    February 11, 2004
    But always remember that in some areas you will be the one with a little knowledge :).
  • Anonymous
    February 13, 2004
    Supposedly they moved the GUID code OFF the main server in the beginning... so they moved it back to the main server and that fixed it?

    What's going on here?
  • Anonymous
    February 16, 2004
    I wasn't quite clear when I wrote this: it was 30% of the CPU used by the application, not of the total. During the profiling test the app itself only accounted for about 30% of the total CPU, the rest being system idle.
    Actually, even though this was a simple app, I think there must have been problems in the UUID code for it to reach this level of utilisation.
  • Anonymous
    February 29, 2004
    The thing I find, in my experience, is that if you had offered that suggestion at the beginning of the project they would have argued and argued that it wouldn't work. Only when customers get in a real bind do they really start to listen, because only at that point do they truly understand that they don't have all the right answers.
  • Anonymous
    June 14, 2004
    My question is, why didn't the business manager make sure an MCS person was engaged at the beginning of the project to design the architecture correctly, so that the customer would be successful without resorting to heroics?

    Though it is nice to be a hero.
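
For readers who like to see the arithmetic behind the first comment's answer, here is a rough sketch in Python. It is not the customer's code: the function `new_customer_id`, the `CallCost` model and every number in it (0.05 ms of work, a 1 ms round trip, two customer accesses per user per second) are assumptions chosen only to illustrate why shipping a tiny piece of work to another server can cost far more than it saves.

```python
import uuid
from dataclasses import dataclass


def new_customer_id(name: str) -> str:
    """Hypothetical stand-in for the cheap 'validate and generate a GUID' step."""
    if not name.strip():
        raise ValueError("customer name is empty")
    return str(uuid.uuid4())


@dataclass(frozen=True)
class CallCost:
    """Simple cost model for one call; all figures are illustrative assumptions."""
    work_ms: float        # time spent in the function itself
    round_trip_ms: float  # network and marshalling overhead (0 when in-process)

    @property
    def total_ms(self) -> float:
        return self.work_ms + self.round_trip_ms

    @property
    def overhead_fraction(self) -> float:
        # Share of the total call time that is overhead rather than real work.
        return self.round_trip_ms / self.total_ms


def remote_calls_per_second(users: int, accesses_per_user_per_second: float) -> float:
    """Every customer access triggers one call across the network."""
    return users * accesses_per_user_per_second


# Assumed figures: the work is tiny, the round trip is not.
IN_PROCESS = CallCost(work_ms=0.05, round_trip_ms=0.0)
REMOTE = CallCost(work_ms=0.05, round_trip_ms=1.0)

if __name__ == "__main__":
    print("example id:", new_customer_id("A. Customer"))
    print(f"in-process overhead: {IN_PROCESS.overhead_fraction:.0%}")
    print(f"remote overhead:     {REMOTE.overhead_fraction:.0%}")
    print("remote calls/s at 200 users:", remote_calls_per_second(200, 2))
```

Under these assumed numbers almost all of the time for a remote call goes on the network rather than on the work, and the number of network calls grows in step with the number of users. Moving the code back in-process (or into the client) removes that overhead entirely, which is the point of answers 2 and 3 above.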