So what happened?

Now that we have had time to dig into what caused the temporary problems last weekend, I want to share what we have found so far. We still have more work to do, but we now understand what occurred and why. We've also implemented several changes to prevent this kind of thing from happening again.

What exactly happened?

Two key things happened. First, activations and validations were both affected when pre-production code was accidentally sent to production servers. Second, while the issue affecting activations was fixed in less than thirty minutes (by rolling back the changes), the effect of the pre-production code on our validation service continued after the rollback took place.

How did this happen in the first place?

Nothing more than human error started it all. Pre-production code was sent to production servers. The production servers had not yet been upgraded with a recent change to enable stronger encryption/decryption of product keys during the activation and validation processes. As a result, the production servers declined activation and validation requests that should have passed.

Why did it take so long to fix?

While the response to the activation issue was quick (less than thirty minutes), the effect on our validation service continued even after the rollback took place. We expected the rollback to fix both issues at the same time, but we now realize that we didn't have the right monitoring in place to be sure the fixes had the intended effect.
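
To illustrate the kind of check that was missing, here is a minimal sketch of a post-rollback probe. It is not our actual monitoring code: the function name, the probe callables, and the idea of a known-good test key are all assumptions made for the example. Each service is sent a request that should pass, and any service that still declines it is reported.

```python
def services_still_failing(probes, known_good_key):
    """Return the names of services that reject a key known to be genuine.

    `probes` maps a service name to a callable that takes a key and
    returns True when the service accepts it (hypothetical interface).
    """
    failing = []
    for name, check in probes.items():
        try:
            accepted = check(known_good_key)
        except Exception:
            # A probe that cannot complete leaves the service unverified,
            # so it is reported as well.
            accepted = False
        if not accepted:
            failing.append(name)
    return failing


# Example: activation recovered after the rollback, validation did not.
example_probes = {
    "activation": lambda key: True,
    "validation": lambda key: False,
}
still_failing = services_still_failing(example_probes, "TEST-KEY")
```

Running a probe like this for every affected service right after a rollback would show that one of them was still declining good requests, rather than assuming a single fix covered both.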

If the servers are down, why don't you just assume the systems are genuine?

We do. It's important to clarify that this event was not an outage. Our system is designed to default to genuine if the service is disrupted or unavailable. In other words, we designed WGA to give the benefit of the doubt to our customers.  If our servers are down, your system will pass validation every time. This event was not the same as an outage because in this case the trusted source of validations itself responded incorrectly.
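
The difference between an outage and an incorrect response can be sketched in a few lines. This is only an illustration of the "default to genuine" idea, not the WGA implementation: the `Verdict` type, the `ServiceUnavailable` exception, and the service callables are hypothetical names chosen for the example.

```python
from enum import Enum


class Verdict(Enum):
    GENUINE = "genuine"
    NOT_GENUINE = "not genuine"


class ServiceUnavailable(Exception):
    """Raised when the validation service cannot be reached (hypothetical)."""


def validate(product_key, query_service):
    """Return a verdict, defaulting to genuine when the service is down."""
    try:
        answer = query_service(product_key)
    except ServiceUnavailable:
        # Outage: give the customer the benefit of the doubt.
        return Verdict.GENUINE
    # The service answered, so its answer is trusted as-is. A wrong answer
    # from a reachable service is not caught by the outage path above.
    return Verdict.GENUINE if answer else Verdict.NOT_GENUINE


def down_service(product_key):
    # Simulates an outage.
    raise ServiceUnavailable()


def misconfigured_service(product_key):
    # Simulates a reachable service that wrongly declines a good key.
    return False


outage_verdict = validate("SAMPLE-KEY", down_service)
wrong_answer_verdict = validate("SAMPLE-KEY", misconfigured_service)
```

In this sketch the outage yields `Verdict.GENUINE`, while the misconfigured service yields `Verdict.NOT_GENUINE`, because nothing on the client side can tell that a confident answer from the trusted source is wrong.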

What changes have you made?

We have implemented several changes to address the specific issues that took place over the weekend. For example, we are improving our monitoring capabilities to alert us much sooner should anything like this happen again. We're also working through a list of additional changes, such as increasing the speed of escalations and adding checkpoints before changes can be made to production servers.

Why were some customers told that this problem might continue for days?

As I mentioned in my post yesterday, we erroneously said the servers might be down until Tuesday, when in fact they had already been fixed as of late Saturday morning Pacific Time. We're reviewing our procedures on that score as well; communicating clearly and accurately is super important when things like this happen.

What were customers experiencing?

For the customers who failed validation from Friday afternoon through Saturday morning, the experience was that features we refer to as 'genuine-only' features were disabled. These features are Windows Aero, Windows ReadyBoost, Windows Defender (in this state Defender will scan for and identify all the threats it ordinarily would, but will only clean ones marked 'severe') and Windows Update (in this state only 'optional' updates are unavailable; all others can still be downloaded, including security updates). Also, a message appears in the lower right-hand corner of the desktop. The message reads 'This copy of Windows is not genuine' and persists until a successful validation is performed.

The form of validation failure experienced by those affected on late Friday and early Saturday DID NOT result in the beginning of the 30-day grace period during which activation is required. Nor was there any 3-day period during which a customer was required to do anything related to this issue. Disabling the genuine-only features is meant to provide notice to the customer of the state of the system. When disabled, the features present their own error messages relating to the system not being genuine. It's unfortunate this happened to users with genuine systems.

I also want everyone to know that I am personally very disappointed that this event occurred. As an organization we've come a long way since this program began and it's difficult knowing that this event confused, inconvenienced, and upset our customers. 

As always, please send your feedback to me through the blog (you can use the email link in the upper left hand corner of the page) or post comments.

Thanks,

Alex

Comments

  • Anonymous
    August 28, 2007
    Thanks for 'coming clean' about the problem; I know that takes a lot of guts to do, even though once it's done, it seemed so simple. I highly encourage more transparency in both the WGA and Activations efforts at Microsoft. I'm pretty sure it will do you guys a lot of good in the long run. There are still many people with grave concerns as I'm sure you know! But, again, a hearty slap on the back for honestly admitting the slip-up.

  • Anonymous
    August 28, 2007
    It was hard to miss the news about the WGA (Windows Genuine Advantage (?)) outage Microsoft had this weekend. Just in case you managed it somehow, you might want to catch up on it in this Computerworld article. Microsoft’s Windows Genuine Advanta..

  • Anonymous
    August 29, 2007
    It's time to end this nonsense known as WGA.  Enough is enough.  

  • Anonymous
    August 29, 2007
    I commend you for figuring this out and posting the postmortem. That kind of transparency is hard to do, but ultimately it is the right thing to do. Nonetheless, a legitimate user should never suddenly be denied access to "genuine-only" features because of human, server, or network error in Redmond. Until you can verify that will be the case, there is no advantage in WGA, and Microsoft is just blatantly treading on the backs of its legitimate customers.

  • Anonymous
    August 29, 2007
    I find it interesting that this was a check-in problem. With all the work that went into check-ins with Vista, it seems inconceivable that code can apparently be handled so haphazardly. The effects of this are damaging and widespread. Hopefully some heads are rolling, because it may take years to gain back what has been lost.

  • Anonymous
    August 29, 2007
    Windows Server Division WebLog : Windows Server 2008 Timing Update: http://blogs.technet.com/windowsserver

  • Anonymous
    August 29, 2007
    The users being hit by this type of failure are the users with "install updates automatically" set to on, especially since WGA presumes any problem is illegal until proven otherwise with no grace period, and, from the user side, the danger seems to be escalating. There are very few days when most users can afford to wait for inevitable mistakes such as these to be addressed. At the very least, implement a grace period. Beverly Howard [MS MVP-Mobile Devices]

  • Anonymous
    August 29, 2007
    In a day of ironies... Windows XP sp3 is announced.... Windows Vista SP1 and Windows XP SP3 Announcement

  • Anonymous
    August 30, 2007
    Brings back an old saying from when I first entered the IT field several years ago, "To err is human

  • Anonymous
    August 30, 2007
    ...thats why Microsoft Products shouldn't be used. One big company has the power over YOUR

  • Anonymous
    August 30, 2007
    First of all, thank you for being so honest! But let me tell you what I was experiencing last Saturday from morning to noon (I'm living in Germany). I was planning to do some urgent bugfixes to one of my applications. But first, I wanted to install the Vista Explorer bugfix... When I realized that my Vista Enterprise Edition wasn't valid anymore, I started blaming my son and my daughter's friend, thinking that one of them must have stolen my MSDN Volume License Key. I was absolutely sure that this must have been the reason my legal Vista installation was no longer valid. They are both still angry with me because I made that unfair accusation against them. Then I spent two hours on the phone and on the internet, desperately trying to fix the mistake. None of the people I spoke to knew anything about the problem. They even questioned the product key. As a serious software consultant I cannot risk being accused of using illegal software. Also, debugging and testing on a somehow "locked" system doesn't make much sense, because you can't say whether a certain bug is related to your software or to the locked-down OS. As a developer, I have been working with Microsoft for more than 20 years. But this incident increases my growing doubts about whether Windows is really the platform I should focus on in the future.

  • Anonymous
    August 30, 2007
    Further investigation by our WGA team has brought to light more information on the WGA validation issue

  • Anonymous
    August 30, 2007
    What you need to do is have it so that when one server fails to detect genuine Windows, it automatically checks another server. Only when both server checks fail should it mark the system non-genuine.

  • Anonymous
    August 31, 2007
    Thanks Alex for explaining about the issue. I am really happy that things are under control :)

  • Anonymous
    September 05, 2007
    So if the system is 'designed to default to genuine', then why not simulate an outage by disabling connectivity while the rollback was occurring? This would allow your customers to continue without the erroneous validations. This would have been a non-story. Seems so obvious, I'm sure it was considered (??)

  • Anonymous
    October 24, 2007
    I find this an unfortunate and occasional MSFT ostrich approach, Alex. There are still WGA problems, and if someone is a customer of Office 2007, when they call your Canadian contractors Convergys of Ohio to resolve one, they are told that they have support that would start if they invoked it for 90 days and then the "support" is gone. This makes no sense whatsoever, which is why I worked early on to get myself in a position where I wouldn't need the support, but it is hardly fair to MSFT customers. They shouldn't have their support clock running in Office because of WGA team failures. I also tried to post this message 12 hours ago and apparently you're censoring it out, which is also unfortunate. Pretending problems aren't there with WGA is not helping MSFT, Alex. Ed Bott has tracked them as well and I'd say he's also a reliable source: http://blogs.zdnet.com/Bott/?p=142
