Windows Azure IaaS Host OS Update Demystified

11/26/2013

Special thanks to Sri Harsha for reviewing this post!!

In this post we will look at:

Why does Windows Azure update the host OS?
How does the host OS update take place?
What are availability sets?
How does creating availability sets make your application highly available?
Resources to help you create and manage highly available applications in Windows Azure Virtual Machines
More Information

Why does Windows Azure update the host OS?

Windows Azure deploys updates to the host OS approximately once per month. This ensures that Windows Azure provides a reliable, efficient and secure platform for hosting your applications.

How does the host OS update take place?

The host OS update on the Windows Azure platform differs from how you update PCs or servers running Windows. In Windows Azure, a new image that contains all the latest updates and fixes is deployed to the servers, and the Fabric Controller instructs those servers to restart and boot from the newly deployed image. So unlike a Windows update, which can take a considerable amount of time to complete, the Windows Azure host OS update only involves booting from a new image. Typically, this host OS update process takes 15 to 20 minutes to complete.

What are availability sets?

When you have two or more VMs that perform the same task (for example, two or more web servers), you create an availability set containing those VMs. Creating this availability set makes your application highly available and also makes you eligible for the 99.95% uptime SLA.
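To put an uptime SLA in concrete terms, the percentage can be converted into a monthly downtime budget. The sketch below is plain arithmetic under an assumed 30-day month; the function name is hypothetical, and it does not reflect how Windows Azure actually measures or credits SLA violations.

```python
def monthly_downtime_budget_minutes(uptime_percent, days_in_month=30):
    """Return the minutes per month a service may be down under an uptime SLA.

    Assumption: a fixed-length month (30 days by default).
    """
    total_minutes = days_in_month * 24 * 60
    return round(total_minutes * (100 - uptime_percent) / 100, 1)


# Compare a few illustrative SLA levels.
for sla in (99.9, 99.95):
    budget = monthly_downtime_budget_minutes(sla)
    print(f"{sla}% uptime allows about {budget} minutes of downtime per month")
```

Comparing such a budget with the length of a single host reboot shows why a lone VM cannot absorb scheduled maintenance and why a second VM in the same availability set is needed.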

How does creating availability sets make your application highly available?

When you create an availability set, you are telling the Fabric Controller that all the VMs in the availability set perform the same function and must not be taken down at the same time for scheduled maintenance.

Behind the scenes, the Fabric Controller (FC) places these VMs in different update domains (UDs). UDs are logical groupings that help the FC ensure that the VMs in the same availability set (AS) are not all taken down at the same time during scheduled maintenance. This ensures that some VMs are always available to process requests.
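The idea of update domains can be illustrated with a small simulation. This is only a sketch: the round-robin placement, the function names, and the update domain count are assumptions made for illustration, not the Fabric Controller's actual placement logic.

```python
from collections import defaultdict


def assign_update_domains(vm_names, ud_count):
    """Spread VMs across update domains round-robin (an assumed policy)."""
    placement = defaultdict(list)
    for index, name in enumerate(vm_names):
        placement[index % ud_count].append(name)
    return dict(placement)


def rolling_update(placement):
    """Yield (ud, vms_down, vms_up) for each step, one update domain at a time."""
    all_vms = [vm for vms in placement.values() for vm in vms]
    for ud in sorted(placement):
        down = placement[ud]
        up = [vm for vm in all_vms if vm not in down]
        yield ud, down, up


# Hypothetical availability set with three web servers and two update domains.
placement = assign_update_domains(["web1", "web2", "web3"], ud_count=2)
steps = list(rolling_update(placement))
for ud, down, up in steps:
    print(f"UD {ud}: restarting {down}, still serving {up}")
```

Because only one update domain is restarted per step, at least one VM keeps serving requests as long as the VMs span two or more update domains.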

Note:

Test and monitor to make sure that a reduced number of VMs running the workload still provides sufficient performance, so that your service is not negatively impacted during planned maintenance while one or more VMs are unavailable.
If you use an endpoint to allow incoming traffic from the outside world, ensure that it is load balanced. (See “Creating Highly Available Workloads with Windows Azure” below.)
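The first point in the note comes down to a capacity check: can the VMs that remain online carry peak load while others are restarting? A minimal sketch follows, assuming you have measured a peak request rate and a per-VM throughput yourself; the function name, parameters, and numbers are all hypothetical.

```python
import math


def capacity_during_maintenance(total_vms, vms_down, peak_rps, rps_per_vm):
    """Return (remaining_vms, required_vms, is_sufficient) for a maintenance window."""
    remaining = total_vms - vms_down
    required = math.ceil(peak_rps / rps_per_vm)
    return remaining, required, remaining >= required


# Hypothetical figures: 900 requests/s at peak, 400 requests/s per VM.
for total in (3, 4):
    remaining, required, ok = capacity_during_maintenance(
        total_vms=total, vms_down=1, peak_rps=900, rps_per_vm=400
    )
    status = "sufficient" if ok else "NOT sufficient"
    print(f"{total} VMs, 1 down: {remaining} remaining, {required} required, {status}")
```

A calculation like this is only a starting point. Load testing with a VM deliberately taken out of rotation is the more reliable check.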

Resources to help you create and manage highly available applications in Windows Azure Virtual Machines

More information

Windows Azure Host OS Updates: Why, When and How: https://blogs.technet.com/b/markrussinovich/archive/2012/08/22/3515679.aspx

Role Instance Restarts Due to OS Upgrades: https://blogs.msdn.com/b/kwill/archive/2012/09/19/role-instance-restarts-due-to-os-upgrades.aspx

Comments

Anonymous
November 26, 2013
Very nicely put!!!!
Anonymous
November 28, 2013
Are you talking about IaaS or PaaS? The outage is on IaaS isn't it? But your description is about PaaS. You don't replace the images that IaaS Virtual Machines are built using, so I assume that isn't what the outage is about.
Anonymous
December 16, 2013
Added: How do you confirm that your VM was restarted by Windows Azure?
Anonymous
December 17, 2013
Darren, this article is about IaaS. The image replacement being referred to is for the Host OS, not the Guest OS.
Anonymous
January 13, 2014
Thanks, Kevin. I obviously wasn't paying attention :)
Anonymous
January 24, 2014
Added For More information
Anonymous
January 30, 2014
Are you aware that competitors like AWS rarely reboot the VMs ? Some of my AWS instances are up since more than 4 years as of writing ! Some popular clustered servers, like HBase or RabbitMQ, would not behave correctly when a node is restarted: it would lead to downtime of a part of the data and badly load balanced cluster (because of table migration phase)
Anonymous
February 03, 2014
This doesn't work when you have a SQL Server on an Azure VM (SQL Azure is feature limited and can't be used for many folks needs).
Anonymous
August 19, 2014
Very helpful, You realize that an availability set will double the cost? Especially with MSSQL it will make things very complex.
Anonymous
September 04, 2014
The issue we have is that even if the machines are in an availability set, it doesn't mean that Azure ensures the services and applications are started before rebooting another server. If your application needs 30 minutes to start, you can have downtime. And this is not acceptable for mission-critical applications like e-commerce servers. What MS should do is deploy the current VM on other physical servers and reboot servers with no running applications.
Anonymous
February 24, 2015
Are you aware that if you have worker roles that do calculations and input results in the DB, you cannot have them both running in parallel, as you will get the results inserted in the DB twice when there is no host update going on (i.e. when one of the instances doesn't get forced to stop)? What's more, it doesn't really solve anything for calculations that last 2-4 hours, because they can both be interrupted at different times. These updates would be acceptable if we could schedule them for the weekend when no calculations are running, but that is not possible either. I am already evaluating moving our 40 instances over to AWS for this reason alone.
