From Datacenters to Digital Plants - Updated
Hi,
As you may have read in some of my previous posts, I spent the first 7 months of 2010 contributing to the SharePoint Online Standard 2010 infrastructure. This infrastructure will provide SharePoint 2010 based services to tens of millions of people worldwide. In fact, it makes SharePoint fly in the Cloud, using “boxed”, off-the-shelf Microsoft products.
There are a few key points to keep in mind when thinking about this:
- SharePoint Online IS the SharePoint cloud (service)
- The underlying infrastructure of SharePoint Online needs to enable all the cloud computing attributes (more info here: https://en.wikipedia.org/wiki/Cloud_computing#Key_features)
- The only way to implement these key features is to HEAVILY automate SharePoint and its underlying technologies (Windows Server, AD DS, Hyper-V, SQL Server, to name the obvious ones).
- The key enabler of all this is definitely Windows PowerShell v2 (https://technet.microsoft.com/en-us/library/ee829690.aspx), but that's just a tool (a small sketch of the kind of task it automates follows right after this list).
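To make that last point concrete, here is a minimal sketch of the kind of task that ends up automated: a storage inventory of every site collection in a farm. It uses the standard on-premises SharePoint 2010 cmdlets, not the actual SharePoint Online tooling, and is for illustration only.

```powershell
# Minimal sketch (illustration only, not the SharePoint Online production tooling):
# inventory the storage footprint of every site collection in the farm,
# the kind of raw data an automated capacity-planning loop feeds on.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

Get-SPSite -Limit All | ForEach-Object {
    New-Object PSObject -Property @{
        Url       = $_.Url
        Database  = $_.ContentDatabase.Name
        StorageMB = [math]::Round($_.Usage.Storage / 1MB, 1)
    }
} | Sort-Object StorageMB -Descending
</code>
```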
If you follow my blog, you will know that SharePoint Automation is a whole topic in itself, and one that I am passionate about.
SharePoint & Cloud computing
I have developed the following broad lines of thinking about automating and optimizing these services:
- A well-structured model is essential here. It should be based on the core concepts, implemented through the processes, and usable to derive the arguments for each task.
- In a cloud(y) environment, failure of requested operations is often the rule => expect failure, and plan on that basis.
- Here are a few key concepts, features and characteristics of a large automated SharePoint system:
- Reliability and resiliency - the service itself must be able to withstand whatever may happen in the layers below it.
- Idempotency - the characteristic of operational tasks that can be replayed as often as needed without creating any discrepancy between the desired and actual states of the system (a small sketch follows this list).
- Scale units - plan capacity in coherent sets of resources, called scale units.
- Virtualization - from Hyper-V to “hypervisor on chip”; this could also work with VMware, Xen or VirtualBox, as long as the necessary tasks are exposed through APIs.
- Utility computing - acquire resources when needed, compute, then release or re-assign them.
- Autonomic systems - that manage themselves to serve their purpose.
- Continuous optimization - always optimize the resources, on an ongoing basis
- Elasticity - ability to absorb unexpected demand fluctuations
- Agility - rapid provisioning of additional resources or services
- System metering (monitoring and telemetry) - critical to know what happens.
- Trust between components - in distributed systems, components must be able to share data and security context.
- Continuous deployment - cut each service into smaller parts linked by interface contracts, so that each part can be continuously improved.
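Here is a minimal sketch of the idempotency point above, run from the SharePoint 2010 Management Shell. The URL and owner account are purely hypothetical; the only point is that the script can be replayed any number of times without drift between desired and actual state.

```powershell
# Hypothetical values, for illustration only.
$url   = 'http://sp-web-01/sites/team001'
$owner = 'CONTOSO\svc-provision'

# Idempotent provisioning: create the site collection only if it is missing.
# Replaying the script leaves the system in exactly the same state every time.
if (-not (Get-SPSite -Identity $url -ErrorAction SilentlyContinue)) {
    New-SPSite -Url $url -OwnerAlias $owner -Template 'STS#0'
}
```

The same "test, then converge to the desired state" pattern applies to any operational task that the automation replays at scale.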
I spent a lot of time studying these concepts with 2 aspects in mind:
- Theoretically: through various internal works and lectures, partner and competitor white papers, books and keynotes
- Practically: by learning how Microsoft uses them in its datacenters for its cloud services (ranging from Live Messenger/Hotmail and Bing to Office 365 and the Azure platform).
It led me to realize that: Cloud computing is the "Industrial age" of Datacenters!
Industrial Engineering in Cloud computing
My academic background is in Industrial Engineering, with the French equivalent of a Master of Science in Industrial Engineering plus some advanced specializations.
Cloud computing terminology is very close to that of Industrial Engineering (sometimes closer to it than to Computer Science). Check for yourself the terms you'll find in the references and books at the end of this post, such as:
- Resource management
- Resiliency
- Elasticity
- Common environmental requirements
- Power
- Efficiency
- Productivity
- Cost optimization
- Lifecycles
- Workloads
ALL these words come from mechanics, physics or industrial engineering, disciplines that have been developed for more than a century, ever since “factories” were invented.
When you see a “modern, Internet-scale” Datacenter... what do you see? You see a huge building that looks more like a car supplier's factory than a “data processing” facility, don't you? So I started to analyze the core similarities between the “factories” I used to work with in the automotive industry and the Cloud computing concepts. This is how I came to the “Digital Plant” concept.
From Datacenters to Digital Plants ...
While studying which industrial approaches and tools could be transferred to the design of Digital Plants, I found common ground between Computer and Industrial science:
- Lifetime and workload management are the same issues and drivers
- Discrete manufacturing (cars, for example) has a lot to offer on planning and resource/asset optimization
- Process manufacturing (glass or chemistry, for example) has a lot to offer on optimizing the "recipes" used to deliver a product (a service, in the case of the Digital Plant)
- Utility industries - of course - (electricity or water, for example) have a lot to offer on the way they plan and optimize the generation, distribution and final consumption of their resources, since almost no stock is possible. And you cannot "store" an online service either.
Based on these roots, I searched for techniques and mathematical tooling used in industry that could be applied to Cloud computing. I found some of them very useful and interesting when transferred into a Digital Plant. They include:
- Multiple Product (and Service) Lifecycle Management in parallel:
- The Digital Plant building and foundation have a "long" lifetime (around 10 years)
- The Digital Plant productive components (such as servers, network hardware, etc.) have a "medium" lifetime (around 3 years)
- The Digital Plant sold service (the online service which is sold) has a "short" lifetime, as it is software based and evolves in a very competitive environment (it can be a few weeks)
- More info here: https://en.wikipedia.org/wiki/Product_lifecycle_management
- The Pareto distribution applies:
- This rule, commonly named '80-20', is key for planning and optimization
- More info here: https://en.wikipedia.org/wiki/Pareto_distribution
- Resource optimization:
- Workload demand fluctuates constantly
- So the available resources are always either slightly too plentiful or not quite sufficient
- This leads to an analytical - in fact, holistic - approach to planning and optimization:
- it is impossible to compute, from all the inputs, exactly how to obtain the expected outputs with the assets in use.
- This means constant tradeoffs in the design and operation of the resources.
=> Resource optimization is the essence of Industrial Engineering.
The only way to get a manageable system is to switch to statistical and probabilistic approaches, driven by mathematical models.
This means Digital Plants mix and optimize a lot of different distributions, such as:
- The normal (Gaussian) distribution
- Log / exponential distributions
- The Poisson distribution and its derivatives
- Plus their associated transfer functions, used to extract trends from data/measurements and prepare decisions
- More info here: https://en.wikipedia.org/wiki/Distribution_(mathematics)
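As a quick reminder, and independent of any Digital Plant specifics, these are the standard textbook forms of the distributions named above:

```latex
% Standard textbook definitions (not specific to this post):
% normal (Gaussian) density, exponential density, Poisson probability mass function.
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\qquad
f(x) = \lambda e^{-\lambda x},\; x \ge 0
\qquad
P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!},\; k = 0, 1, 2, \dots
```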
Here are a few examples of what should be considered relevant models:
- Look at the CPU usage of a virtual machine over a given period of time:
If you transform this measure into time spent in "ranges" (like 0 to 10%, 10 to 19%, etc.), you'll obtain a CPU resource distribution close to this model (a small sketch of this bucketing follows this list)
- Another pattern you may often find is the "parts distribution", like these 2, extracted from storage data analysis of a disk (or LUN): they represent the total size stored on this LUN per file extension:
Now let's zoom in on the first 10%, and the trend is clearer: this is an exponential distribution
And the funny part: this same LUN, now seen through the item counts per file extension, shows the same pattern! (the sketch after this list computes both measures)
- For demand fluctuations/constant changes, stock market analysis models may be very useful. One to consider would be the Elliott wave.
- These models are the key ones for the resource optimizer to implement and continuously tune in a Digital Plant. They add the idea that the past helps prepare the future.
- An important key is to find the correlations between the relevant inputs and the result/product/service to deliver efficiently. Here again, Industrial Engineering methodologies, especially the Taguchi-related ones, should be of great help.
- My last point here is about sampling rates. All of this is based on large-scale data collection, but that data collection must not impact the service. It has to be tuned to get accurate correlations without adding disturbance to the system itself. Another endless tradeoff scenario well known in Industrial Engineering.
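Here is a minimal sketch of the two measures above, run against a local machine and a local volume as stand-ins (the counter path, the sampling rate and the D:\ volume are placeholder choices, not the production telemetry of the service):

```powershell
# 1) CPU usage turned into time spent per 10% range (the histogram discussed above).
#    The sampling rate itself is a tradeoff: frequent enough to be accurate,
#    light enough not to disturb the system being measured.
$cpu = Get-Counter -Counter '\Processor(_Total)\% Processor Time' `
                   -SampleInterval 5 -MaxSamples 120            # ~10 minutes of data
$cpu.CounterSamples |
    Group-Object { [math]::Floor($_.CookedValue / 10) * 10 } |
    Sort-Object { [int]$_.Name } |
    ForEach-Object { '{0,3}-{1,3}% : {2} samples' -f $_.Name, ([int]$_.Name + 9), $_.Count }

# 2) "Parts distribution" of a volume (D:\ as a stand-in for a LUN):
#    total size and item count per file extension, sorted so that the
#    Pareto-like shape (a few extensions dominate) is visible at a glance.
Get-ChildItem -Path 'D:\' -Recurse -ErrorAction SilentlyContinue |
    Where-Object { -not $_.PSIsContainer } |
    Group-Object Extension |
    ForEach-Object {
        New-Object PSObject -Property @{
            Extension = $_.Name
            Items     = $_.Count
            SizeMB    = [math]::Round(($_.Group | Measure-Object Length -Sum).Sum / 1MB, 1)
        }
    } |
    Sort-Object SizeMB -Descending |
    Select-Object -First 20
```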
As I moved on to sort out all these ideas, I realized various things:
- There's not that much research in this area - current approaches are still analytical and aimed at achieving an exact situation (which is not particularly useful)
- Many ideas are emerging from this; there's probably room to study them, and to write a book about them
- All this is possible thanks to the huge improvements in network bandwidth and availability over the years. When you think back to the BNC Ethernet networks of 20 years ago, they were so much slower than my current personal ADSL connection to the Internet...
To end this call for a story still to be written, I'd like to emphasize 2 things:
- Have a look at what an Industrial Plant is (https://en.wikipedia.org/wiki/Industrial_plant), then watch and browse the references below on current Datacenters. You'll be struck by the similarities.
- I want to contribute to this cross-discipline effort, where the best of 2 worlds (Computer & Industrial sciences) comes together to create a new environment for innovation.
Contact me if you're interested too :-)
Thanks for reading this long post, which is broader than just SharePoint automation.
< Emmanuel />
References:
The evolution of Cloud Computing and the changing anatomy of a Microsoft data center: https://sharepointineducation.com/the-evolution-of-cloud-computing-and-the-changing-anatomy-of-a-microsoft-data-centre
Data Center knowledge: https://www.datacenterknowledge.com/
Microsoft Datacenters Blog: https://blogs.technet.com/b/msdatacenters/
Automation definition: https://en.wikipedia.org/wiki/Automation
Automation Outline: https://en.wikipedia.org/wiki/Outline_of_automation
Principles & Theory of Automation: https://www.britannica.com/EBchecked/topic/44912/automation/24841/Principles-and-theory-of-automation
The Cloud: Battle of the Tech Titans: https://www.businessweek.com/magazine/content/11_11/b4219052599182.htm
Books:
The Datacenter as a Computer, Luiz Barroso, Urs Hölzle (Google Inc.), Morgan & Claypool, https://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006
Monitoring the Data Center, Virtual Environments, and the Cloud, Don Jones, https://nexus.realtimepublishers.com/dgmdv.php
The Big Switch, Nicholas Carr, W. W. Norton & Company, https://www.nicholasgcarr.com/bigswitch/