Operations Manager Linux/UNIX agent failover and Resource Pools
Introduction
While answering questions on the Microsoft TechNet Forums, I noticed that many questions keep coming up on the same topics, and that some of those topics are often not that well understood. So I had a discussion with Sam (Sameer Mhaisekar), and we decided that it would be very beneficial to write short blog posts on those topics.
One such topic is agent failover in SCOM and the difference between a Windows and a Linux/UNIX agent. In order to understand how the Linux/UNIX agent failover works, one first needs to become acquainted with the concept of Resource Pools in Operations Manager.
Sam already wrote the first part on this subject, where he explains the basics of Resource Pools and also gives an example of how the failover of the Windows agent works in SCOM. He also referenced 3 very important articles about Resource Pools in SCOM, which you need to read before getting to Linux/UNIX agent failover.
Cross platform agent architecture
Before jumping to failover, it is worth mentioning some important facts about the architecture of the Linux/UNIX agent.
The Linux/UNIX agent is very different from the Windows one and hasn’t been changed since Operations Manager 2012. One of the most important functional differences compared to a Windows agent is the absence of a Health Service implementation. So, all the monitored data is passed to the Health Service on a management server, where all the management pack workflows are run. This makes it a passive agent, which is queried (using the WSMan protocol, port 1270) for availability and performance data by the management servers in the Resource Pool.
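Because the management servers initiate every connection, a quick sanity check is to confirm that each management server in the pool can reach TCP port 1270 on the Linux/UNIX host. Below is a minimal sketch using the built-in Test-NetConnection cmdlet. The host name is a placeholder, and a successful TCP connect only shows that the port is reachable, not that WSMan authentication or the certificates are working.

```powershell
# Hypothetical host name; replace it with a Linux/UNIX system you plan to monitor.
$linuxHost = 'linux01.contoso.local'

# Run this on every management server in the cross platform Resource Pool.
$result = Test-NetConnection -ComputerName $linuxHost -Port 1270

if ($result.TcpTestSucceeded) {
    Write-Output "Port 1270 on $linuxHost is reachable from $env:COMPUTERNAME."
} else {
    Write-Output "Port 1270 on $linuxHost is NOT reachable from $env:COMPUTERNAME."
}
```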
Some important considerations
The first important thing that needs mentioning is that you cannot discover and monitor UNIX/Linux systems without configuring a Resource Pool first. When you start the SCOM discovery wizard you will notice that you cannot continue unless you have selected a Resource Pool from the drop-down menu.
Here are a few important considerations regarding the Resource Pool that will manage the UNIX/Linux agents:
- It is recommended and, in my opinion, very important to dedicate the management servers in the cross platform Resource Pool only to UNIX/Linux monitoring. The reason is the capacity limits that Operations Manager has for monitoring UNIX/Linux and Windows systems, which need to be calculated very accurately. I will try to explain this in detail. A dedicated management server can handle up to 3000 Windows agents, but only 1000 Linux or UNIX computers. We already revealed the reason for that: cross platform workflows run on the management server, and this costs performance.
So, if Windows agents also report to the dedicated management server, capacity and scalability calculations cannot be made precisely, and the performance of the management server can be jeopardized.
This fully applies, and is a must, for larger organizations with many Linux or UNIX computers (hundreds of systems) whose number keeps growing. In smaller monitored environments with a small (tens of systems) and (almost) static number of cross platform agents, dedicating management servers only to those systems can often be overkill. In such cases, I often use management servers that are not dedicated and are members of the Default Resource Pools or manage Windows agents. This, of course, is only possible if the number of Windows agents is well below the capacity limits of the management group, which leaves enough system resources on the management server for running the Linux/UNIX workflows.
- Dedicating a management server to the cross platform Resource Pool means not only that you should not assign Windows agents to report to it, but also that the server should be excluded from the other Resource Pools in the management group (SCOM Default Resource Pools, network monitoring Resource Pools, etc.). The reason is the same: performance. If the management server participates in other Resource Pools, it will also execute other types of workflows.
To exclude the management server from the Default Resource Pools, you will need to change their membership from automatic to manual. By default, each management server added to the management group is automatically added to the Resource Pools that have an automatic membership type. For some of the Resource Pools this can be done in the console; for others, like the “All Management Servers Resource Pool”, it can be done only with PowerShell:
# Switch the pool from automatic to manual membership
Get-SCOMResourcePool -DisplayName "All Management Servers Resource Pool" |
    Set-SCOMResourcePool -EnableAutomaticMembership $false
- When you do the capacity planning for your management group, make sure you don’t forget to calculate the number of UNIX or Linux computers a management server can handle if another member of the Resource Pool fails. Let me explain this with an example:
According to the official documentation (see the link above), a dedicated management server can handle up to 1000 Linux or UNIX computers. But if you have two dedicated management servers in your cross platform Resource Pool and you aim for high availability, you cannot assign 2000 (2 x 1000) agents to the pool. Why? Just imagine what will happen to a management server if its "buddy" from the same Resource Pool fails and all its agents get reassigned to the one that is still operational. You guessed right: it will quickly be overwhelmed by all the agents and become non-operational. So, if you would like your cross platform monitoring to be highly available, the right thing to do is to have two management servers for no more than 1000 agents, so that in case of failure the remaining server can still handle the load.
- Last but not least, make sure the management servers that will be used for Linux/UNIX monitoring are sized (RAM, CPUs, disk space) according to the Microsoft recommendations.
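The failover headroom calculation from the capacity planning point above can be sketched in a few lines of PowerShell. This is only an illustration of the arithmetic: the function name and its parameters are hypothetical, the 1000 agents per server figure is the limit quoted above, and the model simply assumes that the surviving servers must be able to carry all agents.

```powershell
# Hypothetical helper: how many Linux/UNIX agents can a pool carry
# if it must survive the loss of some of its management servers?
function Get-XplatPoolCapacity {
    param(
        [int]$ServerCount,                # dedicated management servers in the pool
        [int]$ToleratedFailures = 1,      # servers that may fail at the same time
        [int]$AgentsPerServer = 1000      # per-server limit quoted in the text
    )
    # Only the surviving servers count towards usable capacity.
    $survivors = $ServerCount - $ToleratedFailures
    if ($survivors -lt 1) { return 0 }
    return $survivors * $AgentsPerServer
}

Get-XplatPoolCapacity -ServerCount 2   # 1000
Get-XplatPoolCapacity -ServerCount 3   # 2000
```

With two servers the usable capacity is 1000 agents, not 2000, which matches the example above.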
Agent Failover
Now back to the agent failover topic. I think it is already pretty clear how the Linux/UNIX agent failover happens behind the scenes, but a short summary won’t do any harm:
- After the discovery wizard is started, a Resource Pool must be selected for managing the systems.
- When the Resource Pool is selected, it assigns one of the participating management servers to complete the actual discovery of the systems and take over the monitoring.
- If that management server fails, the Resource Pool selects one of its other members to take over the monitoring.
This goes back to what we said in the beginning: the UNIX/Linux agent is passive and is queried by the management server. Because of this, it is not actually aware of what happens in the background and simply continues to communicate with the server that is now assigned to it.
Important notes
Now is also the right time to make a couple of very important notes:
- XPlat (cross platform) certificates
Part of preparing the environment for monitoring cross platform systems is the creation of a self-signed certificate and its deployment to every management server that is a member of the Resource Pool. This ensures that, in case of failover, each management server will be able to communicate with the agent using the same certificate.
- High availability with Operations Manager Gateways as members of the Resource Pool (thanks to Graham Davies for the reminder)
What I forgot to mention in the first version of this post, but which is of high importance for maintaining high availability of your Gateway Resource Pools (Resource Pools consisting of Operations Manager Gateway servers), is the fact that two Gateways are not sufficient for achieving high availability. Why? You will find the answer in the article Kevin Holman wrote about Resource Pools in SCOM and how exactly they provide high availability. This is also the same article Sam posted in the first part, and it is a must-read if you have to plan for and manage Resource Pools and cross platform agent failover in Operations Manager.
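As I understand the quorum rule described there, a Resource Pool stays operational only while more than half of its voting members are available. For pools made of management servers, the operational database acts as a default observer (an extra vote), which is not the case for a pool of Gateways. The sketch below only illustrates that arithmetic; the function name and parameters are hypothetical and not part of the OperationsManager module.

```powershell
# Hypothetical helper: does a pool still have quorum?
function Test-PoolQuorum {
    param(
        [int]$TotalVoters,      # pool members plus any observers
        [int]$AvailableVoters   # voters that are still online
    )
    # Quorum requires a strict majority of all voters.
    return ($AvailableVoters * 2) -gt $TotalVoters
}

# Two Gateways, one fails: 1 of 2 is not a majority.
Test-PoolQuorum -TotalVoters 2 -AvailableVoters 1   # False

# Three Gateways, one fails: 2 of 3 is a majority.
Test-PoolQuorum -TotalVoters 3 -AvailableVoters 2   # True
```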
Conclusion
Understanding UNIX/Linux agent high availability and failover is not a hard thing to do. Still, in order to properly plan for Operations Manager cross platform monitoring, there are some additional things like sizing and scalability that need to be considered.