System Center Operations Manager Incident Remediation with System Center Orchestrator
System Center Operations Manager is a monitoring tool that evaluates and reports on the health of the IT or business services in your enterprise, whether physical or virtual, distributed or centralized, private cloud or traditional infrastructure. Operations Manager directly supports Event Management in ITIL (Information Technology Infrastructure Library) and indirectly supports other processes and functions such as Incident Management, Service Level Management, and the Service Desk.
Commonly, when Operations Manager detects a condition such as a spike in processor utilization or an IIS application pool that keeps stopping, it generates an alert (in ITIL terms, this is an event). In most circumstances, Tier-1 support such as a NOC (Network Operations Center) or the Service Desk receives a notification (directly from Operations Manager or from an Incident Management solution such as Remedy) and then either works the issue (depending on the severity or impact to the service, their skill set, and most importantly the defined functional escalation) or escalates it to Tier-2.
In some cases the response is already automated: previous experience shows that a symptom recurs, so a script or other automation solution handles it. Most of the time, however, human intervention is the first response to an alert from a monitoring tool.
This Wiki article will focus on providing guidance and recommendations to help understand the different scenarios or techniques where the integration between Operations Manager and System Center Orchestrator is relevant to supporting your ITSM processes, as well as how to implement automation solutions.
System Center Operations Manager 2007 R2 and 2012 can integrate with System Center Orchestrator through two known interfaces:
- The Operations Manager (2007 R2 or 2012) Integration Pack, whose activities run in Orchestrator runbooks and connect to Operations Manager
- The Orchestrator Web Service, which an external caller such as an Operations Manager recovery task uses to start a runbook in Orchestrator
The scenario you most often see people blog about is the first option above: someone configures a runbook that uses the Get Alert, Monitor Alert, or Monitor State activity and, based on its configuration, triggers additional logic to remediate the impacted IT or business service. Instead of configuring runbooks to constantly evaluate the alerts generated in Operations Manager, this article focuses on the second option. The response comes from the recovery task feature of a unit monitor, or from a custom alert rule that references the System.CommandExecutor write action module (instead of simply alerting), and it calls the SCOJobRunner utility written by Robert Hearns, which you can learn more about here - http://blogs.technet.com/b/orchestrator/archive/2012/05/15/cool-tool-new-command-line-utility-to-start-a-runbook.aspx.
For alert rules, this approach requires disabling the rule defined in the sealed MP and duplicating its data source configuration using the Authoring Console or Visual Studio, because you cannot author this scenario in the Operations console.
I like the SCOJobRunner utility because it is much simpler than using PowerShell to communicate with the Orchestrator Web Service. The prerequisites for using this tool are the following:
- .NET Framework 4.0 is installed on the agent-managed server where the recovery task will execute.
- An account that has rights to invoke the runbook in Orchestrator. This account will be one of the arguments passed to the SCOJobRunner utility in order to authenticate with Orchestrator and execute the specified runbook.
- The Orchestrator Web Service port is not blocked by any firewalls (the default port number is 81).
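The firewall prerequisite above can be checked before any monitor is configured. The following is a minimal Python sketch, not part of SCOJobRunner or Orchestrator, that only tests whether a TCP connection to the web service port can be opened. The function name and the host name in the usage comment are hypothetical, and a successful connection says nothing about whether the account has rights to the runbook.

```python
import socket


def web_service_reachable(host: str, port: int = 81, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage; replace the host name with your Orchestrator web service server:
# print(web_service_reachable("orchestrator01.contoso.local"))
```

Run it from the agent-managed server rather than from your workstation, since the recovery task will execute on the agent.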
Before we proceed, copy the SCOJobRunner executable to a folder on the agent-managed system(s). Example: C:\Support\Tools.
In the following example, we are going to create a basic runbook that will be initiated by an Operations Manager unit monitor recovery task. This runbook will accept two arguments (the computer name and the service name), verify the service is not running, and attempt to restart the service. You can expand it further, for example to generate an Operations Manager alert if the restart attempt fails, or add other logic as you see fit in your environment.
- On the computer where the Runbook Designer is installed, click Start, point to All Programs, click System Center 2012 - Orchestrator, and then click Runbook Designer.
- In Runbook Designer, in the Connections pane, click the Runbooks folder.
- In the Connections pane, click the Create a new runbook icon.
- In the Runbook Designer Design workspace, right-click the Runbook tab, and then select Rename.
- Enter a name for the runbook, such as “RestartWindowsService” and press Enter.
- In the Activities pane, drag the Initialize Data activity from the Runbook Control category to the Design workspace of your runbook.
- In the Activities pane, drag the Get Service Status activity from the Monitoring category to the Design workspace and place it to the right of the Initialize Data activity.
- In the Activities pane, drag the Start/Stop Service activity from the System category to the Design workspace and place it to the right of the Get Service Status activity.
- Double-click the Initialize Data activity and on the Details tab, click Add twice.
- Click on the link for Parameter 1 and in the Data dialog box, type ComputerName and click OK.
- Click on the link for Parameter 2, and in the Data dialog box, type ServiceName and click OK.
- Create smart links between the Initialize Data, Get Service Status, and the Start/Stop Service activities.
- Double-click the Get Service Status activity and on the Details tab, perform the following:
  - In the Computer field, right-click and select Subscribe\Published Data.
  - In the Published Data dialog box, verify that the Initialize Data activity is selected in the Activity drop-down list, select the published data ComputerName in the middle pane, and click OK.
  - In the Service field, right-click and select Subscribe\Published Data.
  - In the Published Data dialog box, verify that the Initialize Data activity is selected in the Activity drop-down list, select the published data ServiceName in the middle pane, and click OK.
  - On the Security Credentials tab, note that by default all runbooks and activities run under the Runbook Service account. Typically this account does not have the privileges required on the servers in your data center. Therefore, configure this activity to use an account that has the rights required to query services on the target servers.
  - Click Finish.
- Double-click the smart link between Get Service Status and Start/Stop Service.
  - On the General tab, in the Name field type Attempt Service Restart.
  - On the Include tab, click on the Get Service Status link and in the Published Data dialog box, select Service Status from the published data list in the middle pane. Click OK.
  - Click on the value link and in the Data dialog box, type Service stopped. Click OK.
  - Click Finish to close the link properties dialog box.
- Double-click the Start/Stop Service activity, and on the Details tab, perform the following:
  - In the Execution section, in the Computer field, right-click and select Subscribe\Published Data.
  - In the Published Data dialog box, verify that the Initialize Data activity is selected in the Activity drop-down list, select the published data ComputerName in the middle pane, and click OK.
  - In the Service field, right-click and select Subscribe\Published Data.
  - In the Published Data dialog box, verify that the Initialize Data activity is selected in the Activity drop-down list, select the published data ServiceName in the middle pane, and click OK.
  - On the Security Credentials tab, note that by default all runbooks and activities run under the Runbook Service account. Typically this account does not have the privileges required on the servers in your data center. Therefore, configure this activity to use an account that has the rights required to start services on the target servers.
  - Click Finish.
- Right-click the Start/Stop Service activity and select Looping.
  - On the General tab, click the checkbox for the Enable option, and in the Delay between attempts field, type 5.
  - On the Exit tab, click the Add button.
  - Click on the Start/Stop Service link and in the Published Data dialog box, select Service Status from the published data list in the middle pane. Click OK.
  - Click on the value link and in the Data dialog box, type Service running. Click OK.
  - Click Finish.
Test this runbook using Runbook Tester and verify it is working successfully. Stop a service on a particular server (in your lab, not production) and provide those parameters to the runbook accordingly.
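To make the control flow of the runbook easier to reason about before testing it, here is a small Python model of the logic built in the steps above. It is only an illustration, not Orchestrator code: the two callables stand in for the Get Service Status and Start/Stop Service activities, and `max_attempts` is an assumption added so the sketch always terminates (the steps above configure only the 5-second delay and the exit condition).

```python
import time
from typing import Callable


def restart_if_stopped(
    get_status: Callable[[], str],
    start_service: Callable[[], str],
    delay_seconds: float = 5,
    max_attempts: int = 3,  # assumption: not configured in the runbook steps
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Model of the runbook: check the service, then retry the start until it runs."""
    status = get_status()
    # Link condition "Attempt Service Restart": continue only when stopped.
    if status != "Service stopped":
        return status
    for attempt in range(1, max_attempts + 1):
        status = start_service()
        # Loop exit condition: Service Status equals "Service running".
        if status == "Service running":
            break
        if attempt < max_attempts:
            sleep(delay_seconds)  # "Delay between attempts"
    return status
```

Passing fake callables, as Runbook Tester lets you do with test parameters, shows the three outcomes: nothing to do, restart succeeded, or restart still failing after the last attempt.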
In the following example, we are going to create a custom unit monitor that will evaluate the state of a Windows service and if it detects the service is not started, the recovery task will invoke the Orchestrator runbook and restart the service. Note that you do not have to create a custom unit monitor to run a recovery task as a response to change in health state. If an existing unit monitor defined in a sealed management pack supports your monitoring scenario, you can create a recovery task and save it to an unsealed MP.
- Create a Basic Service Monitor that targets the Windows Server 2008 Operating System class. In this example, evaluate the Windows Time service. I won’t provide a complete list of steps to create the monitor as I am assuming you are familiar with this already. If not, please review the steps outlined here - http://technet.microsoft.com/en-us/library/bb381240.aspx.
- Configure the recovery task in the following manner:
  - Type of recovery task: Run Command
  - Destination Management Pack: a writeable MP to save this in. (Perhaps the same MP you store your Windows Server 2008 OS overrides in.)
  - Recovery name: Restart Windows Service (Orchestrator Runbook)
  - Health state selected when the recovery will run: the default of Critical is fine as long as it matches the unhealthy health state of your custom Basic Service monitor.
  - Recovery target: Matches the class the Basic Service monitor targets, which in this example is the Windows Server 2008 Operating System class.
  - The checkboxes for Run recovery automatically and Recalculate monitor state after recovery finishes are selected.
  - Full path to file: Provide the path and filename of the location where you copied the SCOJobRunner executable. For example, C:\Support\Tools\SCOJobRunner.exe.
  - Parameters: Here you pass the arguments the utility requires. Robert has done a great job breaking these down on his blog, and the utility prints the same help if you run it without parameters, so I won't repeat them here. The one piece of guidance I will provide is how to obtain the runbook ID GUID. Run the following SQL query against the Orchestrator database to obtain that GUID:
SELECT UniqueID, Name FROM dbo.Policies WHERE Name='<Name of runbook>'
So if your runbook is called “RestartWindowsService” the query will look like this:
SELECT UniqueID, Name FROM dbo.Policies WHERE Name='RestartWindowsService'
The complete set of parameters that I have defined in my example recovery action is:
-ID:"GUID" -Webserver:"<Name of Orchestrator server hosting the web service>" -User:<account with rights to read/execute runbook> -Domain:<domain the account is a member of> -Password:<account password> -Parameters:"ComputerName=$Target/Host/Property[Type="Windows2!Microsoft.Windows.Computer"]/NetbiosComputerName$;ServiceName=Windows Time"
When you have finished entering the parameters, click OK to save the recovery action and OK again to save the configuration of the monitor. Note that on Windows Server 2008 R2 the startup type of the Windows Time service is Manual, so for this example you need to override your custom monitor and change the parameter Alert only if service startup type is automatic from true to false.
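If you prefer to script the runbook GUID lookup rather than run the query by hand, the sketch below shows one way to do it with a parameterized query. It assumes a Python DB-API connection whose driver uses `?` placeholders (pyodbc does, but it is a third-party package and is not shown here). The `demo_connection` helper is purely a stand-in: it builds an in-memory SQLite table with the two columns used in the query and a made-up GUID, so the lookup function can be tried without access to the Orchestrator database.

```python
import sqlite3


def find_runbook_id(conn, runbook_name):
    """Return the UniqueID of the runbook with this name, or None if not found."""
    cur = conn.cursor()
    # The name is passed as a parameter, so quotes in it cannot alter the query.
    cur.execute(
        "SELECT UniqueID, Name FROM dbo.Policies WHERE Name = ?",
        (runbook_name,),
    )
    row = cur.fetchone()
    return row[0] if row else None


def demo_connection():
    """Stand-in database for illustration only (not the real Orchestrator schema)."""
    conn = sqlite3.connect(":memory:")
    # Attaching a second in-memory database named 'dbo' lets SQLite accept
    # the same dbo.Policies table reference used in the article's query.
    conn.execute("ATTACH DATABASE ':memory:' AS dbo")
    conn.execute("CREATE TABLE dbo.Policies (UniqueID TEXT, Name TEXT)")
    conn.execute(
        "INSERT INTO dbo.Policies VALUES (?, ?)",
        ("00000000-0000-0000-0000-000000000000", "RestartWindowsService"),  # made-up GUID
    )
    return conn
```

Against the real database you would pass your own connection object to `find_runbook_id` in place of the demo one.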
After you have completed creating the monitor, you can wait several minutes for this configuration to be downloaded by the agents in your management group. Before testing this from Operations Manager, test it from the Command Prompt on a test server to verify that the credentials you referenced in the parameters above for SCOJobRunner have the appropriate rights and it can talk to the Orchestrator Web Service. If all stars align, you should have a runbook that successfully restarts the particular Windows service.
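For the command-line test described above, it can help to assemble the SCOJobRunner arguments in one place so that the same values end up in the recovery task. The following Python sketch is an assumption-laden helper, not part of the utility: the switch names are copied from the example parameters in this article (with `-ID` assumed to take the same leading hyphen as the others), and all values in the usage comment are hypothetical.

```python
import subprocess


def build_scojobrunner_args(exe_path, runbook_id, webserver, user, domain,
                            password, parameters):
    """Return the argument list for SCOJobRunner as described in this article."""
    # Runbook parameters are joined as Name=Value pairs separated by semicolons.
    joined = ";".join(f"{name}={value}" for name, value in parameters.items())
    return [
        exe_path,
        f"-ID:{runbook_id}",
        f"-Webserver:{webserver}",
        f"-User:{user}",
        f"-Domain:{domain}",
        f"-Password:{password}",
        f"-Parameters:{joined}",
    ]


def run_scojobrunner(args, timeout=60):
    """Run the utility and return (exit code, captured standard output)."""
    # Passing a list lets subprocess quote arguments that contain spaces.
    completed = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return completed.returncode, completed.stdout


# Hypothetical usage on a test server:
# args = build_scojobrunner_args(r"C:\Support\Tools\SCOJobRunner.exe", "<runbook GUID>",
#                                "orchestrator01", "svc-runbook", "CONTOSO", "<password>",
#                                {"ComputerName": "SRV01", "ServiceName": "Windows Time"})
# print(run_scojobrunner(args))
```

The exit code and output shown by the utility are whatever SCOJobRunner itself reports; this sketch does not interpret them.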
Now you can stop the Windows Time service and monitor the Log History for the runbook in Runbook Designer. It should show an entry indicating the recovery task from Operations Manager executed and attempted to run the runbook and its activities. If it does, the entry should show a Status of success. If it doesn’t, verify Operations Manager identified the change in state of the service and created an alert. Open Health Explorer for the Windows computer and review the State Change Event details to verify it executed the recovery action and what the result code is.
While this scenario and example are very simple, my hope is that the concept demonstrated helps you understand how to implement and leverage the integration between Operations Manager and Orchestrator using this technique. I will be looking to update this topic with other scenarios and examples, so stay tuned.