VM watch Collectors Suite
VM watch collectors gather VM health data on resources such as disk and network by running health checks within the VM. This suite of collectors aids in identifying issues, monitoring performance trends, and optimizing resources to enhance the overall user experience.
This article provides a summary of all available collectors in VM watch, along with the corresponding checks, metrics, logs, and parameter configurations. For detailed descriptions of each check, metric, and log, refer to the VM watch overview page.
Prerequisites
This article assumes that you're familiar with:
Name | Description |
---|---|
Collector | Logical grouping of similar tests where you can collect checks, metrics, and logs to determine the health of a particular resource |
Signals | Output emitted to reflect the health status of VMs. The three types of signals are checks, metrics, and logs |
Group | Indicates whether the collectors are part of the core or optional group. Core group collectors are enabled by default, while optional group collectors can be enabled or disabled based on your requirements |
Tags | Used to categorize and filter checks, metrics, and logs |
Eligibility | Determines whether a collector is eligible to be executed based on the environment attributes you specify |
Default Behavior | Standard settings and actions that apply when no custom configuration is provided |
Overwritable Parameters | Associated parameters that can be customized to override the default configuration |
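
To illustrate how the Group and Eligibility concepts defined above could combine, here is a minimal Python sketch. It is not VM watch's implementation: the `Collector` class, the `collectors_to_run` function, and the `enabled_optional` parameter are hypothetical names introduced for this example. The eligibility rule shown (the environment attribute `OutboundConnectivityDisabled` isn't set or is set to "false") comes from the eligibility table later in this article; treating the value case-insensitively is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

EnvironmentAttributes = Dict[str, str]


def outbound_allowed(env: EnvironmentAttributes) -> bool:
    """Eligible when OutboundConnectivityDisabled is unset or set to "false"."""
    value = env.get("OutboundConnectivityDisabled")
    # Assumption: the comparison ignores case and surrounding whitespace.
    return value is None or value.strip().lower() == "false"


def always_eligible(env: EnvironmentAttributes) -> bool:
    return True


@dataclass(frozen=True)
class Collector:  # hypothetical model, for illustration only
    name: str
    group: str  # "Core" or "Optional"
    tags: Tuple[str, ...]
    is_eligible: Callable[[EnvironmentAttributes], bool]


COLLECTORS = [
    Collector("outbound_connectivity", "Core", ("Network",), outbound_allowed),
    Collector("tcp_stats", "Core", ("Network",), always_eligible),
    Collector("process_monitor", "Optional", ("Process",), always_eligible),
    Collector("az_storage_blob", "Optional", ("AzBlob",), outbound_allowed),
]


def collectors_to_run(
    collectors: List[Collector],
    env: EnvironmentAttributes,
    enabled_optional: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return names of collectors that are both enabled and eligible."""
    selected = []
    for collector in collectors:
        # Core collectors are enabled by default; optional ones need opt-in.
        enabled = collector.group == "Core" or collector.name in enabled_optional
        if enabled and collector.is_eligible(env):
            selected.append(collector.name)
    return selected
```

With an empty set of environment attributes, only the two core collectors in this sample list would be selected. Setting `OutboundConnectivityDisabled` to "true" would additionally drop `outbound_connectivity`, because eligibility is evaluated independently of the group.
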
Groups, tags and corresponding checks, metrics, and event logs
Collector Name | Group | Tags | Checks | Metrics | Event Logs |
---|---|---|---|---|---|
outbound_connectivity | Core | Network | | | |
dns | Core | Network | | | |
tcp_stats | Core | Network | | | |
clock_skew | Core | Clock | | | |
disk_io | Core | Disk | | | |
disk_iops | Core | Disk | | | |
imds | Core | IMDS | | | |
process | Core | Process | | | |
process_memory | Core | Process | | | |
process_cpu | Core | Process | | | |
process_monitor | Optional | Process | | | |
system_error | Core | OS | | | |
az_storage_blob | Optional | AzBlob | | | |
hardware_health_monitor | Optional | Hardware | | | |
hardware_health_nvidia_smi | Optional | Hardware | | | |
Eligibility, default behavior, and overwritable parameters
Collector Name | Eligibility | Default Behavior | Overwritable Parameters |
---|---|---|---|
outbound_connectivity | Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false" | This collector is executed every 60s. In each execution, it sends an HTTP GET request to http://www.msftconnecttest.com/connecttest.txt with a time-out of 5s. If the request fails, it retries at most two more times with an interval of 10s. The verification is marked as "Failed" if all the retries fail. | |
dns | Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false" | This collector is executed every 180s. In each execution, it tries to resolve the DNS name www.msftconnecttest.com. The verification is marked as "Failed" if the DNS name can't be resolved. | |
tcp_stats | Always eligible | This collector is executed every 180s. In each execution, it collects the TCP statistics of the last 180s. | |
clock_skew | Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false" | This collector is executed every 180s. In each execution, it retrieves the clock offset between the remote NTP server time.windows.com and the VM. The verification is marked as "Failed" if the clock skew is larger than 5.0 seconds. On Windows VMs, if connecting to the remote NTP server fails, it falls back to checking the Windows Time Service with the w32tm command. The verification is marked as "Failed" if the w32tm command returns "Leap Indicator: 3(not synchronized)". | |
disk_io | Always eligible if mount points aren't specified. If mount points are explicitly specified, only eligible when data disks are attached to the VM | This collector is executed every 180s. In each execution, it verifies disk I/O availability at each available mount point by creating a folder, creating a file, writing bytes to the file, deleting the file, and deleting the folder. Then it collects disk usage info, including used space, free space, total capacity, and used percentage, from each mount point. | |
disk_iops | Always eligible | This collector is executed every 180s. In each execution, it collects the disk read and write operations per second metrics from each available disk device. | |
imds | Always eligible | This collector is executed every 180s. In each execution, it queries the IMDS endpoint http://169.254.169.254/metadata/instance/compute and verifies that the response body contains the information (SubscriptionId, ResourceGroup, VMId, ResourceId) of the VM. The query time-out is 10s. If the query fails, it retries at most three more times, at intervals of 15s, 30s, and 45s. | |
process | Always eligible | This collector is executed every 180s. In each execution, it creates and executes the command ${SYTEM_DIR}\system32\cmd.exe /c echo hello on Windows machines and /bin/sh -c echo hello on Linux machines. The time-out of process execution is 10s. | |
process_memory | Always eligible | This collector is executed every 180s. In each execution, it selects the top three processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent, and TotalPageFaults. | |
process_cpu | Always eligible | This collector is executed every 180s. In each execution, it selects the top three processes with the most CPU usage and reports the ProcessCoreUsage, ProcessMachineUsage, and MachineTotalCpuUsage. | |
process_monitor | Always eligible | Not executed. If explicitly enabled by the user, this collector verifies whether the selected process is running and collects its running time in seconds. | |
system_error | Eligible on Windows machines | This collector is executed every 180s. In each execution, it subscribes to the "System" channel of Windows EventLog and queries events whose level, as defined in SystemData, is <= 2 (LOG_ALWAYS, Critical, and Error). The measurementTarget is defined as Source_EventId of the EventLog, using the default Windows locale. Each collection reports at most 10 different measurementTargets. | |
az_storage_blob | Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false" | Not executed. If explicitly enabled by the user, this collector verifies whether the VM can access the selected Azure Storage Blob by using either Managed Identity or a SAS token. | |
hardware_health_monitor | Eligible on Windows machines | Not executed. If explicitly enabled by the user, this collector collects hardware health info from the Windows event log. Currently, only disk-related critical events are collected, including events with ID 7, 500, 504, 505, 512, and 549. | |
hardware_health_nvidia_smi | Eligible on Ubuntu Linux machines | Not executed. If explicitly enabled by the user, this collector collects hardware health info from the Windows event log. Currently, only disk-related critical events are collected, including events with ID 7, 500, 504, 505, 512, and 549. | |
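
The retry behavior described for outbound_connectivity and imds in the table above follows a common pattern: run a probe once, then retry after each configured interval until it succeeds or the retries are exhausted. The Python sketch below illustrates that pattern only and is not VM watch's implementation. The names `run_check`, `http_get_ok`, and `outbound_connectivity_check` are hypothetical, the "Passed" label is an assumption (the table only names "Failed"), and treating HTTP status 200 as success is also an assumption.

```python
import time
import urllib.request
from typing import Callable, Sequence


def run_check(
    probe: Callable[[], bool],
    retry_intervals: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run probe once, then once more after each retry interval.

    Returns "Passed" on the first successful attempt and "Failed" if the
    initial attempt and every retry fail.
    """
    attempts = 1 + len(retry_intervals)
    for attempt in range(attempts):
        try:
            if probe():
                return "Passed"
        except Exception:
            pass  # any exception counts as a failed attempt in this sketch
        if attempt < len(retry_intervals):
            # Wait before the next retry; no wait after the final attempt.
            sleep(retry_intervals[attempt])
    return "Failed"


def http_get_ok(url: str, timeout: float) -> bool:
    """Send an HTTP GET and report whether the server answered with 200."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status == 200


def outbound_connectivity_check() -> str:
    # Values from the table: 5s time-out, at most two retries, 10s apart.
    return run_check(
        lambda: http_get_ok(
            "http://www.msftconnecttest.com/connecttest.txt", timeout=5
        ),
        retry_intervals=[10, 10],
    )
```

Passing `retry_intervals=[15, 30, 45]` with a probe that uses a 10s time-out would mirror the schedule described for imds. The `sleep` parameter is injectable so that the retry logic can be exercised without actually waiting.
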