VM watch Collectors Suite

Article
03/03/2025

VM watch collectors gather VM health data on resources such as disk and network by running health checks within the VM. This suite of collectors aids in identifying issues, monitoring performance trends, and optimizing resources to enhance the overall user experience.

This article provides a summary of all available collectors in VM watch, along with the corresponding checks, metrics, logs, and parameter configurations. For detailed descriptions of each check, metric, and log, refer to the VM watch overview page.

Prerequisites

This article assumes that you're familiar with:

Note

Name	Description
Collector	Logical grouping of similar tests where you can collect checks, metrics, and logs to determine the health of a particular resource
Signals	What is emitted to reflect the health status of VMs. The three types of signals emitted are checks, metrics, and logs
Group	Indicates whether the collectors are part of the core or optional group. Core group collectors are enabled by default, while optional group collectors can be enabled or disabled based on your requirements
Tags	Used to categorize and filter checks, metrics, and logs
Eligibility	Determines whether a collector is eligible to be executed based on the environment attributes you specify
Default Behavior	Standard settings and actions that are applied if no custom configuration is provided
Overwritable Parameters	Associated parameters that can be customized to override the default configuration

Groups, tags and corresponding checks, metrics, and event logs

Collector Name	Group	Tags	Checks	Metrics	Event Logs
outbound_connectivity	Core	Network	outbound_connectivity
dns	Core	Network	dns
tcp_stats	Core	Network		SegmentsRetransmitted, TCPSynRetransmits (Linux only), NormalizedSegmentsRetransmitted, ConnectionResets, NormalizedConnectionResets, FailedConnectionAttempts, NormalizedFailedConnectionAttempts, ActiveConnectionOpenings, PassiveConnectionOpenings, CurrentConnections, SegmentsReceived, SegmentsSent
clock_skew	Core	Clock	clockskew
disk_io	Core	Disk	disk_io	UsedSpaceInBytes, FreeSpaceInBytes, CapacityInBytes, UsedPercent
disk_iops	Core	Disk		WriteOps, ReadOps
imds	Core	IMDS	imds
process	Core	Process	process
process_memory	Core	Process		ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent, TotalPageFaults
process_cpu	Core	Process		ProcessCPUCoreUsage, ProcessCPUMachineUsage, MachineTotalCpuUsage
process_monitor	Optional	Process	process_monitor	UpTime
system_error	Core	OS		SystemErrors
az_storage_blob	Optional	AzBlob	az_storage_blob
hardware_health_monitor	Optional	Hardware			hardware_health_monitor
hardware_health_nvidia_smi	Optional	Hardware			hardware_health_nvidia_smi
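
As a rough illustration of how the Group and Tags columns can be read, the sketch below models a few rows of the table above as plain Python data. It selects the collectors that run without extra configuration (Core) plus any Optional ones that are explicitly enabled. The `Collector` class and the `enabled_collectors` and `by_tag` helpers are hypothetical names used only for this example; they are not part of VM watch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Collector:
    name: str
    group: str  # "Core" or "Optional", as in the Group column
    tag: str    # value from the Tags column


# A subset of the rows from the table above.
COLLECTORS = [
    Collector("outbound_connectivity", "Core", "Network"),
    Collector("dns", "Core", "Network"),
    Collector("tcp_stats", "Core", "Network"),
    Collector("disk_io", "Core", "Disk"),
    Collector("process_monitor", "Optional", "Process"),
    Collector("az_storage_blob", "Optional", "AzBlob"),
]


def enabled_collectors(collectors, extra_enabled=()):
    """Core collectors are on by default; Optional ones only if listed."""
    extra = set(extra_enabled)
    return [c.name for c in collectors if c.group == "Core" or c.name in extra]


def by_tag(collectors, tag):
    """Filter collector names by their tag."""
    return [c.name for c in collectors if c.tag == tag]
```

For example, `enabled_collectors(COLLECTORS, ["process_monitor"])` returns the four Core collectors followed by `process_monitor`, and `by_tag(COLLECTORS, "Network")` returns the three network collectors.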

Eligibility, default behavior, and overwritable parameters

Collector Name	Eligibility	Default Behavior	Overwritable Parameters
outbound_connectivity	Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false"	This collector is executed every 60s. In each execution, it sends an HTTP GET request to `http://www.msftconnecttest.com/connecttest.txt` with a time-out of 5s. If the request fails, it retries at most two more times with an interval of 10s. The verification is marked as "Failed" if all attempts fail.	OUTBOUND_CONNECTIVITY_INTERVAL: the execution interval of the Collector. Default: 60s; OUTBOUND_CONNECTIVITY_URLS: the URLs that this Collector sends HTTP GET requests to. URLs are provided as a string using `,` as separator. Default: `http://www.msftconnecttest.com/connecttest.txt`; OUTBOUND_CONNECTIVITY_TIMEOUT_IN_MILLISECONDS: the HTTP GET request time-out in milliseconds. Default: 5000; OUTBOUND_CONNECTIVITY_TOTAL_ATTEMPTS: the total number of attempts to send the HTTP request, including retries after a failure. Default: 3; OUTBOUND_CONNECTIVITY_RETRY_INTERVAL_IN_SECONDS: the retry interval in seconds if the previous HTTP request fails. Default: 10
dns	Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false"	This collector is executed every 180s. In each execution, it tries to resolve the DNS name `www.msftconnecttest.com`. The verification is marked as "Failed" if the DNS name can't be resolved.	DNS_INTERVAL: the execution interval of the Collector. Default: 180s; DNS_NAMES: the domain names to be resolved, separated by `,`. Default: `www.msftconnecttest.com`
tcp_stats	Always eligible	This collector is executed every 180s. In each execution, it collects the TCP statistics of the last 180s.	TCP_STATS_INTERVAL: the execution interval of the Collector. Default: 180s
clock_skew	Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false"	This collector is executed every 180s. In each execution, it retrieves the clock offset between the remote NTP server `time.windows.com` and the VM. The verification is marked as "Failed" if the clock skew is larger than 5.0 seconds. On a Windows VM, if connecting to the remote NTP server fails, it falls back to checking the Windows Time Service with the w32tm command. The verification is marked as "Failed" if the w32tm command returns "Leap Indicator: 3(not synchronized)".	CLOCK_SKEW_INTERVAL: the execution interval of the Collector. Default: 180s; CLOCK_SKEW_NTP_SERVER: the remote NTP server used to calculate clock skew. Default: time.windows.com; CLOCK_SKEW_TIME_SKEW_THRESHOLD_IN_SECONDS: the threshold in seconds of clock offset to mark the verification as "Failed". Default: 5.0
disk_io	Always eligible if mount points aren't specified. If mount points are explicitly specified, only eligible when data disks are attached to the VM	This collector is executed every 180s. In each execution, it verifies disk I/O availability at each available mount point by creating a folder, creating a file, writing bytes to it, deleting the file, and deleting the folder. It then collects disk usage info, including used space, free space, total capacity, and used percentage, from each mount point.	DISK_IO_INTERVAL: the execution interval of the Collector. Default: 180s; DISK_IO_MOUNT_POINTS: the mount points, separated by `,`. No default value; DISK_IO_IGNORE_FS_LIST: the file systems to ignore, separated by `,`. Default: tmpfs,devtmpfs,devfs,iso9660,overlay,aufs,squashfs,autofs; DISK_IO_FILENAME: the name of the file used to verify the file read/write. Default: vmwatch-{timestamp}.txt
disk_iops	Always eligible	This collector is executed every 180s. In each execution, it collects the disk read and write operations per second metrics from each available disk device.	DISK_IOPS_INTERVAL: the execution interval of the Collector. Default: 180s; DISK_IOPS_DEVICES: the device names, separated by `,`. No default value; DISK_IOPS_IGNORE_DEVICE_REGEX: the regex of the device names to ignore. Default: loop
imds	Always eligible	This collector is executed every 180s. In each execution, it queries the IMDS endpoint `http://169.254.169.254/metadata/instance/compute` and verifies that the response body contains the information (SubscriptionId, ResourceGroup, VMId, ResourceId) of the VM. The query time-out is 10s. If the query fails, it retries at most three more times with intervals of 15s, 30s, and 45s.	IMDS_INTERVAL: the execution interval of the Collector. Default: 180s; IMDS_ENDPOINT: the URL of the IMDS endpoint. Default: `http://169.254.169.254/metadata/instance/compute`; IMDS_TIMEOUT_IN_SECONDS: the time-out in seconds of each query. Default: 10; IMDS_QUERY_TOTAL_ATTEMPTS: the total number of attempts to send the HTTP request, including retries after a failure. Default: 4; IMDS_RETRY_INTERVAL_IN_SEONDS: the retry intervals in seconds if the previous HTTP request fails. Default: 15, 30, 45
process	Always eligible	This collector is executed every 180s. In each execution, it creates and executes the command `${SYTEM_DIR}\system32\cmd.exe /c echo hello` on Windows machines and `/bin/sh -c echo hello` on Linux machines. The time-out of process execution is 10s.	PROCESS_INTERVAL: the execution interval of the Collector. Default: 180s; PROCESS_TIMEOUT: the time-out of process execution. Default: 10s
process_memory	Always eligible	This collector is executed every 180s. In each execution, it selects the top three processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent, and TotalPageFaults.	PROCESS_MEMORY_INTERVAL: the execution interval of the Collector. Default: 180s
process_cpu	Always eligible	This collector is executed every 180s. In each execution, it selects the top three processes with the most CPU usage and reports the ProcessCPUCoreUsage, ProcessCPUMachineUsage, and MachineTotalCpuUsage.	PROCESS_CPU_INTERVAL: the execution interval of the Collector. Default: 180s
process_monitor	Always eligible	Not executed by default. If explicitly enabled by the user, this collector verifies whether the selected process is running and collects its running time in seconds.	PROCESS_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s; PROCESS_MONITOR_PROCESS_NAMES: the regular expressions of the process names to be monitored, separated by `,`. No default value
system_error	Eligible on Windows machines	This collector is executed every 180s. In each execution, it subscribes to the "System" channel of Windows EventLog and queries events whose level, as defined in SystemData, is <= 2 (LOG_ALWAYS, Critical, and Error). The measurementTarget is defined as Source_EventId of the EventLog using the default Windows locale. A cap of no more than 10 different measurementTargets is applied in each collection.	SYSTEM_ERROR_MEASUREMENT_TARGET_CAP: the cap of total different measurementTargets in each collection. Default: 10
az_storage_blob	Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false"	Not executed by default. If explicitly enabled by the user, this collector verifies whether the VM can access the selected Azure Storage Blob by using either Managed Identity or a SAS token.	AZ_STORAGE_BLOB_INTERVAL: the execution interval of the Collector. Default: 180s; AZ_STORAGE_ACCOUNT_NAME: the Azure Storage account name. No default value; AZ_STORAGE_CONTAINER_NAME: the Azure Storage Container name. No default value; AZ_STORAGE_BLOB_NAME: the Azure Storage Blob name. No default value; AZ_STORAGE_BLOB_DOMAIN_NAME: the Azure Storage domain name. No default value; AZ_STORAGE_SAS_TOKEN_BASE64: the Base64 encoded Azure Storage SAS token. No default value; AZ_STORAGE_USE_MANAGED_IDENTITY: whether the managed identity is used for authentication. Default: false; AZ_STORAGE_MANAGED_IDENTITY_CLIENT_ID: the managed identity client ID for authentication. No default value
hardware_health_monitor	Eligible on Windows machines	Not executed by default. If explicitly enabled by the user, this collector collects hardware health info from the Windows event log. Currently, only disk-related critical events are collected, including events with IDs 7, 500, 504, 505, 512, and 549.	HARDWARE_HEALTH_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
hardware_health_nvidia_smi	Eligible on Linux Ubuntu machines	Not executed by default. If explicitly enabled by the user, this collector collects hardware health info by running the /usr/bin/nvidia-smi command.	HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the execution interval of the Collector. Default: 60s; HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the time-out of running the /usr/bin/nvidia-smi command. Default: 10s
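
The eligibility rule and retry behavior described for outbound_connectivity can be sketched as follows. This is a minimal illustration under stated assumptions, not the VM watch implementation: the function names (`is_eligible`, `http_get_ok`, `check_outbound_connectivity`) and the injectable `probe` and `sleep` parameters are hypothetical, and only the defaults (URL, 5s time-out, 3 total attempts, 10s retry interval) come from the table above.

```python
import http.client
import time
import urllib.request

DEFAULT_URL = "http://www.msftconnecttest.com/connecttest.txt"


def is_eligible(env_attributes):
    """Eligible when OutboundConnectivityDisabled is unset or set to "false"."""
    value = env_attributes.get("OutboundConnectivityDisabled")
    return value is None or str(value).lower() == "false"


def http_get_ok(url, timeout_s):
    """Return True if an HTTP GET to url succeeds within timeout_s seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket time-outs are all OSError subclasses.
        return False


def check_outbound_connectivity(url=DEFAULT_URL, timeout_s=5.0,
                                total_attempts=3, retry_interval_s=10.0,
                                probe=http_get_ok, sleep=time.sleep):
    """Return True on the first successful attempt, False if all attempts fail."""
    for attempt in range(1, total_attempts + 1):
        if probe(url, timeout_s):
            return True
        if attempt < total_attempts:
            # Wait before the next attempt; no wait after the last one.
            sleep(retry_interval_s)
    return False
```

Passing `probe` and `sleep` as parameters is a design choice for this sketch only: it lets the retry loop be exercised without network access or real delays.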

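The disk_io steps listed in the table (create a folder, create a file, write bytes, delete the file, delete the folder, then collect usage) can be approximated with the Python standard library. This is a sketch under assumptions: `check_disk_io` and `disk_usage_metrics` are hypothetical names, the file name only loosely follows the `vmwatch-{timestamp}.txt` default, and the way VM watch computes `UsedPercent` isn't stated in this article, so the formula below is an assumption.

```python
import os
import shutil
import tempfile
import time


def check_disk_io(mount_point, payload=b"vmwatch"):
    """Return True if a folder and file can be created, written, and removed."""
    folder = None
    try:
        folder = tempfile.mkdtemp(prefix="vmwatch-", dir=mount_point)
        path = os.path.join(folder, f"vmwatch-{int(time.time())}.txt")
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the bytes to the device
        os.remove(path)
        os.rmdir(folder)
        return True
    except OSError:
        return False
    finally:
        # Best-effort cleanup if a step failed partway through.
        if folder is not None:
            shutil.rmtree(folder, ignore_errors=True)


def disk_usage_metrics(mount_point):
    """Collect usage values named after the disk_io metrics."""
    usage = shutil.disk_usage(mount_point)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "UsedSpaceInBytes": usage.used,
        "FreeSpaceInBytes": usage.free,
        "CapacityInBytes": usage.total,
        "UsedPercent": percent,  # assumed formula: used / total
    }
```

Calling `check_disk_io("/mnt/data")` on a writable mount point (the path is only an example) returns `True` and leaves no files behind; a missing or read-only path returns `False`.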