VM watch Collectors Suite

VM watch collectors gather VM health data on resources such as disk and network by running health checks within the VM. This suite of collectors aids in identifying issues, monitoring performance trends, and optimizing resources to enhance the overall user experience.

This article provides a summary of all available collectors in VM watch, along with the corresponding checks, metrics, logs, and parameter configurations. For detailed descriptions of each check, metric, and log, refer to the VM watch overview page.

Prerequisites

This article assumes that you're familiar with:

Note

Name | Description
Collector | Logical grouping of similar tests where you can collect checks, metrics, and logs to determine the health of a particular resource
Signals | What is emitted to reflect the health status of VMs. The three types of signals emitted are checks, metrics, and logs
Group | Indicates whether the collectors are part of the core or optional group. Core group collectors are enabled by default, while optional group collectors can be enabled or disabled based on your requirements
Tags | Used to categorize and filter checks, metrics, and logs
Eligibility | Determines whether a collector is eligible to be executed based on the environment attributes you specify
Default Behavior | Standard setting and action that is followed if no custom configurations are provided
Overwritable Parameters | Associated parameters that can be customized to override the default configuration

Groups, tags, and corresponding checks, metrics, and event logs

Collector Name Group Tags Checks Metrics Event Logs
outbound_connectivity Core Network
  • outbound_connectivity
dns Core Network
  • dns
tcp_stats Core Network
  • SegmentsRetransmitted
  • TCPSynRetransmits (Linux only)
  • NormalizedSegmentsRetransmitted
  • ConnectionResets
  • NormalizedConnectionResets
  • FailedConnectionAttempts
  • NormalizedFailedConnectionAttempts
  • ActiveConnectionOpenings
  • PassiveConnectionOpenings
  • CurrentConnections
  • SegmentsReceived
  • SegmentsSent
clock_skew Core Clock
  • clockskew
disk_io Core Disk
  • disk_io
  • UsedSpaceInBytes
  • FreeSpaceInBytes
  • CapacityInBytes
  • UsedPercent
disk_iops Core Disk
  • WriteOps
  • ReadOps
imds Core IMDS
  • imds
process Core Process
  • process
process_memory Core Process
  • ProcessRSSPercent
  • ProcessPageFaults
  • MachineMemoryTotalInBytes
  • MachineMemoryUsedPercent
  • TotalPageFaults
process_cpu Core Process
  • ProcessCPUCoreUsage
  • ProcessCPUMachineUsage
  • MachineTotalCpuUsage
process_monitor Optional Process
  • process_monitor
  • UpTime
system_error Core OS
  • SystemErrors
az_storage_blob Optional AzBlob
  • az_storage_blob
hardware_health_monitor Optional Hardware
  • hardware_health_monitor
hardware_health_nvidia_smi Optional Hardware
  • hardware_health_nvidia_smi

Eligibility, default behavior, and overwritable parameters

Collector Name Eligibility Default Behavior Overwritable Parameters
outbound_connectivity Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false" This collector is executed every 60s. In each execution, it sends an HTTP GET request to http://www.msftconnecttest.com/connecttest.txt with a time-out of 5s. If the request fails, it retries at most two more times with an interval of 10s. The verification is marked as "Failed" if all attempts fail.
  • OUTBOUND_CONNECTIVITY_INTERVAL: the execution interval of the Collector. Default: 60s
  • OUTBOUND_CONNECTIVITY_URLS: the URLs that this Collector sends HTTP GET requests to. URLs are provided as a single string using , as the separator. Default: http://www.msftconnecttest.com/connecttest.txt
  • OUTBOUND_CONNECTIVITY_TIMEOUT_IN_MILLISECONDS: the HTTP GET request time-out in milliseconds. Default: 5000
  • OUTBOUND_CONNECTIVITY_TOTAL_ATTEMPTS: the total number of HTTP request attempts, counting the initial request. Default: 3
  • OUTBOUND_CONNECTIVITY_RETRY_INTERVAL_IN_SECONDS: the retry interval in seconds if the previous HTTP request fails. Default: 10
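The retry behavior described for outbound_connectivity can be sketched as follows. This is a minimal illustration, not the VM watch implementation: the function names, the "Passed" label, and the rule that every configured URL must succeed are assumptions. The fetch and sleep functions are injectable so the logic can be exercised without network access.

```python
import time
import urllib.request


def http_get_ok(url: str, timeout_s: float) -> bool:
    """Return True if an HTTP GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Any failure (DNS error, refused connection, time-out, HTTP error)
        # counts as a failed attempt.
        return False


def check_outbound(urls, fetch=http_get_ok, timeout_ms=5000,
                   total_attempts=3, retry_interval_s=10, sleep=time.sleep) -> str:
    """Try each URL up to `total_attempts` times, pausing between attempts."""
    for url in urls:
        for attempt in range(1, total_attempts + 1):
            if fetch(url, timeout_ms / 1000):
                break  # this URL is reachable, move on to the next one
            if attempt < total_attempts:
                sleep(retry_interval_s)
        else:
            # The inner loop finished without a break: every attempt failed.
            return "Failed"
    return "Passed"


if __name__ == "__main__":
    # OUTBOUND_CONNECTIVITY_URLS is a single comma-separated string.
    urls_param = "http://www.msftconnecttest.com/connecttest.txt"
    print(check_outbound(urls_param.split(",")))
```

With the defaults, a URL that never responds costs three attempts and two 10s pauses before the verification is marked as "Failed".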
dns Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false" This collector is executed every 180s. In each execution, it tries to resolve the DNS name www.msftconnecttest.com. The verification is marked as "Failed" if the DNS name can't be resolved.
  • DNS_INTERVAL: the execution interval of the Collector. Default: 180s
  • DNS_NAMES: the domain names to resolve, separated by ,. Default: www.msftconnecttest.com
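A DNS check of this kind could look like the sketch below. It assumes that the verification fails as soon as any name in DNS_NAMES can't be resolved, which the table doesn't state explicitly; the function names and the "Passed" label are illustrative.

```python
import socket


def resolve_ok(name: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if `name` resolves to at least one address."""
    try:
        return len(resolver(name, None)) > 0
    except OSError:
        # socket.gaierror (resolution failure) is a subclass of OSError.
        return False


def check_dns(names_param: str = "www.msftconnecttest.com",
              resolver=socket.getaddrinfo) -> str:
    """Resolve every name in a comma-separated DNS_NAMES style string."""
    names = [n.strip() for n in names_param.split(",") if n.strip()]
    return "Passed" if all(resolve_ok(n, resolver) for n in names) else "Failed"


if __name__ == "__main__":
    print(check_dns())
```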
tcp_stats Always eligible This collector is executed every 180s. In each execution, it collects the TCP statistics of the last 180s.
  • TCP_STATS_INTERVAL: the execution interval of the Collector. Default: 180s
clock_skew Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false" This collector is executed every 180s. In each execution, it retrieves the clock offset between the remote NTP server time.windows.com and the VM. The verification is marked as "Failed" if the clock skew is larger than 5.0 seconds. On Windows VMs, if connecting to the remote NTP server fails, it falls back to checking the Windows Time service with the w32tm command. In that case, the verification is marked as "Failed" if the w32tm command returns "Leap Indicator: 3(not synchronized)".
  • CLOCK_SKEW_INTERVAL: the execution interval of the Collector. Default: 180s
  • CLOCK_SKEW_NTP_SERVER: the remote NTP server used to calculate clock skew. Default: time.windows.com
  • CLOCK_SKEW_TIME_SKEW_THRESHOLD_IN_SECONDS: the threshold in seconds of clock offset to mark the verification as "Failed". Default: 5.0
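The threshold comparison can be illustrated with a short sketch. The offset formula below is the standard NTP clock-offset calculation from four timestamps; whether VM watch computes the offset this way is an assumption, as are the function names and the "Passed" label.

```python
def ntp_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """Standard NTP clock offset in seconds.

    t0: client send time      t1: server receive time
    t2: server transmit time  t3: client receive time
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0


def check_clock_skew(offset_s: float, threshold_s: float = 5.0) -> str:
    """Mark the verification as "Failed" when the skew exceeds the threshold."""
    # The sign of the offset only tells which clock is ahead, so compare
    # the magnitude against CLOCK_SKEW_TIME_SKEW_THRESHOLD_IN_SECONDS.
    return "Failed" if abs(offset_s) > threshold_s else "Passed"


if __name__ == "__main__":
    # Hypothetical timestamps: the server clock is 7 seconds ahead of the VM.
    offset = ntp_offset(100.0, 107.0, 107.0, 100.0)
    print(offset, check_clock_skew(offset))
```

A skew of exactly 5.0 seconds still passes in this sketch, because the table describes the failure condition as "larger than" the threshold.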
disk_io Always eligible if mount points aren't specified. If mount points are explicitly specified, only eligible when data disks are attached to the VM This collector is executed every 180s. In each execution, it verifies disk I/O availability at each available mount point by creating a folder, creating a file in it, writing bytes to the file, deleting the file, and deleting the folder. It then collects disk usage info, including used space, free space, total capacity, and used percentage, from each mount point.
  • DISK_IO_INTERVAL: the execution interval of the Collector. Default: 180s
  • DISK_IO_MOUNT_POINTS: the mount points separated by ,. No default value
  • DISK_IO_IGNORE_FS_LIST: the file system list that should be ignored separated by ,. Default: tmpfs,devtmpfs,devfs,iso9660,overlay,aufs,squashfs,autofs
  • DISK_IO_FILENAME: the name of the file used to verify the file read/write. Default: vmwatch-{timestamp}.txt
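The create, write, and delete sequence, followed by the usage metrics, can be sketched as below. This is an illustration under stated assumptions: the folder name, the payload, and the "Passed" label are hypothetical, and the file name only imitates the vmwatch-{timestamp}.txt default.

```python
import os
import shutil
import tempfile
import time


def check_disk_io(mount_point: str) -> dict:
    """Exercise folder and file I/O under `mount_point`, then report usage."""
    folder = os.path.join(mount_point, f"vmwatch-check-{os.getpid()}")  # hypothetical name
    file_path = os.path.join(folder, f"vmwatch-{int(time.time())}.txt")
    try:
        os.mkdir(folder)                 # create a folder
        with open(file_path, "wb") as fh:
            fh.write(b"vmwatch")         # create a file and write bytes to it
        os.remove(file_path)             # delete the file
        os.rmdir(folder)                 # delete the folder
        status = "Passed"
    except OSError:
        shutil.rmtree(folder, ignore_errors=True)  # best-effort cleanup
        status = "Failed"

    usage = shutil.disk_usage(mount_point)  # named tuple: total, used, free
    used_percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "status": status,
        "UsedSpaceInBytes": usage.used,
        "FreeSpaceInBytes": usage.free,
        "CapacityInBytes": usage.total,
        "UsedPercent": used_percent,
    }


if __name__ == "__main__":
    # A temporary directory stands in for a real mount point.
    with tempfile.TemporaryDirectory() as scratch:
        print(check_disk_io(scratch))
```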
disk_iops Always eligible This collector is executed every 180s. In each execution, it collects the disk read and write operations per second metrics from each available disk device.
  • DISK_IOPS_INTERVAL: the execution interval of the Collector. Default: 180s
  • DISK_IOPS_DEVICES: the device names separated by ,. No default value
  • DISK_IOPS_IGNORE_DEVICE_REGEX: the regex of the device name that should be ignored. Default: loop
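Operations per second are typically derived from two readings of cumulative per-device counters. The sketch below assumes that approach, assumes the ignore regex is applied as a substring search on the device name, and uses a hypothetical input shape of {device: (read_ops_total, write_ops_total)}.

```python
import re


def disk_iops(prev: dict, curr: dict, interval_s: float = 180.0,
              ignore_regex: str = "loop") -> dict:
    """Compute ReadOps and WriteOps per second between two counter samples."""
    ignore = re.compile(ignore_regex)
    result = {}
    for device, (reads_now, writes_now) in curr.items():
        # Skip ignored devices and devices with no earlier sample.
        if ignore.search(device) or device not in prev:
            continue
        reads_before, writes_before = prev[device]
        result[device] = {
            "ReadOps": (reads_now - reads_before) / interval_s,
            "WriteOps": (writes_now - writes_before) / interval_s,
        }
    return result


if __name__ == "__main__":
    earlier = {"sda": (1000, 2000), "loop0": (0, 0)}
    later = {"sda": (1900, 2360), "loop0": (5, 5)}
    print(disk_iops(earlier, later))
```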
imds Always eligible This collector is executed every 180s. In each execution, it queries the IMDS endpoint http://169.254.169.254/metadata/instance/compute and verifies that the response body contains the information (SubscriptionId, ResourceGroup, VMId, ResourceId) of the VM. The query time-out is 10s. If the query fails, it retries at most three more times, waiting 15s, 30s, and 45s before the successive retries.
  • IMDS_INTERVAL: the execution interval of the Collector. Default: 180s
  • IMDS_ENDPOINT: the URL of the IMDS endpoint. Default: http://169.254.169.254/metadata/instance/compute
  • IMDS_TIMEOUT_IN_SECONDS: the time-out in seconds of each query. Default: 10
  • IMDS_QUERY_TOTAL_ATTEMPTS: the total number of HTTP request attempts, counting the initial request. Default: 4
  • IMDS_RETRY_INTERVAL_IN_SEONDS: the retry intervals in seconds if the previous HTTP request fails. Default: 15, 30, 45
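The query, verify, and retry cycle for IMDS might be sketched as follows. Several details are assumptions rather than documented VM watch behavior: the JSON field names, the api-version value, treating a response with missing fields as a failed attempt, and the "Passed" label. IMDS does require the Metadata: true header and expects requests to bypass proxies.

```python
import json
import time
import urllib.request

# Assumed JSON keys for SubscriptionId, ResourceGroup, VMId, and ResourceId.
REQUIRED_FIELDS = ("subscriptionId", "resourceGroupName", "vmId", "resourceId")
DEFAULT_ENDPOINT = "http://169.254.169.254/metadata/instance/compute"


def query_imds(endpoint: str, timeout_s: float) -> dict:
    """Send one IMDS request and return the parsed JSON body."""
    request = urllib.request.Request(
        endpoint + "?api-version=2021-02-01",  # assumed API version
        headers={"Metadata": "true"},
    )
    # An empty ProxyHandler makes the request bypass any configured proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_imds(query=query_imds, endpoint=DEFAULT_ENDPOINT, timeout_s=10,
               total_attempts=4, retry_intervals_s=(15, 30, 45),
               sleep=time.sleep) -> str:
    """Query IMDS with increasing waits between attempts."""
    for attempt in range(total_attempts):
        try:
            body = query(endpoint, timeout_s)
            if all(body.get(field) for field in REQUIRED_FIELDS):
                return "Passed"
        except Exception:
            pass  # network errors and malformed bodies count as failed attempts
        if attempt < total_attempts - 1:
            # Reuse the last interval if fewer intervals than retries are given.
            sleep(retry_intervals_s[min(attempt, len(retry_intervals_s) - 1)])
    return "Failed"


if __name__ == "__main__":
    print(check_imds())
```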
process Always eligible This collector is executed every 180s. In each execution, it creates and executes the command ${SYTEM_DIR}\system32\cmd.exe /c echo hello on Windows machines and /bin/sh -c echo hello on Linux machines. The time-out of process execution is 10s.
  • PROCESS_INTERVAL: the execution interval of the Collector. Default: 180s
  • PROCESS_TIMEOUT: the time-out of process execution. Default: 10s
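A process-creation check along these lines can be sketched with the standard library. The pass criterion (exit code 0 within the time-out), the use of the SystemRoot environment variable to locate cmd.exe, and the "Passed" label are assumptions made for illustration.

```python
import os
import subprocess
import sys


def check_process(timeout_s: float = 10.0) -> str:
    """Spawn a trivial shell command and report whether it ran in time."""
    if sys.platform == "win32":
        # Assumption: resolve the system directory from SystemRoot.
        system_root = os.environ.get("SystemRoot", r"C:\Windows")
        cmd = [os.path.join(system_root, "system32", "cmd.exe"), "/c", "echo", "hello"]
    else:
        cmd = ["/bin/sh", "-c", "echo hello"]
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout_s, check=False)
    except (OSError, subprocess.TimeoutExpired):
        # The process could not be created or did not finish in time.
        return "Failed"
    return "Passed" if done.returncode == 0 else "Failed"


if __name__ == "__main__":
    print(check_process())
```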
process_memory Always eligible This collector is executed every 180s. In each execution, it selects the top three processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent, and TotalPageFaults.
  • PROCESS_MEMORY_INTERVAL: the execution interval of the Collector. Default: 180s
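Selecting the top three processes is a small ranking step once per-process samples are available. The sketch below assumes a hypothetical sample shape with name and rss_bytes keys, and assumes that ProcessRSSPercent means resident memory as a percentage of total machine memory. Collecting the samples themselves is outside the scope of this example.

```python
import heapq


def top_memory_processes(samples: list, machine_total_bytes: int, n: int = 3) -> list:
    """Return the `n` processes with the largest resident set size."""
    top = heapq.nlargest(n, samples, key=lambda s: s["rss_bytes"])
    return [
        {
            "name": s["name"],
            "ProcessRSSPercent": 100.0 * s["rss_bytes"] / machine_total_bytes,
        }
        for s in top
    ]


if __name__ == "__main__":
    # Hypothetical samples on a machine with 1000 bytes of memory.
    samples = [
        {"name": "a", "rss_bytes": 100},
        {"name": "b", "rss_bytes": 400},
        {"name": "c", "rss_bytes": 200},
        {"name": "d", "rss_bytes": 300},
        {"name": "e", "rss_bytes": 50},
    ]
    print(top_memory_processes(samples, machine_total_bytes=1000))
```

The same ranking applies to process_cpu with a CPU usage key in place of rss_bytes.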
process_cpu Always eligible This collector is executed every 180s. In each execution, it selects the top three processes with the most CPU usage and reports the ProcessCPUCoreUsage, ProcessCPUMachineUsage, and MachineTotalCpuUsage.
  • PROCESS_CPU_INTERVAL: the execution interval of the Collector. Default: 180s
process_monitor Always eligible Not executed by default. If explicitly enabled by the user, this collector verifies whether the selected process is running and collects its running time in seconds.
  • PROCESS_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
  • PROCESS_MONITOR_PROCESS_NAMES: the regular expressions for the process names to be monitored, separated by ,. No default value
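Matching running processes against the configured patterns can be sketched as below. The input shape {process_name: uptime_seconds}, the substring-search semantics, and the "Passed" label are assumptions. Because the patterns are split on commas, a regular expression that itself contains a comma (such as a{1,2}) would be split incorrectly in this sketch.

```python
import re


def monitor_processes(patterns_param: str, running: dict) -> list:
    """Report, per pattern, whether a matching process runs and its uptime."""
    patterns = [re.compile(p.strip()) for p in patterns_param.split(",") if p.strip()]
    results = []
    for pattern in patterns:
        matches = {name: up for name, up in running.items() if pattern.search(name)}
        results.append({
            "pattern": pattern.pattern,
            "status": "Passed" if matches else "Failed",
            "UpTime": matches,  # seconds each matching process has been running
        })
    return results


if __name__ == "__main__":
    running = {"nginx": 3600.0, "sshd": 86400.0, "python3": 12.5}
    for row in monitor_processes("^nginx$, ^postgres", running):
        print(row)
```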
system_error Eligible on Windows machines This collector is executed every three minutes. In each execution, it subscribes to the "System" channel of the Windows event log and queries events whose level, as defined in SystemData, is <= 2 (LOG_ALWAYS, Critical, and Error). The measurementTarget is defined as Source_EventId of the event, using the default Windows locale. A cap of 10 different measurementTargets is applied in each collection.
  • SYSTEM_ERROR_MEASUREMENT_TARGET_CAP: the cap of total different measurementTargets in each collection. Default: 10
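The filtering and capping step can be illustrated on already-collected events. The event dictionary shape is hypothetical, and keeping the most frequent targets when the cap is exceeded is an assumption; the table only states that a cap exists.

```python
from collections import Counter


def summarize_system_errors(events: list, cap: int = 10) -> dict:
    """Count events with level <= 2 per Source_EventId, capped at `cap` targets."""
    counts = Counter(
        f"{event['source']}_{event['event_id']}"
        for event in events
        if event["level"] <= 2  # 0 = LOG_ALWAYS, 1 = Critical, 2 = Error
    )
    # Assumption: when more than `cap` targets exist, keep the most frequent.
    return dict(counts.most_common(cap))


if __name__ == "__main__":
    events = [
        {"source": "Disk", "event_id": 7, "level": 2},
        {"source": "Disk", "event_id": 7, "level": 2},
        {"source": "Kernel-Power", "event_id": 41, "level": 1},
        {"source": "Service Control Manager", "event_id": 7036, "level": 4},
    ]
    print(summarize_system_errors(events))
```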
az_storage_blob Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or is set to "false" Not executed by default. If explicitly enabled by the user, this collector verifies whether the VM can access the selected Azure Storage Blob by using either Managed Identity or SAS token.
  • AZ_STORAGE_BLOB_INTERVAL: the execution interval of the Collector. Default: 180s
  • AZ_STORAGE_ACCOUNT_NAME: the Azure Storage account name. No default value
  • AZ_STORAGE_CONTAINER_NAME: the Azure Storage Container name. No default value
  • AZ_STORAGE_BLOB_NAME: the Azure Storage Blob name. No default value
  • AZ_STORAGE_BLOB_DOMAIN_NAME: the Azure Storage domain name. No default value
  • AZ_STORAGE_SAS_TOKEN_BASE64: the Base64 encoded Azure Storage SAS token. No default value
  • AZ_STORAGE_USE_MANAGED_IDENTITY: whether the managed identity is used for authentication. Default: false
  • AZ_STORAGE_MANAGED_IDENTITY_CLIENT_ID: the managed identity client ID for authentication. No default value
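Because AZ_STORAGE_SAS_TOKEN_BASE64 expects a Base64 encoded value, the token has to be encoded before it's supplied. The sketch below assumes standard (not URL-safe) Base64 over the UTF-8 bytes of the token; the token shown is a made-up placeholder, not a working SAS token.

```python
import base64


def encode_sas_token(sas_token: str) -> str:
    """Encode a SAS token for the AZ_STORAGE_SAS_TOKEN_BASE64 parameter."""
    return base64.b64encode(sas_token.encode("utf-8")).decode("ascii")


def decode_sas_token(encoded: str) -> str:
    """Reverse of encode_sas_token, useful for checking a configured value."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    placeholder_token = "sv=2022-11-02&sp=r&sig=EXAMPLE"  # hypothetical value
    encoded = encode_sas_token(placeholder_token)
    print(encoded)
    print(decode_sas_token(encoded) == placeholder_token)
```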
hardware_health_monitor Eligible on Windows machines Not executed by default. If explicitly enabled by the user, this collector collects hardware health info from the Windows event log. Currently, only disk-related critical events are collected, including events with ID 7, 500, 504, 505, 512, and 549.
  • HARDWARE_HEALTH_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
hardware_health_nvidia_smi Eligible on Ubuntu Linux machines Not executed by default. If explicitly enabled by the user, this collector collects hardware health info by running the /usr/bin/nvidia-smi command.
  • HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the execution interval of the Collector. Default: 60s
  • HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the time-out of running /usr/bin/nvidia-smi command. Default: 10s

Next steps