Failover Detection Utility - Availability Group Failover Analysis Made Easy

To analyze the root cause for the failover of an Availability group, users are required to perform coordinated analysis of various logs including the Cluster Logs, SQL Error Logs, and the Availability groups extended events logs. This coordinated analysis of the logs can be difficult and requires extensive knowledge of the internals and the various logs associated with Availability Groups. The failover detection utility is meant to make this analysis easier and provide a quick root cause analysis for the unexpected failovers and/or failed failovers. The failover detection utility currently only supports analysis of Availability groups on Windows.

Failover Detection Utility

The failover detection utility can be downloaded here. The download consists of the following files.

  1. FailoverDetector.exe – The Failover analysis utility.
  2. Configuration.json – The configuration file for the failover analysis utility. The file consists of the following fields.
    {
    "Data Source Path": "F:\\DataLocation\\Data",
    "Health Level": 3,
    "Instances":
    [
    "Replica1",
    "Replica2",
    "Replica3"
    ]
    }
    Data Source Path – Location for the Logs from the different replicas in the Availability Group.
    Health Level – The failover policy level configured for the availability group. More information on the different failover conditions can be found here.
    Instance – The list of replicas in the availability groups.
  3. Supporting DLL’s – Supporting DLLs for the Failover analysis utility.
  4. Data and Reports Folders – The Data and Analysis reports folders.

Preparing for Failover Analysis with the Failover Detection Utility

The first step in using the utility is to configure the configuration.json file to include the location of the data files and the details of the availability group for which analysis is being done. For a correct analysis, all replicas in the availability groups needs to be listed in the configuration file.

The next step is to capture the various logs from each of the replicas and add those under the data folder. The following files are required for the analysis.

  • SQL error logs
  • Always On Availability Groups Extended Event Logs
  • System Health Extended Event Logs
  • System log
  • Windows cluster log

The first three categories of logs can be found in the LOG folder of the SQL instance installation path. For example, if you have a SQL Server 2017 instance named "SQL17RTM01" and If the installation path is on C drive, you will find the above three categories of logs under "C:\Program Files\Microsoft SQL Server\MSSQL14.SQL17RTM01\MSSQL\Log", and copy all AlwaysOn XEvent, System Health XEvent and SQL ERRORLOG files.

The last two categories of logs can be found. You can use commands mentioned below to collect these logs.

Get-ClusterLog -Node $server.ComputerNamePhysicalNetBIOS -Destination $server.ErrorLogPath

Get-Eventlog -ComputerName $server.ComputerNamePhysicalNetBIOS -LogName System | Export-CSV -Path $SystemLogExportPath

Organizing the files

The log files collected in the previous step should be placed in a folder with correct subfolder structure. This folder should be the folder path specified in the "Data Source Path" section of the configuration file "Configuration.json". The following is an example of the subfolder structure, assuming there are two replicas (instances) in the AG. One is "Replica1" and another one is "Replica2".

¦ +---Data

¦ ¦

¦ ¦

¦ +---Replica1

¦ ¦ AlwaysOn_health_0_131532583880140000.xel

¦ ¦ AlwaysOn_health_0_131532634780070000.xel

¦ ¦ AlwaysOn_health_0_131532680183890000.xel

¦ ¦ AlwaysOn_health_0_131532701164380000.xel

¦ ¦ ERRORLOG

¦ ¦ ERRORLOG.1

¦ ¦ ERRORLOG.2

¦ ¦ ERRORLOG.3

¦ ¦ SQLDUMPER_ERRORLOG.log

¦ ¦ system_health_0_131532583879830000.xel

¦ ¦ system_health_0_131532634779760000.xel

¦ ¦ system_health_0_131532680183430000.xel

¦ ¦ system_health_0_131532701164070000.xel

¦ ¦ Replica1_system.log

¦ ¦ Replica1_cluster.log

¦ ¦

¦ +---Replica2

¦ AlwaysOn_health_0_131532571347210000.xel

¦ AlwaysOn_health_0_131532578226200000.xel

¦ AlwaysOn_health_0_131532586348180000.xel

¦ AlwaysOn_health_0_131532725682240000.xel

¦ ERRORLOG

¦ ERRORLOG.1

¦ ERRORLOG.2

¦ ERRORLOG.3

¦ ERRORLOG.4

¦ ERRORLOG.5

¦ ERRORLOG.6

¦ system_health_0_131532571346430000.xel

¦ system_health_0_131532578225950000.xel

¦ system_health_0_131532586347860000.xel

¦ system_health_0_131532725681930000.xel

¦ Replica2_system.csv

¦ Replica2_cluster.log

Executing the Utility

The FailoverDetection.exe can be executed with the following flags (or default mode without any flags)

Default Mode (No execution flags defined) - In the default mode, where no parameters are passed to the utility, the analysis tool will load the configuration file, copy log data from data source path to local workspace data folder, examine the files with the configuration settings, perform the analysis, and write the analysis result to a JSON report file. User should make sure the execution credential running tool has access to share folder defined in Configuration.json. When tool runs into access denial issues or the directory does not exist, the tool will exit with proper error message. After the tool copies the log files to the local tool directory, it scans the local data folder and compare with the AG configuration in Configuration.json. If it finds log data is not complete, the tool will show which files are missing. The tool will also alert the user that the failover root cause might not be properly identified due to missing files.

If analysis is successful, the utility persists the analysis result report as JSON file in Result directory. Analysis result will NOT be presented on the console, unless the "--Show" parameter is specified.

--Analyze - When "--Analyze" is specified as a parameter, the utility will load configuration file without copying log data. It assumes the log files have already been copied over. It does everything as default mode except copying log data. This option is useful if you already have the data in the local tool execution subdirectories and want to rerun the analysis.

- -Show -The utility after analyzing log data will display the results in the command console. Additionally, the results will be persisted to a JSON file in the results folder.

Currently supported failover root causes

The tool in current form can analyze and report failover reason if a failover is due to the following causes.

Planned Failovers

  • Stopping SQL service: Admin or other process gracefully shuts down SQL Server hosting the AG primary. In this case, if automatic failover is enabled, Windows notifies WSFC to initiate a failover and picks one secondary as new primary.
  • Shutting down Windows Server: The Windows Server which hosts the SQL Server that has the primary replica is shut down or restarted. In this case, if automatic failover is enabled, Windows notifies WSFC to initiate a failover and picks one secondary as new primary.
  • Manual failover: An administrator performs a manual failover through Management Studio or through a direct Transact-SQL statement "ALTER AVAILABILITY GROUP <AG name> FAILOVER" or "ALTER AVAILABILITY GROUP <AG name> FAILOVER WITH DATA_LOSS" on the secondary replica that will become the primary replica.

SQL Service Internal Health Issue

  • Too many memory dumps generated on the SQL Server hosting the primary replica: There are more than 100 SQL Server dumps on the primary replica since the last SQL Server restart, and there is at least one dump in the last 10 seconds. In this case, sp_server_diagnostics running in SQL Server determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
  • Memory scribbler: SQL Server has a write access violation and the write address is more than 64KB. If there are more than three such memory corruptions since SQL Server was started, sp_server_diagnostics running in SQL Server determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
  • Sick Spinlock: After an access violation, a spinlock is marked as sick if it backs off more than three times, which is the threshold. In this case, sp_server_diagnostics running in SQL Server determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
  • Out of Memory: If no memory has been freed in 2 minutes, sp_server_diagnostics running in SQL Server determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
  • Unresolved deadlock: If sp_server_diagnostics detects unresolved deadlock from query processing component, it determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
  • Deadlocked scheduler: If sp_server_diagnostics detects deadlocked schedulers from query processing component, it determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.

System Wide Issues

  • Unexpected crash: SQL Server service was shut down unexpectedly. Resources host service (rhs.exe) does not detect lease check from SQL Server about availability group lease. This results an AG lease timeout signal to WSFC and WSFC will initiate a failover.
  • Long dump: SQL Server is creating a dump file. During the process, threads handling the AG lease are frozen and which may result in a lease timeout. This results an AG lease timeout signal to WSFC and WSFC will initiate a failover if automatic failover is enabled.
  • Quorum loss: AG resource is brought offline because quorum is lost. This could be because of a plethora of issues, which would need to be determined by a closure examination of the cluster logs.
  • Network interface failure: Network interface used to communicate between cluster nodes fails. Primary and secondary replicas cannot communicate. WSFC will initiate quorum vote and determine a new primary to fail over to.
  • Cluster disk failure: Network interface used to communicate between cluster nodes has failed, resulting in a communication loss between the cluster nodes. Cluster will initiate quorum vote and determine a new primary failover to.
  • Cluster Service paused: Cluster service on primary was paused, resulting in primary unable to communicate with other nodes in the cluster. Cluster will initiate quorum vote and determine a new primary to fail over to.
  • Node offline: When primary node is frozen or loses power, WSFC loses connection from and to the primary. Failover cluster decide to initiate failover and pick a primary from other possible nodes.
  • High CPU utilization: System wide performance issue causes SQL Server service to be unable to respond to AG lease handler. For example, it is possible that some process on the node is consuming 100% CPU and hence the lease renewal threads cannot execute. This is a best effort estimation based on the inputs from the Cluster logs.
  • High I/O: System wide performance issue causes SQL Server service to be unable to respond to AG lease handler. For example, it is possible that some process on the node is throttling the disks on the system and as such the lease renewal cannot be completed. This is again a best effort estimate based on the inputs from the cluster logs.

The Failover Detection utility can be used to analyze SQL Server 2012, 2014, 2016 and 2017 Availability groups, provided the servers are on the latest service packs and cumulative updates.

Update 11/21

A few users have reported the following error while using the utility. We are looking into this and will post an update shortly.

“An unhandled exception of type ‘System.BadImageFormatException’ occurred in FailoverDetector.exe
Could not load file or assembly ‘Microsoft.SqlServer.XEvent.Linq, Version=14.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91’ or one of its dependencies. An attempt was made to load a program with an incorrect format.”

Update 11/27

New updated binaries for the Failover Detector Utility have been uploaded to GitHub. Before running the utility, please install the "Microsoft Visual C++ Redistributable for Visual Studio 2017" from here. On the download page, scroll down to the "Other Tools and Frameworks" section to download the redistributable (x64 version).

You may encounter the "Strong name validation failed" error while using the utility. To resolve the error use sn.exe (available in the zip file) to disable strong name validation. To disable strong name use the following command

sn.exe -Vr <<path to the DLLs>>

 

Sourabh Agarwal
Senior PM, SQL Server Tiger Team
Twitter LinkedIn
Follow us on Twitter: @mssqltiger | Team Blog: Aka.ms/sqlserverteam

Comments

  • Anonymous
    November 19, 2018
    Sourabh,This looks amazing. If my replicas have more than the default 6 SQL Server error log files,do I need them all or just the ones that cover the time of the failover I want to analyze?The Tiger Team is awesome!Thanks,Frank
    • Anonymous
      November 19, 2018
      You don't need to have all the error log files. We do not check for number of files, just check if the files are present. The Utility will try and provide an analysis as best as it could with the files you have provided. If the files are not sufficient, then it would display a warning.
  • Anonymous
    November 20, 2018
    I have tried this on 2 different computers. It runs through the log collection to the local directory but errors when it begins to analyze. "An unhandled exception of type 'System.BadImageFormatException' occurred in FailoverDetector.exeCould not load file or assembly 'Microsoft.SqlServer.XEvent.Linq, Version=14.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91' or one of its dependencies. An attempt was made to load a program with an incorrect format."I've tried everything I can think of to get passed the error to no avail. Any ideas? i'd love to get this working as I have already spent a decent chunk of time scripting out my log collections.
    • Anonymous
      November 20, 2018
      Kimberlin, Let me check and get back to you on this.
      • Anonymous
        November 20, 2018
        Hi Kimerlin, could you please register the Dll in the GAC and try again.
        • Anonymous
          November 21, 2018
          Have the same issue.Performed "C:\Program Files (x86)\Microsoft SDKs\Windows\v10.0A\bin\NETFX 4.6.1 Tools"\gacutil -i to all 3 dll availableDoes not help
          • Anonymous
            November 27, 2018
            Hi, please refer to the latest update on the blog to resolve this issue.
        • Anonymous
          November 21, 2018
          I had the same issue. Was able to register the DLL ("Assembly successfully added to the cache") but the error persists.
          • Anonymous
            November 27, 2018
            Hi, please refer to the latest update on the blog to resolve this issue.
        • Anonymous
          November 23, 2018
          Sourabh,I'm having the same issue as Kimberlin... Do you have any solution for that rather than register the DLL?Thank you!Regards!
          • Anonymous
            November 27, 2018
            Hi Gonzalo, please refer to the latest update on the blog to resolve this issue.
            • Anonymous
              December 18, 2018
              Hi Sourabh!I tested the tool and it worked fine this time! Excellent tool! Thank you!!
        • Anonymous
          November 26, 2018
          I tried using gacutil and still the same error. Thanks for your quick response though.
          • Anonymous
            November 27, 2018
            Hi Kimberlin, please refer to the latest update on the blog to resolve this issue.
            • Anonymous
              November 28, 2018
              Mine is now working as well. Thanks for the updates. For anyone else reading, I had to apply the sn.exe -Vr to each of the XE DLLs explicitly as Cody mentioned. It took me a minute to figure that out. From admin command prompt I entered the directory where SN.exe and the DLLS live (C:\FOData for me) and ran the following C:\FOData>sn.exe -Vr Microsoft.SqlServer.XE.Core.dll andC:\FOData>sn.exe -Vr Microsoft.SqlServer.XEvent.Linq.dllAfter that it worked!Thanks again.
        • Anonymous
          November 27, 2018
          FYI I tried registering the bad DLL in the GAC and I still received the BadImageFormat exception.
          • Anonymous
            November 27, 2018
            Cody, please refer to the latest update on the blog to resolve the issue.
            • Anonymous
              November 27, 2018
              Thanks it's now working with the re-download, the VC++ installation, and running sn.exe on the two provided XE DLLs.
  • Anonymous
    November 21, 2018
    Sourabh, I just ran this successfully and the report shows the following message: Failover root cause cannot be determined in this case. However, when I look at the System log for the primary replica at time of failover, I see that the cluster lost communication with its file share witness which caused the cluster service to restart. Any idea why the results don't flag that as root cause? Also, is there any reason the log files cannot be copied directly into the Data folder? Thank you for making this available. It saved me a ton of time bouncing between logs to determine root cause.Frank
    • Anonymous
      November 21, 2018
      Thanks Frank!! The analysis is based on certain decision trees we built in the code and this sounds like one of the conditions we do not check in the decision tree (the complete list is in the blog). We do plan to collate the feedback and add more decision trees/condition in the future. Regarding your 2nd question. You can certainly copy all the data to the data folder and run the analysis with the --Analyze flag. In the --Show (or the default) mode the utility will try to copy the files from the Share (mentioned in the config file) to the local data folder.
    • Anonymous
      November 21, 2018
      Hi Frank,Please advise how have you made it running?Did you have any dll issue?Probably tell us more about your environment.
  • Anonymous
    November 22, 2018
    The comment has been removed
    • Anonymous
      November 27, 2018
      Hi Vivek, please refer to the latest update on the blog to resolve this issue.
  • Anonymous
    November 27, 2018
    The comment has been removed
    • Anonymous
      November 27, 2018
      Hi Rob, please refer to the latest update on the blog to resolve this issue.
      • Anonymous
        November 28, 2018
        I can confirm that the exe is working, I have also written a PowerShell wrapper to download and gather all of the required information in case anyone finds it usefulhttps://sqldbawithabeard.com/2018/11/28/gathering-all-the-logs-and-running-the-availability-group-failover-detection-utility-with-powershell/
        • Anonymous
          November 29, 2018
          It works. Thank you a lot for the best step-by-step explanation and automation
  • Anonymous
    November 27, 2018
    In the example structure you have both Replicax_system.log and Replicax_system.csv. Which is the right file extension?Also are there plans to open source this code?
    • Anonymous
      November 27, 2018
      CSV is preferred, but .log will also work. We do have a plan to open source the code in due time.
  • Anonymous
    November 28, 2018
    I keep getting:Unhandled Exception: System.IO.IOException: The process cannot access the file 'G:\FailoverDetectionUtility\Data\servername\AlwaysOn_health_0_131854776100910000.xel' because it is being used by another process.But I'm sure the file isn't in use, I can rename it or delete it but then get the same error on the next file. Running under administrator in cmd window and I'm owner of the file. Logged off and on again but still get the same error.
    • Anonymous
      November 28, 2018
      The comment has been removed