Condividi tramite


Improved AlwaysOn Availability Group Lease Timeout Diagnostics

When your AlwaysOn availability group is configured for automatic failover, you may find your availability group failed over, or if configured for manual failover, you may observe your availability group transition from the PRIMARY role to the RESOLVING role, during which users cannot access the availability group databases.

Checking the SQL Server error log, you find that a lease timeout occurred, and SQL Server reports the error 19407 in the SQL Server error log:

2016-01-24 14:56:37.51 Server      Error: 19407, Severity: 16, State: 1.
2016-01-24 14:56:37.51 Server      The lease between availability group <ag> and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

Lease timeout is a very common health issue detected by AlwaysOn health detection. Additional diagnostics have been introduced to help diagnose the root cause of lease timeout events.

Lease Health Check is One Element of AlwaysOn Health

AlwaysOn availability groups implement a failover policy that monitors the health of SQL Server, and is scalable using the availability group’s FAILURE_CONDITION_LEVEL property. The lease is one of several health checks AlwaysOn health detection uses to monitor SQL Server health.

The FAILURE_CONDITION_LEVEL can be set from 1 to 5 and as you increase the property value, additional checks on SQL Server health are cumulative. The lease health is checked at FAILURE_CONDITION_LEVEL=1, therefore, because of the cumulative behavior, the lease is monitored at all FAILURE_CONDITION_LEVELs of an availability group, one through five.

What kind of Health Issues Cause Lease Timeout?

In SQL Server, the dedicated thread for the lease mechanism is preemptive and outside normal SQLOS control running at higher priority. Therefore, outside of an event in which SQL Server’s dump diagnostics takes place, where the SQL Server process is frozen to produce a user dump of the process, a lease timeout indicates that there is a system wide issue that caused the lease timeout. Examples of this kind of system wide event which could result in lease timeout are sustained 100% cpu utilization or an unresponsive disk subsystem.

For more information on how the AlwaysOn lease monitors health check out:

How It Works: SQL Server AlwaysOn Lease Timeout

For more information on diagnosing AlwaysOn health detection including lease timeouts, check out:

Diagnose Unexpected Failover or Availability Group in RESOLVING State

 

8154.image_6DC006A7

 

Introducing New Lease Diagnostics

Lease timeout is the single most common issue detected by AlwaysOn health detection. Therefore, additional diagnostics have been introduced to help diagnose the root cause.

SQL Server Builds Introducing the New Lease Diagnostics

The lease timeout diagnostic improvements have been added to the following builds of SQL Server:

SQL Server 2012 Service Pack 3
SQL Server 2014 Service Pack 2 (upcoming)
SQL Server 2016

IMPROVEMENT New SQL Server Messages Help Diagnose Lease Timeout

To provide additional insight, new messages have been added to SQL Server. Below is a list and explanation for what each provides.

Error

Error Message

Cause

Corrective Action

19419 The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because the existing lease is no longer valid. The lease worker on the SQL Server side did not get scheduled on time to process event signal from the cluster. Check the CPU utilization on the server as SQL Server lease worker seems to be starving.
19420 The availability group '%.*ls' is explicitly asked to stop the lease renewal. The lease renewal is stopping as a part of bringing the availability group offline. This is informational only.
19421 The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because renewal didn't happen within lease interval. The lease helper on the cluster side did not signal the SQL Server lease worker on time. Check corresponding availability group resource in WSFC cluster to see if it reported any error.
19422 The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because of a windows error with Error code ('%d'). The lease worker on SQL Server side failed to renew the lease because of a windows error. Check windows error code and take the corrective action.
19423 The lease of availability group '%.*ls' lease is no longer valid to start the lease renewal process. When the lease worker started processing the excess lease time provided by online call the lease was already expired. This might happened because of scheduling issues. Check the CPU utilization on the server as SQL Server lease worker seems to be starving.
19424 The lease worker of availability group '%.*ls' is now sleeping the excess lease time (%u ms) supplied during online. This is an informational message only. No user action is required. Informational. Extra online time allotted to starting the lease renewal thread and as part of the availability group online routine.

 

Example 19419 As an example, I use a debugger to attach to SQL Server, which interrupts any servicing of threads in the SQL Server process until I resume the SQL Server process. When I resume SQL Server, the following is reported in the SQL Server error log:

2016-02-16 11:37:03.05 Server      Error: 19419, Severity: 16, State: 1.
2016-02-16 11:37:03.05 Server      Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group 'ag' within the lease timeout period.
2016-02-16 11:37:03.07 Server      Error: 19407, Severity: 16, State: 1.
2016-02-16 11:37:03.07 Server      The lease between availability group 'ag' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
2016-02-16 11:37:03.07 Server      AlwaysOn: The local replica of availability group 'ag' is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
2016-02-16 11:37:03.07 Server      The state of the local availability replica in availability group 'ag' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

The 19419 is raised because SQL Server was not responsive to Cluster. We also get the 19407 error which reports the lease timeout.

Example 19424 Here is the excess lease time message reported just prior to the availability group transitioning to PRIMARY role.

2016-02-16 11:31:54.76 Server      The lease worker of availability group 'ag' is now sleeping the excess lease time (164766 ms) supplied during online. This is an informational message only. No user action is required.
2016-02-16 11:31:54.80 Server      The state of the local availability replica in availability group 'ag' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

 

IMPROVEMENT Extended Event Reports Lease Timeout With Lease Stages

The availability_group_lease_expired and hadr_ag_lease_renewal and XEvents have been improved, with the addition of data points that provide more information on the condition of the lease.

availability_group_lease_expired hadr_ag_lease_renewal
image image

 

When availability_group_lease_expired XEvent is raised, expect that the current_time to be greater than the new_timeout, which is the current deadline that the lease must renew by. In the above example, calculating current_time – new_timeout, indicates that the lease expiration was reported 8,203 ms after the timeout deadline.

The hadr_ag_lease_renewal XEvent will report RenewSucceeded state every five seconds.

Here is more information on the new data points for each of these XEvents:

XEvent New Column Description
availability_group_lease_expired current_time Time at which the lease expired
availability_group_lease_expired new_timeout Time out time, when availability_group_lease_expired is raised, current_time is greater than new_timeout
availability_group_lease_expired state Lease stages: see Lease Stages table below
hadr_ag_lease_renewal state hadr_ag_lease_renewal
hadr_ag_lease_renewal error_code If state is HadrLeaseRenewal_FailedWithWindowsError then error_code is the Windows error code associated with the failure

 

Lease Stages and definitions

The following table lists the possible lease stages and explains their function.

Stage Name Description
HadrLeaseRenewal_LeaseWorkerStarted Lease worker thread started.
HadrLeaseRenewal_StartedExcessLeaseSleep Starting excess lease. Excess lease stages document the starting of the lease thread during the online phase of the availability group.
HadrLeaseRenewal_FailedExcessSleepInvalidOnlineLease We fail the excess lease if the lease is already expired.
HadrLeaseRenewal_SkipExcessSleep We skip the excess lease if the duration available to sleep is less than the lease interval. There is no need to go through the excess lease just start the hand shake process.
HadrLeaseRenewal_ExcessSleepSucceeded Excess lease succeeded.
HadrLeaseRenewal_RenewSucceeded We should see this with every renewal.
HadrLeaseRenewal_LeaseNotValid Equivalent to error: 19419Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group '%.*ls' within the lease timeout period.
HadrLeaseRenewal_StopLeaseRenewal You should see this during a failover event.
HadrLeaseRenewal_LeaseExpired Equivalent to error: 19421SQL Server hosting availability group '%.*ls' did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.
HadrLeaseRenewal_FailedWithWindowsError Lease renewal failed because of a windows error.

 

IMPROVEMENT System Performance Data Reported On Lease Timeout

IMPORTANT This improvement was not ported to SQL Server 2012, it will only be available in SQL Server 2014 Service Pack 2 and SQL Server 2016.

Lease timeouts are often a symptom of a system wide event that prevents lease renewal for a lease timeout period. Unresponsive disk subsystem, 100% CPU utilization, or in the case of virtualization, contention on the host server, are examples of issues that could trigger the lease timeout.

In the event of a lease timeout, the following system performance objects are reported in the Cluster log: Processor time (%), Available memory (bytes), Avg disk sec/ read and Avg disk sec/write are now reported. The last five retained sets of performance data are reported, whose collection interval is dictated by the frequency at which sp_server_diagnostics results are processed by the SQL Server resource DLL (hosted in RHS.EXE), which by default, is every 10 seconds.

Below is an example of the data set. Assuming 10 second interval, the performance data reports on these objects for the last 50 seconds. The first message below reports the lease timeout. The next message is a heading of the forthcoming performance data in the next five messages, so you can read the output like a table:

00002968.000019c8::2015/09/03-17:20:40.052 WARN  [RES] SQL Server Availability Group: [hadrag] Lease timeout detected, logging perf counter data collected so far

00002968.000019c8::2015/09/03-17:20:40.052 WARN  [RES] SQL Server Availability Group: [hadrag] Date/Time, Processor time(%), Available memory(bytes), Avg disk read(secs), Avg disk write(secs)

00002968.000019c8::2015/09/03-17:20:40.052 WARN  [RES] SQL Server Availability Group: [hadrag] 9/3/2015 17:19:56.0, 3.216014, 2219802624.000000, 0.001333, 0.000355

00002968.000019c8::2015/09/03-17:20:40.052 WARN  [RES] SQL Server Availability Group: [hadrag] 9/3/2015 17:20:6.0, 3.010540, 2219888640.000000, 0.001150, 0.000963

00002968.000019c8::2015/09/03-17:20:40.052 WARN  [RES] SQL Server Availability Group: [hadrag] 9/3/2015 17:20:16.0, 5.524332, 2224078848.000000, 0.000000, 0.000697

00002968.000019c8::2015/09/03-17:20:40.052 WARN  [RES] SQL Server Availability Group: [hadrag] 9/3/2015 17:20:26.0, 3.334432, 2224504832.000000, 0.001367, 0.000438

00002968.000019c8::2015/09/03-17:20:40.052 WARN  [RES] SQL Server Availability Group: [hadrag] 9/3/2015 17:20:36.0, 9.189380, 2225078272.000000, 0.001367, 0.000834

IMPROVEMENT Extended Event Reports Lease Timeout With Lease Stages

The cluster log has a new message that reports the timeout and the expected renewal deadline that was violated. Here is the message.

The lease is expired. The lease should have been renewed by (year)/(month)/(day)-(hour):(min)(sec).(msec)

Example When a 19407 is reported in the SQL Server error log, a corresponding event is now recorded in the Cluster log.

Here is an example of the 19407 reported in the SQL Server error log. Note the timestamp of the 19407 in the SQL Server error log and the timestamp of the cluster log message. SQL Server is reporting local time and the Cluster log is reporting in UTC, which explains the seven hour difference. However, importantly, the cluster log reports the lease expired 10 seconds after its expected renewal (21:04:18.710 versus 21:04:08.711) which makes sense since that would be the default timeout period (20 seconds divided by 2).

2016-02-16 13:09:17.63 Server      Error: 19407, Severity: 16, State: 1.
2016-02-16 13:09:17.63 Server      The lease between availability group 'ag' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

Here is the corresponding error reported in the Cluster log (note the time is in UTC):

00000690.00001330::2016/02/16-21:04:18.710 ERR   [RES] SQL Server Availability Group <ag>: [hadrag] The lease is expired. The lease should have been renewed by 2016/02/16-21:04:08.711

Comments

  • Anonymous
    December 14, 2016
    You say "By setting FAILURE_CONDITION_LEVEL to 0, you can temporarily disable the lease timeout health check, but Microsoft only recommends this for temporary periods when diagnosing the root cause of the lease timeouts."I'm trying to temporarily disable that check while diagnosing the root cause, but my server will not allow me to change that value to anything but 1-5. Is there a special step I need to take.
  • Anonymous
    March 08, 2017
    The improved diagnostics are helpful. Using them, we learned that the occasional heartbeat problems we had are due to CPU contention (Error 19419). The corrective action that Microsoft recommends for this isn't very helpful. "Checking the CPU Utilization" doesn't do much to solve the problem.Are you saying Add more cores? In some cases (you're using SQL Server 2016 Standard and have already maxed out the cores) that can be a really expensive option. In our particular case, we're very happy with the length of time it takes our server to get the workload done. The only reason why we would need to add cores is so we don't get the AAG heartbeat failures. Why doesn't the SQL Server Lease Worker get dedicated CPU? Since it is critical to AlwaysOn being "always on", it seems that server load shouldn't be able to starve the worker and defeat the purpose of AlwaysOn. I hope Microsoft will consider this suggestion to improve the quality of the solution.
  • Anonymous
    March 11, 2018
    Thanks for explaining Lease timeout in details.