SQL Server Intermittent Connectivity Issue

Recently, many customers of an ISV I work with reported intermittent connectivity issues when running the ISV application on SQL Server. Some customers reported the issue to be specific to SQL Server 2005. Others stated that they were experiencing the same issue on both SQL Server 2000 and 2005. Because the issue was intermittent and its symptoms varied, it took us quite a while to collect all the data (ODBC trace, Netmon trace, SQL trace…), analyze it, and understand the exact cause.

SynAttackProtect

The first issue we found was a subtle Winsock behavior change in Windows 2003 SP1. Windows 2003 SP1 introduces a configurable registry setting, SynAttackProtect, that protects the server from network denial-of-service attacks. By default the protection is on. In a SQL Server environment, when the number of simultaneous client connection requests is more than the system can handle and the SQL Server backlog queue is full, the client receives a 'connection failed' error from SQL Server:

TCP Provider: An existing connection was forcibly closed by the remote host

The SQL Protocols team has a good blog post that explains in detail how the SynAttackProtect setting interacts with SQL Server. See https://blogs.msdn.com/sql_protocols/archive/2006/04/12/574608.aspx.

In Windows 2003, this issue can be worked around by setting the SynAttackProtect registry value to 0, which disables the protection.

1) Launch regedit.exe

2) Add DWORD value named SynAttackProtect under registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

3) Set data value to 0

See https://technet2.microsoft.com/windowsserver/en/library/8d3a9f4d-13d1-4280-ac57-30242504d8ba1033.mspx?mfr=true for additional information.
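For servers where the change has to be scripted rather than made by hand in regedit, the same three steps can be expressed as a `reg add` command. The following is a minimal sketch, not part of the original troubleshooting: the helper name `reg_add_dword` is hypothetical, and the script only builds and prints the command so it can be reviewed before anyone runs it from an elevated prompt on the server.

```python
# Registry key from step 2, written with the HKLM abbreviation that reg.exe accepts.
TCPIP_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def reg_add_dword(key, name, value):
    """Build the argument list for a `reg add` call that sets a DWORD value.

    Hypothetical helper for illustration; /f overwrites an existing value
    without prompting.
    """
    return ["reg", "add", key, "/v", name, "/t", "REG_DWORD", "/d", str(value), "/f"]


# Steps 2 and 3: create SynAttackProtect and set its data to 0.
cmd = reg_add_dword(TCPIP_PARAMS, "SynAttackProtect", 0)
print(" ".join(cmd))

# To apply it on the server itself (elevated prompt required):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Printing the command first keeps the change visible in a lab runbook, which matters here because the setting trades security for connection throughput.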

After applying the registry change, two customers reported that the intermittent connection issue went away. Both customers had been stress-testing SQL Server. SynAttackProtect is more likely to become an issue in a lab environment where SQL Server is being stress-tested with extreme load. Customers running stress tests in lab environments should turn off SynAttackProtect. I am not sure I’d recommend proactively turning it off in a production system, given the potential security risk. If a production system ever runs into the SynAttackProtect issue, the source of and reason for the large number of connection requests should be examined first.

Windows “Scalable Networking Pack”

The Windows Scalable Networking Pack was a second network stack change, released as Windows 2003 SP1 + KB912222 or as part of Windows 2003 SP2. With the Scalable Networking Pack, the TCP Chimney Offload feature is enabled by default to increase performance. However, the implementations on certain network cards are problematic when TCP Chimney Offload is enabled and can cause intermittent connection drops. When a connection is dropped because of an incompatibility between the network card and the Windows Scalable Networking Pack, the typical error message is

[08S01] [Microsoft][SQL Native Client]Communication link failure

A workaround for this issue is to disable the TCP Chimney Offload feature.

1) Launch regedit.exe

2) Edit DWORD EnableTCPChimney under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, set data value to 0

3) Edit DWORD EnableRSS under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, set data value to 0

4) Edit DWORD EnableTCPA under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, set data value to 0

5) Restart the server
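When several servers need the same three values changed, one option is to generate a .reg file and import it on each machine instead of editing each value in regedit. The sketch below is an illustration under that assumption: the function name `build_reg_file` is hypothetical, and the script only produces the file text. Importing the file and restarting the server (step 5) remain manual.

```python
# Registry key and value names from steps 2 to 4.
TCPIP_PARAMS = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
OFFLOAD_VALUES = ("EnableTCPChimney", "EnableRSS", "EnableTCPA")


def build_reg_file(key, names, value=0):
    """Return .reg file text that sets each named DWORD under key to value.

    Hypothetical helper for illustration only.
    """
    lines = ["Windows Registry Editor Version 5.00", "", "[" + key + "]"]
    for name in names:
        # DWORD data in a .reg file is written as eight hexadecimal digits.
        lines.append('"{0}"=dword:{1:08x}'.format(name, value))
    return "\r\n".join(lines) + "\r\n"


reg_text = build_reg_file(TCPIP_PARAMS, OFFLOAD_VALUES)
print(reg_text)
```

Keeping the three value names in one tuple makes it easy to re-enable the features later by regenerating the file with `value=1` once a fixed driver is available from the network card vendor.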

Among others, network cards using Broadcom 5708 chipsets are known to have compatibility issues with TCP Chimney Offload. Disabling TCP Chimney Offload fixes the incompatibility, but it also removes the benefit of the Windows TCP stack performance improvements. It should only be used as a workaround until a fix becomes available from the network card vendor.

See support article https://support.microsoft.com/kb/942861 for additional information.

At this point, the majority of customers had solved the intermittent connectivity issue by applying the SynAttackProtect and/or TCP Chimney Offload changes. Some customers connecting to SQL Server through Citrix also had to turn off the TCP Chimney Offload feature on the Citrix server to fix the issue.

Query Timeout

The particular ISV application runs with a configurable query timeout; by default the timeout is set to 1 second. When a query timeout happens, the application retries the query up to 16 times; if it still fails, the query is submitted again with a NOLOCK hint. How would that affect the connection? At first glance it seems irrelevant, but it is not. Here is the sequence of what could happen.

1) The query is submitted to SQL Native Client (SNAC) for execution

2) The execution of the query takes too long so the timeout expires

3) Once the timeout happens, the client attempts to cancel the query by sending an ATTN packet to the server. After sending ATTN, the client waits for the response from the server. A timeout is set for this wait; its value is obtained through SQL_ATTR_CONNECT_TIMEOUT and, if that is not set, it defaults to 120 seconds.

4) The wait for the server to respond also times out. The client treats this as an error and considers the connection dead

5) The client marks the connection as dead and then returns the “Query timeout expired” error message.

6) The application code, on seeing the timeout expired message, attempts to execute the request again on the same connection, but immediately hits the “communication link failure” message because the client has already marked the connection as dead
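The six steps above can be sketched as a small state model. This is a toy illustration, not SNAC's actual implementation: the class `ToyConnection`, its parameters, and the two exception types are hypothetical names, and durations are plain numbers in seconds rather than real waits.

```python
class QueryTimeout(Exception):
    """The query timeout expired (step 5)."""


class CommunicationLinkFailure(Exception):
    """A request was issued on a connection already marked dead (step 6)."""


class ToyConnection:
    """Toy stand-in for a client connection; all names are hypothetical."""

    def __init__(self, query_timeout, attention_timeout):
        self.query_timeout = query_timeout
        self.attention_timeout = attention_timeout
        self.dead = False

    def execute(self, query_seconds, attention_ack_seconds):
        # Step 6: a connection marked dead fails immediately.
        if self.dead:
            raise CommunicationLinkFailure("Communication link failure")
        # Steps 1-2: the query either finishes in time or the timeout expires.
        if query_seconds <= self.query_timeout:
            return "ok"
        # Steps 3-4: ATTN is sent; if the server acknowledges it too late,
        # the client gives up on the connection.
        if attention_ack_seconds > self.attention_timeout:
            self.dead = True
        # Step 5: the caller sees a query timeout either way.
        raise QueryTimeout("Query timeout expired")


# 1-second query timeout and 120-second attention wait, as in the text.
conn = ToyConnection(query_timeout=1, attention_timeout=120)

try:
    conn.execute(query_seconds=300, attention_ack_seconds=200)
except QueryTimeout as exc:
    first_error = str(exc)

# The retry on the same connection fails even though the query itself is fast.
try:
    conn.execute(query_seconds=0.1, attention_ack_seconds=0)
except CommunicationLinkFailure as exc:
    second_error = str(exc)

print(first_error)
print(second_error)
```

The model makes the key point visible: the application only sees a query timeout on the first attempt, so nothing tells it that the connection was invalidated until the retry fails.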

 

During the investigation, we discovered a regression in SQL Server 2005. SQL Server 2005 may not respond to a query cancel (timeout) request in a timely manner if the query requires an index scan through a large number of pages. Checking for attention requests is delayed while the SQL Server storage engine is busy bringing pages in from disk, resulting in

[08S01] [Microsoft][SQL Native Client]Communication link failure

A hotfix is available to fix the regression; see https://support.microsoft.com/kb/945442 for details.

All the remaining customers had some long-running batch queries that index-scanned a large table. This was the last piece we needed to complete the puzzle. After applying the hotfix, all customers reported the problem solved.

Cross Posted from https://blogs.microsoft.com/mssqlisv
