TCP Offloading again?!
I have spent probably hundreds of hours on cases involving TCP Offloading and I know most of the signs (intermittent dropped connections, missing traffic in network traces). However, I have to admit I got burned by it the other day and spent several more hours working an issue than I should have.
I was working on a server-down case for a financial trading company (in other words, large dollars involved every minute they were down) where the customer was experiencing slow connections to SQL Server. The customer reported only some Linux ODBC clients were impacted. Based on that description, we started looking at the client side. However, we soon discovered that, while there was no detectable correlation between the clients, the problem was only visible going to a specific SQL Server instance. The affected clients had no problem communicating with other instances of SQL Server. Based on this, we started focusing on the SQL Server machine itself.
From the client application’s perspective, every query was taking roughly five seconds longer than expected. Therefore, we collected a PSSDiag and looked at the performance of the SQL Server machine as a whole. The Profiler traces showed that there was no delay inside SQL Server:
So, where were the five seconds coming from?
The next step was to look at a network trace:
Check out the two sets of timestamps circled. Both of them had a five second delta! Now we had physical proof of the problem, but we still didn’t have a reason…
Then, I noticed something that turned out to be the key: the five second delay was always between the data sent from the client and the server’s response to that data. That clinched it. I couldn’t explain yet why only some clients were impacted, but this was definitely a server-side issue. The other interesting thing to notice above is that the delay is even visible on the login! This was completely surprising because this customer was using SQL Authentication. That is a highly optimized query which should never have performance issues. This, combined with the fact that the subsequent query wasn’t showing up inside SQL Server as being delayed, caused me to start thinking about things outside of SQL Server.
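If you export a trace to a list of timestamps, the same check can be done programmatically. This is only a rough Python sketch, not what we ran on this case: the packet tuples, the `response_delays` helper and the 4-second threshold are all made up for illustration.

```python
from typing import List, Tuple

# Hypothetical record: (timestamp in seconds, sender), sender is "client" or "server".
Packet = Tuple[float, str]


def response_delays(packets: List[Packet]) -> List[float]:
    """Return the gap between each client packet and the next server packet.

    If the client sends several packets in a row, the gap is measured
    from the most recent one.
    """
    delays = []
    pending = None  # timestamp of the latest unanswered client packet
    for ts, sender in packets:
        if sender == "client":
            pending = ts
        elif sender == "server" and pending is not None:
            delays.append(ts - pending)
            pending = None
    return delays


# Made-up trace for illustration only; not data from the customer case.
trace = [
    (0.000, "client"),
    (5.002, "server"),
    (5.010, "client"),
    (10.015, "server"),
]

# Flag anything slower than an arbitrary 4-second threshold.
slow = [d for d in response_delays(trace) if d >= 4.0]
```

A consistent gap that always follows a client packet points at the server side of the conversation, which is the pattern described here.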
The next thing to check was for filter drivers that might have inserted themselves in the TCP stack – antivirus, firewall, NIC teaming, etc. Unfortunately, nothing like this was installed so there were no clues there. We also reconfirmed that TCP Chimney was turned off at the OS level. And then it hit me…NIC level TCP Offloading!!!
We pulled up the Ethernet adapter settings (Network Connections –> LAN XXX –> Properties –> Configure –> Advanced) and saw something that looked like this:
Lo and behold – TCP Offloading was enabled!
We disabled all of the Offloading settings, clicked OK and performance was back to normal. Connections were fast and query results were returning right after SQL Server generated them. I should point out that we didn’t take a stepwise approach here because this customer was losing large amounts of money every minute this system was down. In a less critical issue, it would be worth doing each setting one at a time and testing in between. In addition, I would also recommend that you go back after the fact and test enabling each setting to see if there is a negative impact. There are some non-trivial performance benefits to be gained from these settings if everything is working properly.
We never did figure out why only some clients were impacted since all of the clients were using the same driver. Nor were we able to figure out why only this SQL Server instance was impacted when several other SQL Server machines were configured the same way at the driver level.
The moral of the story? I need to update my standard steps for capturing network traces to include NIC level TCP Offload settings!
As of this morning, my first four steps for capturing a network trace now look like this:
1a. Turn off TCP Chimney if any of the machines are Windows 2003
Option 1) Bring up a command prompt and execute the following:
Netsh int ip set chimney DISABLED
Option 2) Apply the Scalable Networking Patch - https://support.microsoft.com/default.aspx?scid=kb;EN-US;936594
1b. Confirm that TCP Chimney is turned off if any of the machines are Windows 2008 (see https://support.microsoft.com/default.aspx/kb/951037 for more details)
a) bring up a command prompt and execute the following:
netsh int tcp show global
b) if it turns out TCP Chimney is on, disable it:
netsh int tcp set global chimney=disabled
2. Turn off TCP Offloading/Receive Side-Scaling/TCP Large Send Offload at the NIC driver level
3. Retry your application. Don't laugh - many, many problems are resolved by the above changes.
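If you want to script the check in step 1b, something like the following Python sketch could work. It is an assumption-laden example, not an official tool: the helper names are hypothetical, and the `Chimney Offload State` label and the sample text are my recollection of what `netsh int tcp show global` prints on Windows 2008, so verify them against your own output. It also only reports the OS-level state; the NIC driver settings from step 2 still have to be checked on the adapter itself.

```python
import re
import subprocess
from typing import Optional


def chimney_state(netsh_output: str) -> Optional[str]:
    """Extract the Chimney Offload State value from netsh output, if present."""
    match = re.search(r"Chimney Offload State\s*:\s*(\S+)", netsh_output, re.IGNORECASE)
    return match.group(1).lower() if match else None


def read_netsh_global() -> str:
    """Run 'netsh int tcp show global' (Windows only) and return its text."""
    result = subprocess.run(
        ["netsh", "int", "tcp", "show", "global"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative sample only; the exact layout is assumed, not copied from a server.
SAMPLE = (
    "Receive-Side Scaling State          : enabled\n"
    "Chimney Offload State               : disabled\n"
)

state = chimney_state(SAMPLE)
needs_change = state is not None and state != "disabled"
```

On a real machine you would pass `read_netsh_global()` to `chimney_state` instead of the sample text.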
Evan Basalik | Senior Support Escalation Engineer | Microsoft SQL Server Escalation Services
Comments
Anonymous
February 22, 2010
Hi, I've been reading up on the TCP Chimney offload and the impacts on Windows Server 2003 SP2: http://support.microsoft.com/default.aspx?scid=kb;EN-US;948496 http://statisticsio.com/Home/tabid/36/articleType/ArticleView/articleId/305/TCP-Chimney-Offload.aspx http://support.microsoft.com/default.aspx?scid=kb;EN-US;912222 http://blogs.msdn.com/psssql/archive/2008/10/01/windows-scalable-networking-pack-possible-performance-and-concurrency-impacts-to-sql-server-workloads.aspx I'm considering disabling this feature for all of my servers that run this OS. There doesn't seem to be any benefits only negative impacts.
Anonymous
February 22, 2010
From a Microsoft perspective, we have certainly learned from our mistakes in this area. Windows 2008 ships with the feature off by default and Windows 2008 sets the default behavior based on the NIC speed (http://technet.microsoft.com/en-us/library/dd883262(WS.10).aspx#BKMK_chimney). I cannot comment on the driver vendors, but I would hope they are listening to feedback both from customers and Microsoft. My general recommendation is to leave TCP Offloading off unless you find yourself stressing your server to the extent that the potential increased networking performance is worth enabling it.
Anonymous
February 22, 2010
Oops - the comment above was supposed to say "...Windows 2008 ships with the feature off by default and Windows 2008 R2 sets the default behavior based on the NIC speed..."
Anonymous
February 22, 2010
could you confirm that " Turn of TCP Offloading/Receive Side-Scaling/TCP Large Send Offload at the NIC driver level " must be done at the card and/or does " netsh int tcp show global " show the status of this? The network guys are saying this is disabled but if I open the nic settings it shows the IPv4 Checksum and IPv4 Large Send Offloads as both being enabled. Sorry, configuring TCPIP isn't one of my more usual skill sets as a DBA !
Anonymous
February 23, 2010
Is there any good reason to disable IPv4 Checksum Offload? This parameter is not mentioned in the article.
Anonymous
February 25, 2010
Why is it soooo hard to get this feature working correctly after all these years ? And if it can't be made to work then why doesn't every vendor just drop the idea ?
Anonymous
February 27, 2010
On our 2003 R2 SP2, the NIC's Advanced doesn't even have TCP Checksum Offload, so I think we're fine, though I second the question about the one we do have, IPv4 Checksum Offload. "netsh int ip show offload" comes back only with this, which I think reflects that there is no TCP Checksum Offload. Offload Options for interface "Server Local Area Connection" with index: 10003:
Anonymous
March 01, 2010
You have to be careful - the netsh command only shows the OS state. You need to check the card properties to see the state of the driver. I have not seen any issues with TCP Checksum Offload.
Anonymous
March 02, 2010
I have over 4000 servers to check for these properties being on. Does anybody know how to query WMI to check for these TCP Offload settings? An automated way to change the settings?
Anonymous
March 11, 2010
@HighPockets We would hate to see you alter settings on over 4000 servers. In general I would say that if you are not seeing an issue with your servers, then you shouldn't need to alter anything. I think the point of this blog was if you do notice a Performance issue, it may be a result of the above. It should be looked at on a case by case basis and not a blanket change to your environment. Thanks, Adam W. Saxton
Anonymous
September 27, 2010
Can you give more details on what to disable on the NIC? As another poster had asked, should IPv4 Checksum Offload also be disabled? In my case I've disabled the TCP Chimney in the OS and also disabled Large Send Offload v2 (IPv4), Large Send Offload v2 (IPv6), Receive Side Scaling, TCP Checksum Offload (IPv4), TCP Checksum Offload (IPv6). Also what about UDP Checksum Offload? Should that be disabled as well? Thanks, Jeff
Anonymous
August 04, 2011
I am curious. Besides the previously mentioned Offload options, we have a TCP Connection Offload option for the integrated Broadcom NICs on our HP servers and I was wondering if this option should be disabled (Tested) with the previously recommended disabled Offload options?
Anonymous
August 22, 2011
Is there a way to script the disable of this TCP offload for Intel and Broadcom drivers? I want to add this to all my standard windows build scripts. I am sick and tired of running into this issue myself. Microsoft really needs to revisit this, it is terrible.
Anonymous
December 14, 2011
You said: "the five second delay was always between the data sent from the client and the server’s response to that data" But that's not what the timestamps circled in red appear to show. Both samples of 5 second delay between packets show a delay between one packet sent by the server and the next packet sent by the server. The delay between client query and server response appears to be around 160ms (hard to tell because the red circles obscure the timestamps).
Anonymous
February 24, 2012
Looks like this solution is designed for a network with late model network switches where both servers are windows with TCP offload enabled on the NIC and in the OS......www.broadcom.com/.../5709-WP101.pdf...Unfortunately a windows server can also communicate to non windows servers or servers where the feature is not enabled.
Anonymous
May 12, 2012
I am happy to report so far so good.
Anonymous
September 11, 2012
You changed something and clicked OK - that resets the driver and some things - that caused the fix in my opinion.
Anonymous
February 15, 2013
For the latest guidance on SNP, please check the following article: www.windowsitpro.com/.../give-microsofts-scalable-networking-pack-140350 Ramu (Microsoft SQL Server Support)
Anonymous
November 22, 2013
Is disabling any of these disruptive? In other words does it reset the NIC and interfere with Network traffic from/to the server?
Anonymous
March 26, 2014
Thanks, I needed that. I was looking at some of my Windows 2000 servers that don't have the TCP_CHIMNEY_OFFLOAD turned on, but as it turns out, they had it turned on at the NIC. Great discovery
Anonymous
December 15, 2014
I had run into this problem briefly, and applied changes to my Win 8.1 PC. But today I ran into such dramatic problems installing from files on my networked data server that I asked Google, found this article, and disabled TCP and UDP and checksum offloading on the NIC on my Win 2008 R2 data and archive server. Trying to copy a large folder for a Symantec install to a local drive was taking an hour. Seriously. After the change, I deleted from local drive, tried again, and the whole folder was copied in 15 seconds. Flat. I had been motivated to figure this out trying to run the Symantec install from the image on the server drive, and it had simply timed out, croaked, and locked up. I had to clean up the failed install. This time, it worked in a matter of minutes. I am making these changes on NIC on all my servers and machines today. Seriously, this has solved the growing bad performance problems I have been observing across the board.
Anonymous
December 16, 2014
I too have observed these settings to cause issues in numerous instances, too many to count. I do however have a question regarding Large Send Offload. Has this been identified to have similar effect to TCP Offload?
Anonymous
October 20, 2015
we get: "Data provider or other service returned an E_FAIL status. " in our software query it from Excel : "[Microsoft][ODBC SQL Server Driver][DBNETLIB] General Network error. Check your network documentation" (customer has mssql 2008 and terminalserver 2008r2) THX