Windows 2012 R2 server fails to establish outbound connections
Hi there,
It has been a long while since I last blogged here, and it is time to come back and continue sharing our field experiences with the IT community, hoping to shed light on similar problems.
I was asked to look into a customer problem where end users were reporting issues such as "cannot access the file server, getting authentication prompts", and the IT admins were seeing related symptoms: the server wasn't applying GPOs properly, the Netlogon service was complaining about DC access issues, and so on. At times, they were even able to reproduce the issue manually by running a "telnet DC-IP 389" command from the affected server.
There could be many reasons behind such symptoms, so I decided to collect a number of logs while the issue was being reproduced:
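If telnet is not installed on the affected server, the same manual check can be approximated with a few lines of Python. This is only a minimal sketch: the function name try_connect is hypothetical, and treating Winsock errors 10048 and 10055 as hints of port exhaustion is an assumption, since the exact error code depends on the OS version and code path.

```python
import socket

# Winsock error codes that are often reported when no local port can be
# acquired. Assumption: the exact code varies by OS version and code path.
WSAEADDRINUSE = 10048
WSAENOBUFS = 10055


def try_connect(host, port, timeout=5.0):
    """Attempt one outbound TCP connection and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # The attempt left the machine but nobody answered in time.
        return "timed out"
    except OSError as exc:
        # On Windows the Winsock code is in exc.winerror; elsewhere use errno.
        code = getattr(exc, "winerror", None) or exc.errno
        if code in (WSAEADDRINUSE, WSAENOBUFS):
            return f"local failure, possible port exhaustion (error {code})"
        return f"failed (error {code}: {exc.strerror})"
```

For example, try_connect("DC-IP", 389) mirrors the telnet test above, with DC-IP replaced by the address of a domain controller.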
a) TCPIP ETL trace:
You can collect it with the commands below on a Windows client/server (from an elevated command prompt):
netsh trace start capture=yes scenario=InternetClient
<<repro>>
netsh trace stop
b) Network trace:
This can be collected in different ways, for example with the netsh command above (capture=yes), Wireshark, Network Monitor or Message Analyzer.
c) Handle outputs
This can be collected as follows:
Note: The Handle tool (Handle v4.1) can be downloaded from the following link: https://technet.microsoft.com/en-us/sysinternals/handle.aspx
handle.exe -a -u >> %computername%_handledetails.txt
handle.exe -s >> %computername%_handlesummary.txt
ANALYSIS:
========
The logs were collected while reproducing the problem with the telnet command on the server. After the logs were shared with us, I checked various things to understand why the outbound connection might be failing. (By the way, the file server not being able to authenticate incoming users was also a side effect of this issue, since the file server wasn't able to verify the client credentials over the Netlogon secure channel.)
1) I first checked the network traces, but there were no outgoing connection attempts (no TCP SYNs sent to the target server), which means the issue was local to the server itself.
2) Then I checked the TCPIP ETL trace and observed the root cause:
Note: You can open the ETL file generated by the netsh command in Network Monitor or Message Analyzer.
[0]03E0.5214::01/04/18-15:07:37.5237622 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..
[0]58F0.4558::01/04/18-15:07:51.8242042 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..
[0]04D8.072C::01/04/18-15:07:52.0110322 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..
...
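That clearly explained why the outbound connections were failing: PORT EXHAUSTION. The TCP/IP stack could not acquire a local (ephemeral) port for the new connection, so the bind failed before any SYN could be sent, which is also why nothing showed up in the network trace.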
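When the trace contains many such events, it can help to count them per process. The following is a rough sketch that assumes the events have been exported to text in the format shown above, and that the line prefix is [CPU]PID.TID:: with the process and thread IDs in hexadecimal (this is my reading of the prefix, so please verify it against your own trace). The function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   [0]03E0.5214::01/04/18-15:07:37.5237622 [Microsoft-Windows-TCPIP/Diagnostic] TCP: ... bind failed: ...
# Assumption: the prefix is [CPU]PID.TID:: with PID and TID in hexadecimal.
LINE_RE = re.compile(r"^\[\d+\]([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)::(\S+)\s.*bind failed")


def count_bind_failures(lines):
    """Return a Counter mapping decimal process ID to bind-failure count."""
    failures = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            failures[int(match.group(1), 16)] += 1
    return failures
```

Feeding it the lines of the exported trace file gives one counter entry per process ID, which can then be matched against the process list on the server.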
3) The underlying cause of the port exhaustion was a socket leak in an outdated 3rd party AV software. In the handle.exe output below, the process holds a very large number of \Device\Afd (socket) handles:
Note: The process name was deliberately changed
92355 ABC.exe pid: 1148 NT AUTHORITY\SYSTEM
92517 144: File (---) \Device\Afd
92519 148: File (---) \Device\Afd
92627 220: File (---) \Device\Afd
92629 224: File (---) \Device\Afd
92633 22C: File (---) \Device\Afd
92635 230: File (---) \Device\Afd
92689 29C: File (---) \Device\Afd
92701 2B4: File (---) \Device\Afd
92703 2B8: File (---) \Device\Afd
92705 2BC: File (---) \Device\Afd
92707 2C0: File (---) \Device\Afd
92743 308: File (---) \Device\Afd
92755 320: File (---) \Device\Afd
92761 32C: File (---) \Device\Afd
92767 338: File (---) \Device\Afd
92771 340: File (---) \Device\Afd
92773 344: File (---) \Device\Afd
92779 350: File (---) \Device\Afd
92881 420: File (---) \Device\Afd
92897 440: File (---) \Device\Afd
92899 444: File (---) \Device\Afd
92927 47C: File (---) \Device\Afd
92929 480: File (---) \Device\Afd
92933 488: File (---) \Device\Afd
92935 48C: File (---) \Device\Afd
92941 498: File (---) \Device\Afd
92977 4E0: File (---) \Device\Afd
92993 500: File (---) \Device\Afd
93053 578: File (---) \Device\Afd
93073 5A0: File (---) \Device\Afd
93075 5A4: File (---) \Device\Afd
93077 5A8: File (---) \Device\Afd
93079 5AC: File (---) \Device\Afd
93093 5C8: File (---) \Device\Afd
93113 5F0: File (---) \Device\Afd
93145 630: File (---) \Device\Afd
93165 658: File (---) \Device\Afd
93167 65C: File (---) \Device\Afd
93175 66C: File (---) \Device\Afd
93195 694: File (---) \Device\Afd
93199 69C: File (---) \Device\Afd
93217 6C0: File (---) \Device\Afd
93219 6C4: File (---) \Device\Afd
93227 6D4: File (---) \Device\Afd
93239 6EC: File (---) \Device\Afd
93249 700: File (---) \Device\Afd
93253 708: File (---) \Device\Afd
93265 720: File (---) \Device\Afd
93269 728: File (---) \Device\Afd
93271 72C: File (---) \Device\Afd
93273 730: File (---) \Device\Afd
93275 734: File (---) \Device\Afd
93277 738: File (---) \Device\Afd
93281 740: File (---) \Device\Afd
93283 744: File (---) \Device\Afd
93285 748: File (---) \Device\Afd
93297 760: File (---) \Device\Afd
93299 764: File (---) \Device\Afd
93301 768: File (---) \Device\Afd
93305 770: File (---) \Device\Afd
93307 774: File (---) \Device\Afd
93313 780: File (---) \Device\Afd
93317 788: File (---) \Device\Afd
93321 790: File (---) \Device\Afd
93323 794: File (---) \Device\Afd
93327 79C: File (---) \Device\Afd
93329 7A0: File (---) \Device\Afd
93331 7A4: File (---) \Device\Afd
93333 7A8: File (---) \Device\Afd
93335 7AC: File (---) \Device\Afd
93339 7B4: File (---) \Device\Afd
93343 7BC: File (---) \Device\Afd
93355 7D4: File (---) \Device\Afd
93357 7D8: File (---) \Device\Afd
93359 7DC: File (---) \Device\Afd
93361 7E0: File (---) \Device\Afd
93365 7E8: File (---) \Device\Afd
93373 7F8: File (---) \Device\Afd
93383 810: File (---) \Device\Afd
93389 81C: File (---) \Device\Afd
…
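Scrolling through tens of thousands of lines to find the leaking process is tedious, so the counting can be scripted. Below is a minimal sketch that tallies \Device\Afd handles per process from the text produced by "handle.exe -a". It assumes the layout shown above (a process header line followed by its handle lines) and tolerates an optional leading line number; the function name is hypothetical.

```python
import re
from collections import Counter

# Process header, e.g. "ABC.exe pid: 1148 NT AUTHORITY\SYSTEM"
# (optionally preceded by a line number, as in the excerpt above).
PROC_RE = re.compile(r"^(?:\d+\s+)?(\S.*?)\s+pid:\s+(\d+)\b")
# Handle line ending in \Device\Afd, e.g. "144: File (---) \Device\Afd".
AFD_RE = re.compile(r"^(?:\d+\s+)?[0-9A-Fa-f]+:\s+File\b.*\\Device\\Afd\s*$")


def count_afd_handles(lines):
    """Return a Counter mapping (process name, pid) to Afd handle count."""
    counts = Counter()
    current = None
    for raw in lines:
        line = raw.strip()
        if AFD_RE.match(line):
            # Attribute the socket handle to the most recent process header.
            if current is not None:
                counts[current] += 1
            continue
        proc = PROC_RE.match(line)
        if proc:
            current = (proc.group(1), int(proc.group(2)))
    return counts
```

Calling count_afd_handles on the lines of %computername%_handledetails.txt and then most_common(5) on the result lists the processes holding the most socket handles.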
RESOLUTION:
===========
So we advised the customer to update the 3rd party AV software. Apart from that, you can take the following actions to avoid possible port exhaustion issues:
a) Make sure that the Windows OS is running the latest rollups/security updates.
b) Make sure that all 3rd party software is up to date (including firewall, AV, backup or any other software that might have to establish outbound connections frequently).
c) Finally, you may consider extending the dynamic port range on busy servers that have to establish many outbound connections very frequently. The following is close to the largest range that you can set, but you may prefer to extend the range in phases instead of maxing it out at the very beginning (from an elevated command prompt):
netsh int ipv4 set dynamicport tcp start=1025 num=64500
netsh int ipv4 set dynamicport udp start=1025 num=64500
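To see what a given start/num pair actually means, the arithmetic can be sketched as below. The limits used here (start port of at least 1025, at least 255 ports, last port no higher than 65535) are the bounds I understand netsh to enforce; treat them as assumptions and check them on your OS version. The function name is hypothetical.

```python
MIN_START = 1025   # assumed lowest start port accepted by netsh
MIN_COUNT = 255    # assumed smallest number of ports accepted by netsh
MAX_PORT = 65535   # highest valid TCP/UDP port number


def dynamic_port_range(start, num):
    """Return (first_port, last_port) or raise ValueError if out of bounds."""
    last = start + num - 1
    if start < MIN_START:
        raise ValueError(f"start must be at least {MIN_START}")
    if num < MIN_COUNT:
        raise ValueError(f"num must be at least {MIN_COUNT}")
    if last > MAX_PORT:
        raise ValueError(f"range ends at {last}, beyond {MAX_PORT}")
    return start, last
```

With the values above, dynamic_port_range(1025, 64500) gives (1025, 65524), while the Windows default of start=49152 num=16384 gives (49152, 65535).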
You can also decrease the TcpTimedWaitDelay registry value on the servers (you may lower it to 30 seconds):
TcpTimedWaitDelay: https://technet.microsoft.com/en-us/library/cc757512(v=ws.10).aspx
The TcpTimedWaitDelay value determines the length of time that a connection stays in the TIME_WAIT state when being closed. While a connection is in the TIME_WAIT state, the socket pair cannot be reused. This is also known as the 2MSL state because the value should be twice the maximum segment lifetime on the network. To adjust the TcpTimedWaitDelay settings, you have to modify/create the registry settings as listed below:
Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Value: TcpTimedWaitDelay
Data Type: REG_DWORD
Range: 30-300 (decimal, in seconds)
Default value: 0x78 (120 decimal)
Recommended value: 30
Value exists by default? No, needs to be added.
Note: This change requires a server reboot
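If the value has to be rolled out to several servers, the command line can be generated instead of typed by hand. The sketch below only builds a reg.exe command string from the key, value name, data type and range listed above; it does not touch the registry itself, and the function name is hypothetical.

```python
# Registry location and value name as listed in the table above.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "TcpTimedWaitDelay"


def reg_add_command(seconds):
    """Build a reg.exe command that sets TcpTimedWaitDelay to `seconds`."""
    # Enforce the documented range of 30-300 seconds.
    if not 30 <= seconds <= 300:
        raise ValueError("TcpTimedWaitDelay must be between 30 and 300")
    return f'reg add "{KEY}" /v {VALUE_NAME} /t REG_DWORD /d {seconds} /f'
```

The resulting string for reg_add_command(30) can be run from an elevated command prompt, followed by the reboot mentioned above.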
Please note that the same techniques can be applied to virtually any Windows version from Windows 7/Windows Server 2008 R2 onwards.
Hope this helps
Thanks,
Murat