Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>

As the title suggests, this post is about a specific issue I came across at a customer site recently, in a new deployment of SharePoint 2013, that relates to the distributed cache service. It was definitely one of the more challenging ones to troubleshoot, so I figured I should capture the result here in case it helps someone else.

So here is the situation. We had a SharePoint farm with a number of web front end (WFE) servers, and we had chosen to run the distributed cache service on 2 dedicated servers in the farm. In our initial testing we saw that the performance of SharePoint was well below what we would have expected for a farm of that scale, and we needed to get to the bottom of it.

In typical fashion I started with the developer dashboard to identify the slow-loading part of the page, and could see that the problem was related to the SharePoint claims provider in the authentication validation part of the page. Turning to the ULS logs for more detail, we saw a lot of the error below (I've truncated the stack trace because it does go on a fair way).

 Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage 'DistributedLogonTokenCache'
 - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a 
temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy 
network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has 
been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache 
hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.) 
---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing 
your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout 
was '10675199.02:48:05.4775807'. ---> System.IO.IOException: The read operation failed, see inner exception. 
---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing 
your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket 
timeout was '10675199.02:48:05.4775807'. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by 
the remote host 
 at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags) 
 at System.ServiceModel.Channels.SocketConnection.ReadCore(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, Boolean closing) 
 --- End of inner exception stack trace --- 
 at System.ServiceModel.Channels.SocketConnection.ReadCore(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, Boolean closing) 
 at System.ServiceModel.Channels.SocketConnection.Read(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout) 
 at System.ServiceModel.Channels.ConnectionStream.Read(Byte[] buffer, Int32 offset, Int32 count) 
 at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count) 
 at System.Net.Security.NegotiateStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest) 
 at System.Net.Security.NegotiateStream.StartReading(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest) 
 at System.Net.Security.NegotiateStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest) 
 --- End of inner exception stack trace --- 
 

This was happening on pretty much every request for content in SharePoint, so it represented a pretty big problem. Looking at the details in the exception, it appeared that the traffic from the WFE servers was reaching the cache servers correctly, but the cache servers were actively refusing the connection - leaving SharePoint without a cache and causing the performance issue.

After doing some reading up on the expectations of the AppFabric service, validating that it was configured correctly (and even recreating the cache from scratch), I was still seeing the problem. Spending some time talking with a couple of my colleagues put me on the right path, though: if the cache servers were actively refusing the connection there must be a reason for it, and permissions were the first thing that came to mind. So step one was to validate the permissions for the cache using the PowerShell command Get-CacheAllowedClientAccounts. This returned two group names, "WSS_WPG" and "WSS_ADMIN_WPG". A quick look at these groups showed that the SharePoint farm account was in there, as was the account running my distributed cache service. The glaring omission, though, was the account that runs the application pool for my SharePoint sites.
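For reference, the check described above looks roughly like the following when run from an elevated SharePoint 2013 Management Shell on one of the cache hosts. This is only a sketch: it assumes the AppFabric caching cmdlets are available in that session, and the group names are the two that were returned in this particular farm, so yours may differ.

```powershell
# Point the session at the cache cluster this host belongs to.
Use-CacheCluster

# List the accounts and groups that are allowed to connect to the cache as clients.
Get-CacheAllowedClientAccounts

# List the members of each local group returned above
# (WSS_WPG and WSS_ADMIN_WPG in the farm described in this post).
net localgroup WSS_WPG
net localgroup WSS_ADMIN_WPG

# Show which account each web application's application pool runs as,
# so it can be compared against the group memberships listed above.
Get-SPWebApplication |
    Select-Object Url, @{ Name = 'AppPoolAccount'; Expression = { $_.ApplicationPool.Username } }
```

If an application pool account does not appear in any of the allowed groups on the cache hosts, you are probably looking at the same situation I was.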

The solution: we took the account that runs the application pool for the SharePoint web applications and added it to the WSS_WPG group on just the dedicated distributed cache servers. As soon as this was done the errors stopped in the ULS logs, and page load times went from over 6000ms to less than 200ms - a pretty big difference! So there it is; hopefully that saves someone else a bit of time if you come across the same issue.
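As a rough sketch of that fix, the group membership can be added from an elevated prompt on each dedicated cache server. The account name below is hypothetical (it is not the one from the customer farm), so substitute the identity your web application pool actually runs as.

```powershell
# Hypothetical account name, used for illustration only;
# replace it with the identity of your web application pool.
$appPoolAccount = 'CONTOSO\svc-sp-apppool'

# Add the account to the local WSS_WPG group on this cache host.
net localgroup WSS_WPG $appPoolAccount /add

# Confirm that the account now appears in the group.
net localgroup WSS_WPG
```

Repeat this on each server that runs the distributed cache service, then watch the ULS logs to confirm the DataCacheException entries stop.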

Comments

  • Anonymous
    June 24, 2014
    Brian, giving the App Pool account membership to the WSS_ADMIN_WPG is giving it too much rights. It should be in the WSS_WPG (worker process group) in the first place and that should be sufficient for the distributed cache. This TechNet article specifically defines that app pool accounts should be in the WSS_WPG, and as you pointed out that group should be able to interrogate the cache: technet.microsoft.com/.../cc678863(v=office.15).aspx R.

  • Anonymous
    June 24, 2014
    Hi Radi - thanks for the pick up, I've updated the post accordingly as you are correct :)

  • Anonymous
    June 25, 2014
    Hi Brian, I'm wondering if you applied one of the recent AppFabric CUs to this farm (CU 3, 4, or 5). I've seen this same error before (along with others) and resolved it after installing the CU and enabling the background garbage collection. What's interesting to me is in your exception stack it shows a timeout communicating with the cache host. I wonder why it's not throwing an access denied exception. My best guess (since I can't see the code :) is that the connection times out because the authentication keeps failing. Thanks for this post!

  • Anonymous
    January 13, 2015
    BOOM!  Thanks for this, saved me.

  • Anonymous
    March 16, 2015
    You LEGEND!! This resolved the problem for me too :)

  • Anonymous
    January 06, 2016
    Since AppFabric CU7 didn't fix my problems, I opened a case with Microsoft Support's SharePoint Admin team in December 2015 on this. The ErrorCode<ERRCA0017>:SubStatus<ES0006> was solved by setting MaxConnectionsToServer to 1 for all caches. Our token cache had it set to 100. Not good! Reference: technet.microsoft.com/.../jj219613.aspx  

  • Anonymous
    April 11, 2016
    Thank you for the informative post.