SharePoint 2007 and TokenAndPermUserStore Issues
In 2007, I was working on a large SharePoint 2007 project and we discovered that after the system ran for 3-4 days (or 24 hours of simulated stress), performance would suddenly degrade significantly (i.e. minutes for pages to load, if they would load at all). However, when we looked at all of the traditional performance indicators (memory, CPU, network throughput) on the web front-ends, application servers and database server, everything appeared healthy.
We ultimately tracked the problem down to a SQL Trace Flag 4618 needed to regularly purge the TokenAndPermUserStore cache.
This post discusses the symptoms and ultimately simple solution to the problem.
Configuration
Hardware
- 4 web-front ends
Windows Server 2003 SP1 (32-bit)
2 process web garden - 2 clustered database servers
SQL Server 2005 SP2 servers (64-bit)
"lots of memory" - Search was handled outside of SharePoint so there were no dedicated application servers
Usage Pattern
- 3 Site Collections
- 2,000 Publishing Sites
- ~18,000 Publishing Pages
Symptoms
- Significant performance degradation after 3-4 days of normal use or 24 hours of simulated stress
- Pages would take minutes to load or would simply time-out
- Web front-ends
- Low CPU usage
- Normal memory usage
- Normal network throughput
- Database servers
- Low CPU usage
- Normal memory usage
- Normal network throughput
In short, there were no obvious indications of a problem other than the slow response time...
Diagnostic Approach
On this project, I had the pleasure to work with a Microsoft Architect named Jay Gore. When Jay executed the following query against the SQL Server:
SELECT scheduler_id, current_tasks_count, runnable_tasks_count FROM sys.dm_os_schedulers WHERE scheduler_id < 255
The query returned results such as:
scheduler_id current_tasks_count runnable_tasks_count
0 19 16
1 35 31
2 36 31
3 33 29
4 33 29
5 35 31
6 33 30
7 35 31
A runnable task count of greater than zero for any substantial length of time is a problem. The observed behavior was that the database was only at 25% CPU usage but had 10 to 25 tasks in runnable state per scheduler and it was spending all it’s time waiting on SOS_SCHEDULER_YIELDS with greater than 90% signal waits vs. resource waits.
On a hunch, Jay cleared the caches (dbcc freeproccache) and suddenly performance returned to normal.
After further diagnosis, Jay traced the problem to the TokenAndPermUserStore and he was able to consistently resolve the problem by issuing the following command:
dbcc freesystemcache ('TokenAndPermUserStore')
Root Cause
By design, if SQL Server 2005 has "plenty of memory", the cache of dynamically compiled queries will continue to grow and "old items" will not be automatically purged. Eventually, the cache can grow so large that it takes a substantial amount of time to locate the appropriate compiled query.
This behavior is documented in the following Microsoft Knowledge Base articles: KB933564 and KB927396.
Resolution
To resolve this issue, we applied the SQL Server Trace Flag 4618 to the startup parameters for SQL Server 2005. This flag forces "old items" to be purged from the cache immediately.