Problem Updating ASP.NET Web Content in a Failover Cluster
ASP/ASP.NET allows you to change web site content while the site is live on IIS. Recently, however, we found a problem with this in an IIS failover cluster.
Issue
The setup is a failover cluster of two IIS 6 servers hosting a test web site that contains a simple HTML page and an ASPX page. The web site is stored on a disk resource in the failover cluster.
Both pages have exactly the same content, a line of text “ABC”.
When we open IE on a third machine and navigate to the two pages using the cluster name in the address bar (such as https://clustername/test.htm), “ABC” is displayed. Note that we are actually accessing the IIS site on the active node (let’s name it Node A).
Now we force the cluster to fail over to Node B, and then use Notepad to modify the pages, replacing “ABC” with “123”. Done? This time IE will show “123”. Right?
The problem only appears when we fail over to Node A again. What do you see now when IE refreshes?
On my test box, the HTML page displays “123” as expected, but the ASPX page still displays “ABC”. Why?
Analysis
The first thing I tried was to check the files on the disk resource. Strangely, the files were up to date, so it looked like a caching issue. I then simply reset IIS with “iisreset” at a command prompt, and this time IE showed the right page.
So why does ASP.NET not pick up the changes here, when it works fine in a single server environment (and, in fact, in an NLB cluster as well)? We employed a lot of tools and finally narrowed the problem down to the File Change Notification (FCN) mechanism, which ASP.NET uses to monitor changes to web content.
When we access an ASP.NET site on one node, a w3wp.exe process is launched to host ASP.NET, and ASP.NET automatically starts monitoring the web content. When FCN reports a change, w3wp.exe clears the ASP.NET cached content and uses the latest web content.
However, when a failover happens and the disk resource moves to another node, the FCN registration on the original node becomes invalid. As a result, even when that node becomes active again after another failover, IIS/ASP.NET has no way to know that the web content was changed, and the old content is served until we restart IIS (or recycle the application pool).
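Recycling the application pool is a lighter-weight alternative to a full iisreset when confirming this diagnosis. The following is only a minimal sketch for IIS 6 that uses the IIS ADSI provider; the pool name DefaultAppPool is an assumption, so substitute the pool that actually hosts the site.

```vbscript
Option Explicit

' Assumption: the site runs in the application pool named "DefaultAppPool".
Const APP_POOL_PATH = "IIS://localhost/W3SVC/AppPools/DefaultAppPool"

Dim appPool
' Bind to the application pool through the IIS ADSI provider.
Set appPool = GetObject(APP_POOL_PATH)

' Recycling replaces the current w3wp.exe with a fresh one, which
' registers new file change notifications and starts without the
' stale ASP.NET cached content.
appPool.Recycle

WScript.Echo "Recycled " & APP_POOL_PATH
```

Saved as, for example, recycle.vbs (a hypothetical file name), it can be run on the affected node with "cscript recycle.vbs".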
Resolution
An interesting thing to notice is that you will only hit this problem if failovers happen frequently. By default, IIS stops idle worker processes after 20 minutes, so the stale worker process on the passive node eventually goes away, and the ASP.NET cached content is rebuilt once a new worker process starts.
So the possible workarounds are:
1. Reset IIS on Active Node After Failover
The disadvantage is that service is interrupted.
2. Reset IIS on Passive Node Before Failover
This has a smaller impact on service, as the passive node is not serving incoming requests when it is reset. However, someone has to be involved in the process to perform the reset, which is not optimal.
3. Decrease Idle Time Allowed For Worker Process
If there are no running w3wp processes on the passive node, this problem will not happen after failover. This gives us the third workaround: we can change the IIS 6 application pool settings so that IIS shuts down idle worker processes on the passive node sooner than the default (20 minutes). As long as the interval between failovers is longer than this idle time setting, the problem is prevented.
This workaround still has a drawback. Worker processes are shut down more frequently, which may not be optimal for ASP.NET applications that have to be recompiled at startup, since the recompilation takes time.
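For illustration, the idle timeout of an IIS 6 application pool can be changed with a few lines of script through the IdleTimeout metabase property. This is only a sketch under assumptions: the pool name DefaultAppPool and the 5-minute value are placeholders, and the right value depends on how often failovers occur in your environment.

```vbscript
Option Explicit

' Assumptions: the pool name and the 5-minute value are illustrative only.
Const APP_POOL_PATH = "IIS://localhost/W3SVC/AppPools/DefaultAppPool"
Const IDLE_MINUTES = 5

Dim appPool
' Bind to the application pool through the IIS ADSI provider.
Set appPool = GetObject(APP_POOL_PATH)

WScript.Echo "Current IdleTimeout (minutes): " & appPool.IdleTimeout

' IdleTimeout is expressed in minutes; the IIS 6 default is 20.
appPool.IdleTimeout = IDLE_MINUTES
' Write the change back to the metabase.
appPool.SetInfo

WScript.Echo "New IdleTimeout (minutes): " & appPool.IdleTimeout
```

Because each node keeps its own metabase, a script like this would need to be run on both nodes.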
4. Use a Customized IIS Cluster Script
This is by far the most efficient way we found. We add a custom section to the standard IIS cluster script clusweb.vbs (the modifications are in the Offline function) that shuts down existing w3wp processes during failover, so that no w3wp.exe is left running on the passive node.
'Sample code for IIS 6
Function Offline( )
    'Connect to WMI on the local computer
    strComputer = "."
    Set objWMIService = GetObject("winmgmts:" _
        & "{impersonationLevel=impersonate}!\\" _
        & strComputer & "\root\cimv2")
    'Find and terminate every running worker process
    Set w3wpProcessList = objWMIService.ExecQuery _
        ("Select * from Win32_Process Where name = 'w3wp.exe'")
    For Each w3wpProcess in w3wpProcessList
        w3wpProcess.Terminate()
    Next
    'Report success to the cluster service
    Offline = True
End Function
'For IIS 7 there is another way.
'APPLICATION_POOLS_SECTION_NAME, CONFIG_APPHOST_ROOT, APP_POOL_NAME and
'FindAppPoolIndex are not defined in this excerpt and must come from the
'rest of the cluster script.
Dim STOP_APP_POOL
STOP_APP_POOL = 1

'Stop the application pool for the website
Function StopAppPool()
    Dim ahwriter, appPoolsSection, appPoolsCollection, index, appPool, appPoolMethods, stopMethod, callStopMethod
    Set ahwriter = CreateObject("Microsoft.ApplicationHost.WritableAdminManager")
    Set appPoolsSection = ahwriter.GetAdminSection(APPLICATION_POOLS_SECTION_NAME, CONFIG_APPHOST_ROOT)
    Set appPoolsCollection = appPoolsSection.Collection
    index = FindAppPoolIndex(appPoolsCollection, APP_POOL_NAME)
    Set appPool = appPoolsCollection.Item(index)

    'Nothing to do if the pool is not in the started state (state 1)
    If appPool.GetPropertyByName("state").Value <> 1 Then
        StopAppPool = True
        Exit Function
    End If

    'Try to stop the application pool
    Set appPoolMethods = appPool.Methods
    Set stopMethod = appPoolMethods.Item(STOP_APP_POOL)
    Set callStopMethod = stopMethod.CreateInstance()
    callStopMethod.Execute()

    'Return True if the pool is no longer started, otherwise False
    If appPool.GetPropertyByName("state").Value <> 1 Then
        StopAppPool = True
    Else
        StopAppPool = False
    End If
End Function

Function Offline( )
    StopAppPool()
    Offline = True
End Function
Also note that an IIS NLB cluster does not experience this problem and has other advantages over a failover cluster, so using an NLB cluster for an HA setup is highly recommended (more information is provided in KB970759).
Last Question
Does this problem also apply to ASP pages? You can set up a test environment to have a look.
Regards,
Lex Li