How the Chinese UI in SharePoint turns to English randomly

Problem Description:

Windows SharePoint Portal is a web application that runs on the .NET Framework and is based on ASP.NET. The administrator can configure the UI language, for example English or Chinese. One day a customer complained that their SharePoint site could no longer show Chinese. The detailed behavior was:

1. The configuration is set to Chinese.

2. The application has worked fine for the past year.

3. In most cases, the Chinese UI shows fine.

4. Occasionally, the UI shows in English.

5. It will turn back to Chinese automatically after some time.

6. If we restart the IIS service while the UI is in English, it may turn back to Chinese immediately.

Troubleshooting:

After a basic check, the configuration looked fine. With some initial debugging I found:

1. When the problem occurs, the UI culture of the customer’s worker thread is zh-cn, as expected.

2. The implementation for the UI language is based on the ResourceManager.GetString method:
ResourceManager.GetString
https://msdn2.microsoft.com/en-us/system.resources.resourcemanager.getstring.aspx

3. The satellite assemblies for the UI are installed correctly.

Based on the above information, the GetString CLR API call should return the Chinese resource. However, when the problem occurs, it returns the English resource. So the question becomes why the CLR API call behaves incorrectly. The difficult parts are:

1. We cannot reproduce the problem simply and consistently.

2. The problem occurs only on the customer’s production server, not on the test server. We need to debug remotely on the production server.

3. The problem is caused by CLR code, not the customer’s code, so we cannot add tracing to save an execution log as we did in the 3rd case.

At the same time, several similar issues were reported in Europe. In Japan it was more serious: more than 10 sites had the same issue, without a solution, for several weeks.

Since the only clue was the GetString method, the troubleshooting plan was straightforward: trace the execution of GetString. Before jumping into remote debugging, I first researched the logic of the GetString method:

1. When the UI resources are loaded from the satellite assemblies, they are saved in an internal HashTable.

2. GetString checks whether the requested resource is in the HashTable. If it is, go to step 5.

3. If it is not there, GetString asks ResourceManager to load the resource from the satellite assemblies based on the current thread's UI culture.

4. ResourceManager reads the resources from the satellite assemblies and saves them into the HashTable.

5. GetString returns the resource from the HashTable for the request.

6. If any problem occurs in the above steps, for example the satellite assemblies are not found, it falls back to the UI-neutral (English by default) resource.

7. Once the HashTable is initialized and the resources are loaded, ResourceManager is out of the picture. GetString always returns the resource found in the HashTable directly.
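To make the seven steps concrete, here is a minimal Python sketch of the lookup flow. It is a toy model of the logic as I described it, not the CLR implementation: the class name, the dictionaries standing in for satellite assemblies, and the `loads` counter are all hypothetical names chosen for illustration.

```python
class ResourceLookupSketch:
    """Toy model of the lookup steps above; NOT the real ResourceManager."""

    def __init__(self, satellites, neutral):
        self.satellites = satellites  # culture -> {key: text}; stands in for satellite assemblies
        self.neutral = neutral        # {key: text}; the UI-neutral (English) resources
        self.table = {}               # stands in for the internal HashTable
        self.loads = 0                # how many times we had to go back to the "assemblies"

    def _load(self, culture):
        # Steps 3-4: load resources for the thread's UI culture into the table.
        self.loads += 1
        try:
            resources = self.satellites[culture]
        except KeyError:
            # Step 6: anything goes wrong, fall back to the neutral resources.
            resources = self.neutral
        self.table.update(resources)

    def get_string(self, key, culture):
        # Step 2: only touch the loader when the table has no entry yet.
        if key not in self.table:
            self._load(culture)
        # Steps 5 and 7: the answer always comes from the table.
        return self.table[key]


lookup = ResourceLookupSketch({"zh-CN": {"Save": "保存"}}, {"Save": "Save"})
print(lookup.get_string("Save", "zh-CN"))
```

The point to notice is step 7: once an entry is in the table, the culture argument no longer matters, so whatever was loaded first is what every later request sees.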

Based on the above logic, it was likely that ResourceManager failed to load the satellite assemblies and used the neutral resource instead. With this in mind, I used Filemon to check the file operations on the satellite files. However, Filemon showed that all file operations on the resource files succeeded.

Having tried everything else I could think of, I had to resort to remote debugging. After talking with the customer, they allowed me to connect to the production server between 10 pm and 8 am the next morning. To keep client requests and our tests from interfering with each other, I used the following steps:

1. In the IIS settings, restrict client IPs to our test client only.

2. Set the timeout to infinite to avoid timeouts while the debugger holds the server.

3. Restart IIS every time before debugging.

4. Use the client browser to refresh the page, and trace the execution in the debugger.

IP restriction prevents interference from other clients. Restarting IIS is necessary because the HashTable is initialized only once per lifetime. After several days of hard work, I found:

1. The failure happens while loading the satellite assembly files.

2. During the satellite loading, there are two first-chance CLR exceptions.

3. When the problem occurs, some strange code paths are executed. In local tests, we do not see these code paths.

Based on the above results, I examined the code in detail. Combining the wt command with exception monitoring, I captured some dump files during the strange code paths. After some research, I found an important piece of information: the customer had installed .NET Framework 1.1 SP1. These code paths were new in SP1.

The full story is as follows. In CLR 1.1, ResourceManager starts to probe for the satellite assembly after checking the UI culture of the current thread. SP1 added a new feature: besides the UI culture setting, a configuration file can also guide the probing. For ASP.NET and SharePoint, the configuration file is web.config. However, web.config is a sensitive file, and a normal anonymous web user does not have the privilege to open it. When the code tries to check the settings in web.config and access is denied, a CLR exception is thrown. Because of the exception, ResourceManager concludes that something went wrong in resource probing and falls back immediately to the neutral resource, hence English shows instead of Chinese.

From another perspective, the developer did not anticipate that in web applications most users are not allowed to touch the application configuration file (web.config here). The web scenario was not handled specially, so the change broke a lot of web applications.

After IIS is restarted, if the first request comes from an authorized user who has the privilege to access web.config, the Chinese resource loads fine and the problem does not occur. If the first user cannot open web.config, the problem happens, and it persists until IIS is restarted because the HashTable is initialized only once per lifetime. The workaround is to grant Everyone Read (Allow) permission on web.config. This workaround resolved the urgent SPS issue globally.
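The "first request decides" behavior is easy to see in a small simulation. The Python sketch below is only a model of the failure mode described here, with made-up names (`WorkerSketch`, `config_readers`, the account names); it does not reproduce any real CLR or IIS API.

```python
class WorkerSketch:
    """Toy model of one worker process lifetime; names are hypothetical."""

    def __init__(self, config_readers):
        # Accounts allowed to read web.config (an assumption for this model).
        self.config_readers = set(config_readers)
        self.cached_language = None  # stands in for the once-per-lifetime HashTable

    def _read_config(self, user):
        # Models the SP1 code path that consults web.config during probing.
        if user not in self.config_readers:
            raise PermissionError("access denied: web.config")

    def _probe(self, user):
        try:
            self._read_config(user)
        except PermissionError:
            # The exception is treated as a probing failure: use neutral resources.
            return "English (neutral)"
        return "Chinese (zh-CN)"

    def render(self, user):
        # Only the first request after a restart triggers probing.
        if self.cached_language is None:
            self.cached_language = self._probe(user)
        return self.cached_language

    def restart(self):
        # Models an IIS restart: the cache is thrown away.
        self.cached_language = None


worker = WorkerSketch(config_readers={"admin"})
print(worker.render("anonymous"))  # first request cannot read web.config
print(worker.render("admin"))      # same cached result, even for an authorized user
worker.restart()
print(worker.render("admin"))      # first request after restart is authorized
print(worker.render("anonymous"))  # anonymous users now see the cached Chinese UI
```

In this model the workaround corresponds to adding every account to `config_readers`, so the first request can never fail on web.config.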

Modifying the access permissions on web.config may cause security issues. The final solution was to have the developers fix the code and to apply the hotfix:

FIX: Your application cannot load resources from a satellite assembly if the impersonated user account does not have permissions to access the application .config file in the .NET Framework 1.1 Service Pack 1

https://support.microsoft.com/?id=894092

Lesson:

I reviewed the troubleshooting history after finding the solution. In fact, I was quite near to the solution in my initial checks. I used Filemon to check the file operations on the satellite assemblies, but I did not go further and check other important files. When I reopened the Filemon result and searched for Access Denied, I got more than 30 places, all on the web.config file. Also, I should have realized that a recently released patch was suspicious, because the problem had not occurred in the past year but flooded in globally within several weeks.
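The search I should have done at the start can be scripted. The sketch below counts Access Denied results per file in a saved log. It assumes the log is tab-delimited with the path in the fifth column and the result in the sixth; that layout, the function name, and the column defaults are assumptions, so adjust them to match the actual export.

```python
from collections import Counter


def count_access_denied(lines, path_col=4, result_col=5):
    """Count ACCESS DENIED results per path in a tab-delimited log.

    The column positions are assumptions about the export format.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= max(path_col, result_col):
            continue  # skip headers or malformed lines
        if "ACCESS DENIED" in fields[result_col].upper():
            counts[fields[path_col].lower()] += 1
    return counts


def report(log_path):
    # Print the files with the most denied operations first.
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for path, hits in count_access_denied(handle).most_common():
            print(hits, path)
```

Grouping by path rather than filtering on the files I already suspected is the whole point: it surfaces failures on files I was not looking at.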

Clear thinking, good analysis and effective debugging are very good. However, if we can think outside the box sometimes, we can do better.

End of the chapter.

The 1st chapter demonstrated 4 cases. The next time you face a problem, if you think a bit more before acting, you have got what I want to say. There is no troubleshooting template. Instead, what matters is the way of thinking, how you use your brain, and some luck.

In the next chapter, I will start to discuss some important knowledge for effective troubleshooting, including exception handling, memory, windbg, and……
