Best practices for writing job template level filters for the HPC Job Scheduler Service

This topic provides best practices for implementing custom job activation and job submission filters that are defined at the job template level (as DLLs). The HPC Job Scheduler Service can run these custom filters to provide additional checks and controls when jobs are submitted to the cluster (submission filters) or when jobs are about to get cluster resources (activation filters).

For more information about custom filters, see Understanding Activation and Submission Filters and the Microsoft.Hpc.Scheduler.AddInFilter.HpcClient Namespace API reference.


Examples of custom filters are available in the Microsoft HPC Pack 2008 R2 SP2 SDK code samples (in the Scheduler/Filters/DLL folder).

How to avoid causing CannotUnloadAppDomain exceptions in the scheduler process

The scheduler process hosts each job template level filter (DLL filter) in its own AppDomain. When the HPC Job Scheduler Service unloads a filter, it does so by unloading the filter's AppDomain. If the filter's AppDomain cannot be unloaded successfully, the .NET Framework terminates the scheduler process. This section describes when the HPC Job Scheduler Service unloads custom filters and offers guidance on avoiding a common cause of the CannotUnloadAppDomain exception: threads in the application domain that cannot stop execution immediately.

This behavior is triggered only by add-in code that leaves the AppDomain in a state in which it cannot be unloaded. Given this, we recommend the following:

  • The filter development cycle should, if at all possible, be carried out on a non-production cluster.  This minimizes any consequences of filter code that triggers this problem.
  • Filter unloading in production clusters should be scheduled to minimize potential impacts.

How the scheduler unloads filter DLLs

In the course of operations, cluster administrators might need to unload a particular filter (DLL). Some cases might include:

  • A new version of the DLL is available.
  • The existing DLL is consuming too much memory or CPU.
  • The existing DLL is no longer needed or desired.

DLL filters are unloaded in the following cases:

  • The scheduler process exits through a clean shutdown (for example, "net stop"). Note that all referenced filters are loaded automatically when the scheduler starts, so the filter DLL is still locked after a service restart.
  • All references to a filter are removed from the system. Any time that job templates are changed in the system (added, edited, or removed), the HPC Job Scheduler Service identifies any orphaned filters (loaded filters that are not referenced by any template) and unloads them.

When an instance of a DLL filter is unloaded, three things generally occur:

  1. An exclusive lock is taken across all of the customer-facing interfaces of that instance. The unload waits until any in-flight calls have completed. All new calls block.
  2. IFilterLifespan.OnFilterUnload() is called, if implemented.
  3. An attempt will be made to unload the AppDomain hosting the filter DLL.

Step 3 above can result in a CannotUnloadAppDomain exception if an AppDomain thread cannot stop execution promptly.
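
Because step 2 runs before the AppDomain is unloaded, OnFilterUnload() is a natural place to shut down anything the filter started on its own. The following is a minimal sketch, not code from the SDK samples. To stay self-contained it declares a hypothetical stand-in interface (IFilterLifespanSketch) instead of referencing the HPC Pack assemblies, and the RefreshingFilter class and its timer are illustrative assumptions. Check the API reference for the real IFilterLifespan signatures.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for IFilterLifespan so the sketch compiles without
// the HPC Pack SDK. Only OnFilterUnload() is named in this topic.
public interface IFilterLifespanSketch
{
    void OnFilterUnload();
}

// Illustrative filter that owns a periodic timer (an assumption for this sketch).
public sealed class RefreshingFilter : IFilterLifespanSketch
{
    private Timer _refreshTimer;
    private int _refreshCount;

    public RefreshingFilter()
    {
        // Fire immediately, then every 100 ms. The callback stands in for
        // a short, bounded refresh of cached data.
        _refreshTimer = new Timer(
            state => Interlocked.Increment(ref _refreshCount), null, 0, 100);
    }

    public int RefreshCount
    {
        get { return Interlocked.CompareExchange(ref _refreshCount, 0, 0); }
    }

    public bool IsTimerActive
    {
        get { return _refreshTimer != null; }
    }

    public void OnFilterUnload()
    {
        // Take ownership of the timer exactly once, so repeated calls are harmless.
        Timer timer = Interlocked.Exchange(ref _refreshTimer, null);
        if (timer == null)
        {
            return;
        }

        // Dispose(WaitHandle) signals the handle after queued callbacks finish.
        // The wait is bounded so that the unload itself cannot hang here.
        ManualResetEvent drained = new ManualResetEvent(false);
        if (timer.Dispose(drained) && drained.WaitOne(2000))
        {
            drained.Close();
        }
    }
}
```

The design choice here is that everything the filter starts has a matching, time-bounded stop in OnFilterUnload(), so no filter-owned callback is still running when the AppDomain unload is attempted.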

If you are writing DLL filters, it is helpful to understand the basics about AppDomain end-of-life and the CannotUnloadAppDomain exception. See the following articles on MSDN for more information:

AppDomain.Unload Method
CannotUnloadAppDomainException Class

How to write filters that don't block the unloading of filter AppDomains

This section examines two tempting patterns that can block the unloading of filter AppDomains (and potentially crash the Scheduler process). Each example provides explanatory notes.

 Example 1:  Objects left to the GC can hold AppDomain hostage.

 ** To be avoided:**

                IScheduler scheduler = new Scheduler();
                scheduler.Connect("localhost");

                ISchedulerJob job = scheduler.OpenJob(jobID);

                // more code here

                scheduler.Close();
                scheduler.Dispose();  // potential trouble here if exceptions cause this to be skipped

 ** Preferred:**

                using (IScheduler scheduler = new Scheduler())
                {
                    scheduler.Connect("localhost");

                    ISchedulerJob job = scheduler.OpenJob(jobID);

                    // more code here
                }

Notes:
 The trouble with Example 1 is that the IDisposable interface on the remoted object is not honored promptly if there is an exception in the intermediate code. Under these circumstances, the remoting mechanics keep the AppDomain from being unloaded. The "using" statement is an effective tool that enforces proper cleanup of such complicated objects.

 There are two vulnerabilities in Example 1:  

  • The GC may not get around to disposing of complicated objects before AppDomain unload.
  • If execution flow is interrupted (for example, by an exception), explicitly coded cleanup may not occur before AppDomain unload.
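
The second vulnerability can be reproduced without a cluster. The sketch below uses a hypothetical TrackedResource class as a stand-in for the remoted scheduler object and a simulated failure in place of the intermediate code. None of these names come from the HPC Pack SDK. It only illustrates that an explicit Dispose() call is skipped when an exception is thrown before it, whereas "using" still disposes the object.

```csharp
using System;

// Hypothetical stand-in for a disposable object such as the scheduler connection.
public sealed class TrackedResource : IDisposable
{
    public bool Disposed { get; private set; }

    public void Dispose()
    {
        Disposed = true;
    }
}

public static class CleanupDemo
{
    // Explicit cleanup: Dispose() is skipped when the intermediate code throws.
    public static bool ExplicitDisposeRuns()
    {
        TrackedResource resource = new TrackedResource();
        try
        {
            DoWorkThatThrows();
            resource.Dispose();   // not reached because of the exception
        }
        catch (InvalidOperationException)
        {
            // The caller handles the failure, but the resource is left undisposed.
        }
        return resource.Disposed;
    }

    // "using" cleanup: Dispose() runs even though the intermediate code throws.
    public static bool UsingDisposeRuns()
    {
        TrackedResource resource = new TrackedResource();
        try
        {
            using (resource)
            {
                DoWorkThatThrows();
            }
        }
        catch (InvalidOperationException)
        {
            // By the time control reaches this handler, Dispose() has already run.
        }
        return resource.Disposed;
    }

    private static void DoWorkThatThrows()
    {
        throw new InvalidOperationException("simulated failure in intermediate code");
    }
}
```

Calling CleanupDemo.ExplicitDisposeRuns() returns false and CleanupDemo.UsingDisposeRuns() returns true, which is the difference between the two versions of Example 1.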

  Example 2:  Catch blocks can hold AppDomain hostage:

** To be avoided:** 
       public void InfiniteLoopProc(object obj)
        {
            try
            {
                LogEventMsg("Entering Infinite loop thread proc.");
                while (true)
                    ;
            }
            catch (Exception ex)
            {
                LogEventMsg("Exception in infinite loop (about to loop): " + ex.ToString());

                while (true)
                    ;
            }
            finally
            {
                LogEventMsg("InfiniteLoop finally clause.  About to loop");

                while (true)
                    ;
            }           
        }

        public SubmissionFilterResponse FilterSubmission(Stream jobXmlIn, out Stream jobXmlModified)
        {
            LogEventMsg("FilterSubmission: about to spawn an infinite looping thread.");

            ThreadPool.QueueUserWorkItem(new WaitCallback(InfiniteLoopProc));

            SubmissionFilterResponse retval = SubmissionFilterResponse.SuccessNoJobChange;

            jobXmlModified = null;

            return retval;
        }

Notes: 
In Example 2 we have an implementation of ISubmissionFilter.FilterSubmission() that returns SuccessNoJobChange immediately after spawning a new thread. This new thread enters some important business logic (represented here by an infinite loop) in a try/catch/finally block.

When an attempt is made to unload the AppDomain, every thread in it is aborted, which raises a ThreadAbortException on that thread. The "catch all" catch block of the business logic catches the exception and begins performing lengthy cleanup, represented here by an infinite loop. Because the thread refuses to exit in a timely fashion, the AppDomain cannot be unloaded. Note that either the catch clause or the finally clause can trigger this problem (this example loops in both).
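
 ** Preferred (one possible shape):**

Example 2 has no "preferred" counterpart in the original samples, so the following is only a sketch of one way to keep a background thread from blocking the unload. It assumes the work can be split into short, bounded units. The StoppableWorker class and its members are hypothetical names, and the sketch uses a dedicated thread rather than ThreadPool.QueueUserWorkItem so that the caller can wait for the thread to exit. A filter could call Stop() from IFilterLifespan.OnFilterUnload().

```csharp
using System;
using System.Threading;

// Hypothetical helper: a background worker that exits promptly when asked to.
public sealed class StoppableWorker
{
    private readonly ManualResetEvent _stopRequested = new ManualResetEvent(false);
    private readonly Thread _thread;
    private int _iterations;

    public StoppableWorker()
    {
        _thread = new Thread(WorkerProc);
        _thread.IsBackground = true;
        _thread.Start();
    }

    public int Iterations
    {
        get { return Interlocked.CompareExchange(ref _iterations, 0, 0); }
    }

    private void WorkerProc()
    {
        try
        {
            // WaitOne serves as both the delay between units of work and the
            // stop check, so a stop request is noticed within about 10 ms.
            while (!_stopRequested.WaitOne(10))
            {
                // Placeholder for one short, bounded unit of business logic.
                Interlocked.Increment(ref _iterations);
            }
        }
        catch (ThreadAbortException)
        {
            // Keep this handler short: no loops and no blocking calls.
            // The runtime rethrows the abort when this block ends.
        }
    }

    // Asks the worker to stop and waits up to the given timeout.
    // Returns true if the thread exited in time.
    public bool Stop(TimeSpan timeout)
    {
        _stopRequested.Set();
        return _thread.Join(timeout);
    }
}
```

With this shape, the thread has normally already exited by the time the AppDomain unload is attempted, and if an abort does arrive, neither the catch block nor any finally block performs lengthy work.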