Partager via


I Miss You SQL Server Agent: Part 3

[This article was contributed by the SQL Azure team.]

Currently, SQL Azure doesn’t have the concept of a SQL Server Agent. In this blog series we are attempting to create a light-weight substitute using a Windows Azure worker role. In the first blog post of the series, I covered how the Windows Azure worker roles compare to SQL Server Agent, and got started with Visual Studio and some code. In the second blog post, I covered how to code a worker role that only executes a job once per day. In this third post in the series, I am going to dive into error handling for Windows Azure worker roles.

Handling Errors

When coding multi-threaded applications, i.e. like the example in the second blog post, that want to accomplish something successfully once, it is very important to handle the error cases. The code needs to know if it really was completed, or if there was an error and needs to be restarted. What happens if the worker role is recycled before the job can complete? Or the worker role throws an exception? Or we get a transient error from SQL Azure? In the example code, when a job doesn’t complete in these situations there will be a row in the database, with a start execution date but no end execution date, and no worker role attempting to complete it. Basically a call to our StartJob stored procedure without a call to StopJob stored procedure. It is important to note that the row is in the same state when a worker role is processing the job – to tell the different between the two we are going to make an assumption about how long the job should take.

Transactions

If we can complete a job in a single transaction on a single database, then we can create a job processing system which will recover on Windows Azure worker role exceptions, Transact-SQL errors, or SQL Azure transient errors.

If SQL Azure is attempting to complete operations that are in a local transaction (see the transaction blog post for more information) and there is an error on the client side, then SQL Azure will close the connection on your behalf which forces a rollback of the transaction. This means that if the Windows Azure worker role fails for any reason: either from Windows Azure failure or an exception the client-side, the connection will close causing a rollback.

You always get one of two outcomes:

  • The job and transaction complete successfully and you are able to called StopJob stored procedure from the Windows Azure worker role. If we take special care to make sure that StopJob succeeds (discussed later) then we can successfully maintain our once per day tracking.
  • The Windows Azure worker role has an exception or fails which causes the connection to close, the job operations to be rolled back, and StopJob is not called.

There is never a case where operations are half executed (thanks to SQL Azure transactions), or that StopJob is called and the transaction was rollback. It also means, that we can bundle worker role exceptions, and worker role failures in the same category – handling them with the same error handling.

Self-Monitoring Worker Roles

If we want to police job completion and we know that the worker role might be recycled, we need to clean-up worker jobs that have not finished. The easiest way to do this is to make an assumption about the maximum time a job could possibly run, and delete all rows in the database where the job has run longer than that, however hasn’t finished. Deleting the row will cause the worker role to restart the job and hopefully this time it will complete without errors.

Here is what the cleanup Transact-SQL code looks like that cleans up all jobs that have been running over an hour:

 CREATE PROC CleanUp
AS
DELETE jobactivity 
WHERE stop_execution_date IS NULL AND DATEDIFF(minute,start_execution_date,GetDate()) > 60 

It is called from the running worker role instances in the form of a stored procedure, in the main loop.

 public override void Run()
{
    // This is a sample worker implementation. Replace with your logic.
    Trace.WriteLine("WorkerRole1 entry point called", "Information");

    while (true)
    {
        // WWB: Clean Up The Unfinished Jobs
        CleanUp();

        DateTime nextExecutionTime = new DateTime(
            DateTime.UtcNow.Year,
            DateTime. UtcNow.Month, DateTime. UtcNow.Day,
            13, 0, 0);

        if (DateTime. UtcNow > nextExecutionTime)
        {
            // WWB: After 1:00 pm, Try to Get a Job Id.
            Guid? jobId = StartJob("TestJob");
            if (jobId.HasValue)
            {
                Trace.WriteLine("Working", "Information");

                // WWB: This Method Has the Code That Execute
                // A Stored Procedure, The actual Job
                ExecuteTestJob();

                StopJob(jobId.Value);
            }

            // WWB: Sleep For An Hour
            // This Reduces The Calls To StartJob
            Thread.Sleep(3600000);
        }
        else
        {
            // WWB: Check Every Minute
            Thread.Sleep(60000);
        }
    }
}

So what happens if there are multiple worker role instances and one cleans up a job that the other one is still running, the situation where the actual time of completion for the job is longer than the assumed?

  • The row in jobactivity is deleted by a worker role instance, erroneously during clean-up.
  • The other worker role instance completes the job, StopJob is called, the row doesn’t exist, which doesn’t result in an error.
  • Then the job is completed again – which violates our once per day goals. There’s no way in a distributed system like this to guarantee once-per-day. You can only guarantee at-least-once-per-day (with best effort to be exactly once).

The best way to be successful is to make sure the assumption about the maximum runtime is very generous in the clean up stored procedure.

Also, be very careful with the times, if you start a job too late in the day, and the assumption for maximum runtime is too long, there might not be any time left in the day to complete the job if the job fails. For example, if you have a job that takes 45 minutes job starting at 11:00 pm, that fails the CleanUp stored procedure has to remove the row, leaving not be enough time to process the job again.

Transient Errors in SQL Azure

Transient errors in SQL Azure are errors that happen when SQL Azure failover is happening; however an identical call a moment later will not error. You can learn more about them here in this blog post. It is very important that StopJob stored procedure be successfully called if the job was completed successfully. For this reason you want to make sure you perform the retry scenarios discussed in the blog post when calling the StopJob stored procedure. It just so happens, that all the other queries to SQL Azure don’t need to check for transient errors, since failure will result in the Windows Azure worker role instance being recycled, and the job being rerun after clean-up.

Error Reporting

These blog posts have touched on some error handling, and retry scenarios; however I haven’t talked about error reporting. Wouldn’t it be nice to know when a job failed? If rows had to be cleaned up? You can do that with Windows Azure Diagnostics, read more about that here.

Summary

I want to thank Steve Marx for proofing this series and giving me ideas about how all the pieces should go together. You can find his Windows Azure focused blog here, and find him on twitter here.

In no way does this blog series try to replace all the functionality of SQL Server Agent, my goal was to demonstrate some simple code to show you can use a Windows Azure worker role to replicate basic SQL Server Agent functionality.

Do you have a better way to accomplish the same thing? Post it in the comments below. Do you have questions, concerns, comments? Post them below and we will try to address them.