How to use CHKSGFILES multi-threaded for faster consistency checks
One of the questions that we sometimes get here, and that we’ve never really been able to answer, is: do I HAVE to run the database consistency check (the CHKSGFILES API) in single-threaded fashion? Doing things multi-threaded could be so much faster! Well, the answer is: for some parts yes, but for other parts no.
If you don’t already know, CHKSGFILES is used by applications that back up Exchange databases, to ensure that the database to be stored is actually in good health and not corrupted. With Exchange 2010, it’s much less likely there will be a corruption problem, but it's safer for your backup and restore application to verify the data before making the backup.
Well, after some extensive archeology, we’ve determined that, if you do it right, you actually CAN run parts of CHKSGFILES multi-threaded. We’ll be adding this information into the SDK for October 2010, but wanted to get the word out now for those of you who are interested.
In general, using CHKSGFILES to check database consistency in a multi-threaded application works as in the following example. Remember, this is just an EXAMPLE, and is only intended to show which parts of CHKSGFILES must be handled in what way.
This example application uses two main processes, and a set of worker threads. The first (orange) process handles the overall backup job, while the second (blue) handles queue requests by creating worker threads to verify the database pages and log files. Central to the system is a request and completion-status queue.
Backup job-engine (the single-threaded part)
The Backup Job-engine process block (orange) in the diagram indicates the parts of the database consistency check that must be run single-threaded.
IMPORTANT Your application must never allow more than one Backup Job-engine block to operate at the same time. The CHKSGFILES APIs shown in that block should never be called in parallel. Your application should ONLY call CHKSGFILES consistency checks in sequence. The CHKSGFILES DLL does not support out-of-sequence or simultaneous calls to New(), ErrInit(), ErrCheckDbHeaders(), ErrTerm() and Delete(). Your application should call those APIs only once for each consistency check. After the sequence shown in the Backup Job-engine block has been completed, only then can that sequence be restarted.
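One way to keep your application honest about these rules is to put a small guard in front of the calls. The following Python sketch is only an illustration of the sequencing logic: LifecycleGuard, SequenceError and run_job_engine are hypothetical names of our own, and nothing here calls the real CHKSGFILES DLL. It assumes the whole Backup Job-engine sequence runs on one thread inside one process.

```python
import threading


class SequenceError(RuntimeError):
    """Raised when a lifecycle call is made out of order."""


class LifecycleGuard:
    """Tracks the single-threaded call order for one consistency check.

    Hypothetical helper: it calls no real API. It only records which
    step the application says it is about to perform.
    """

    ORDER = ["New", "ErrInit", "ErrCheckDbHeaders", "ErrTerm", "Delete"]

    def __init__(self):
        self._next = 0                        # index of the next legal call
        self._owner = threading.get_ident()   # thread that owns this check

    def step(self, name):
        if threading.get_ident() != self._owner:
            raise SequenceError(f"{name} called from a different thread")
        if self._next >= len(self.ORDER):
            raise SequenceError(f"{name} called after Delete; start a new check")
        expected = self.ORDER[self._next]
        if name != expected:
            raise SequenceError(f"{name} called, but {expected} is next")
        self._next += 1

    @property
    def finished(self):
        return self._next == len(self.ORDER)


# Only one Backup Job-engine sequence may be in flight at a time.
_job_engine_lock = threading.Lock()


def run_job_engine(do_checks):
    """Run one full sequence while holding the job-engine lock."""
    if not _job_engine_lock.acquire(blocking=False):
        raise SequenceError("another Backup Job-engine sequence is still running")
    try:
        guard = LifecycleGuard()
        guard.step("New")
        guard.step("ErrInit")
        guard.step("ErrCheckDbHeaders")
        do_checks()               # page and log checks happen here
        guard.step("ErrTerm")
        guard.step("Delete")
        return guard.finished
    finally:
        _job_engine_lock.release()
```

In a real application, each guard.step() call would sit immediately before the matching CHKSGFILES call, so an out-of-order or repeated call fails in your own code before it ever reaches the DLL.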
When the Backup Job-engine section starts, it should initialize whatever request queuing mechanism is being used. In this example, it’s a queue that includes the requests, the return status of those requests, and a separate process that scans for entries in the request queue. Because the intention of running a database consistency check is to find errors, the queue needs to return that success / failure information to the main part of the program.
After the queue is initialized, the backup job can use Windows VSS to take the snapshot. After the snapshot is successfully made and available, the backup application can call the CHKSGFILES New() function to obtain an instance of the API. The backup job must then call ErrInit(), indicating which databases are to be checked, the log-file path, and other parameters.
Then the backup job process calls the ErrCheckDbHeaders() function, to verify that all the databases have the proper header information. It is very important that ErrCheckDbHeaders() be called only once to check ALL the databases that were specified in ErrInit(). For Exchange 2007, this will likely be all the databases in a storage group. For Exchange 2010, this will probably be only a single database, because ErrInit() accepts only a single log file path.
In this example, the single-threaded Backup Job-engine then adds requests into the queue for the database pages in each database, and a single request to check the log files.
In this example, the backup job-engine then waits until the request queue contains completion status for all the requests. Then the backup job-engine calls the ErrTerm() function. Like ErrCheckDbHeaders(), your application must call ErrTerm() only once. It is up to the application to ensure that it has tried to check all the database pages and log files. Do not try to use ErrTerm() to keep track of the progress: if you call ErrTerm() before all the pages and logs have been checked, it will return an error and invalidate all checks that had been done on those databases. In this example, the backup job-engine process uses the queue entries to keep track of which database pages have been checked.
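To make the bookkeeping concrete, here is a minimal Python sketch of how the job-engine side might add requests and then wait for their completion status. Everything in it is an assumption for illustration: the request tuples, the two queue.Queue objects, and the function names enqueue_requests and wait_for_completion are ours, and the sketch uses in-process queues rather than whatever cross-process mechanism your application uses.

```python
import queue


def enqueue_requests(request_q, databases):
    """Add one page-check request per database plus a single log-check request.

    Returns the set of requests the job engine must see completed
    before it is safe to move on to ErrTerm().
    """
    pending = set()
    for db in databases:
        req = ("pages", db)
        request_q.put(req)
        pending.add(req)
    log_req = ("logs", None)      # exactly one log request per ErrInit() set
    request_q.put(log_req)
    pending.add(log_req)
    return pending


def wait_for_completion(status_q, pending, timeout=30.0):
    """Block until every pending request has reported a status.

    Each status entry is a (request, ok) pair. Entries for requests that
    are not pending are ignored. Raises queue.Empty if no status arrives
    within `timeout` seconds.
    """
    statuses = {}
    remaining = set(pending)
    while remaining:
        req, ok = status_q.get(timeout=timeout)
        if req in remaining:
            remaining.discard(req)
            statuses[req] = ok
    return all(statuses.values()), statuses
```

The point of the design is that the application, not ErrTerm(), knows when every request has reported back. Only after wait_for_completion() returns would the job engine go on to call ErrTerm().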
Finally, the backup job engine can call the Delete() function, to dispose of the CHKSGFILES instance. Then, based on the results, the backup application can copy the snapshot contents to the backup media, and continue processing the backup job.
Implication of storage groups
At this point a short discussion about the parameters passed to ErrInit() is appropriate. In Exchange 2007, when CHKSGFILES was introduced, the Exchange storage architecture included Storage Groups, which are collections of databases that can be managed together as a unit. So, with Exchange 2007 servers, you will probably pass a list of the databases in a storage group to the ErrInit() function. But Exchange 2010 doesn’t have storage groups. Instead, typically each database and log file set is kept separate. So, for checking Exchange 2010 databases, you will very likely pass a single-element array with one database, plus its log file path, to ErrInit(). Remember, ErrInit() requires the databases to be specified in an array, even when there is only one database to be checked.
You might wonder: if the application typically only sends a single database through the CHKSGFILES APIs at a time, what good is multi-threading? Good question! For that individual database, the page checks can be run in parallel, which will certainly speed up the process. But if you need to check multiple Exchange 2010 databases, your application will need to separately call New() and ErrInit() for each database, and separately handle the different instances of the CHKSGFILES API. Just like with a single set of databases sent to ErrInit(), the single-threading and sequencing rules for the API have to be followed for each instance.
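As a rough illustration of the one-instance-per-database pattern, the following Python sketch walks through two databases in turn. StubChecker is a hypothetical stand-in that only records which call would be made; it is not the CHKSGFILES API, and the file names are made up. The sketch runs the per-database sequences one after another.

```python
class StubChecker:
    """Hypothetical stand-in for one CHKSGFILES instance. It only records calls."""

    def __init__(self, ident, trace):
        self.ident = ident
        self.trace = trace
        self.record("New")

    def record(self, call, detail=None):
        self.trace.append((self.ident, call, detail))


def check_each_database(databases):
    """Run one complete sequence per (db_path, log_path) pair.

    Each database gets its own instance, and the full sequence is
    repeated for each one.
    """
    trace = []
    for ident, (db_path, log_path) in enumerate(databases, start=1):
        checker = StubChecker(ident, trace)             # separate instance
        checker.record("ErrInit", ([db_path], log_path))  # single-element list
        checker.record("ErrCheckDbHeaders")
        checker.record("ErrCheckDbPages", db_path)      # could be fanned out to workers
        checker.record("ErrCheckLogs")                  # exactly once per instance
        checker.record("ErrTerm")
        checker.record("Delete")
    return trace
```

Calling check_each_database([("db1.edb", "logs1"), ("db2.edb", "logs2")]) produces two complete New-to-Delete sequences, each with its own instance number and its own single log check.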
Queue Servicing process (the multi-threaded part)
As you can tell, nearly the whole CHKSGFILES API needs to be run in a single thread, and there can be no out-of-sequence calls made. In the example application, the CHKSGFILES parts that can be run multi-threaded are shown as the Queue Servicing process (blue), and the Database Page and log file worker threads (green).
IMPORTANT The CHKSGFILES API does support checking the log files (using the ErrCheckLogs() API) in parallel with checking the database pages (using the ErrCheckDbPages() API). However, your application must call ErrCheckLogs() only ONCE for each set of databases that were passed to ErrInit(). If ErrCheckLogs() is called more than once, an error will be returned, and the entire consistency check will have failed, even if no actual database or log file errors exist.
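Because a second ErrCheckLogs() call fails the whole check, it can be worth adding a belt-and-braces guard on the worker side. The Python sketch below shows one possible approach, assuming the workers are threads in a single process; LogCheckOnce is a hypothetical helper of our own, not part of CHKSGFILES.

```python
import threading


class LogCheckOnce:
    """Lets exactly one caller claim the log check for a given ErrInit() set."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = False

    def claim(self):
        """Return True for the first caller and False for every later caller."""
        with self._lock:
            if self._claimed:
                return False
            self._claimed = True
            return True
```

A worker would call claim() before the log check and skip the check (reporting an application error instead) if it gets False. You would create one LogCheckOnce per CHKSGFILES instance, so the guard resets with each new consistency check.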
When the queue servicing process starts, it begins checking for new entries in the request queue. When it sees a new request, it can start a new worker thread to service that request.
When it starts, the worker thread (green) should obtain the request information from the queue (or directly from the request queue servicing process), and then perform the check. In this example application, it is up to the worker thread to process the request appropriately: database page requests use ErrCheckDbPages(), while log file requests use ErrCheckLogs(). If the backup job-engine process is running properly, there should never be more than one request for log file checks for each set of databases passed to ErrInit(). When the check has completed, the worker thread should update the request status information in the queue, and then the thread should exit.
When all the dispatched threads have exited, the queue processing service can signal the backup job-engine process via the queue. Alternatively, the backup job-engine process can detect when there are no more unprocessed requests in the queue.
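The queue-servicing side described above might look something like the following Python sketch. It is an illustration under several assumptions: the request queue and status queue are in-process queue.Queue objects, requests are (kind, target) tuples, STOP and DONE are sentinel objects we invented, and check_pages and check_logs are caller-supplied functions standing in for the real ErrCheckDbPages() and ErrCheckLogs() calls.

```python
import queue
import threading

STOP = object()   # hypothetical sentinel: the job engine has no more requests
DONE = object()   # hypothetical sentinel: every worker thread has exited


def worker(request, check_pages, check_logs, status_q):
    """Service one request, then report its status and exit."""
    kind, target = request
    try:
        ok = check_logs() if kind == "logs" else check_pages(target)
    except Exception:
        ok = False            # treat any failure in the check as a failed check
    status_q.put((request, ok))


def service_queue(request_q, status_q, check_pages, check_logs):
    """Start one worker thread per request until STOP is seen."""
    threads = []
    while True:
        request = request_q.get()
        if request is STOP:
            break
        t = threading.Thread(
            target=worker, args=(request, check_pages, check_logs, status_q)
        )
        t.start()
        threads.append(t)
    for t in threads:
        t.join()              # wait for every dispatched worker to exit
    status_q.put((DONE, True))  # signal the job engine via the queue
```

Starting one thread per request keeps the sketch short. A real application with many databases would more likely cap the number of concurrent workers, for example with a fixed-size thread pool.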
So, that’s all there is. It’s not terribly complicated, but your application must follow these rules:
- CHKSGFILES New(), ErrInit(), ErrCheckDbHeaders(), ErrTerm() and Delete() can only be called from a single thread, must be called in the proper sequence, and can only be called once for each consistency check.
- CHKSGFILES ErrCheckDbPages() and ErrCheckLogs() can be called in a multi-threaded manner, but only after ErrCheckDbHeaders() is called and before ErrTerm() is called.
- CHKSGFILES ErrCheckLogs() must only be called once for all the databases specified to ErrInit(). But, ErrCheckLogs() can be called in parallel with calls to ErrCheckDbPages().
- If you’re backing up multiple Exchange 2010 databases, you must obtain separate CHKSGFILES instances for each database, and the sequencing and concurrency rules still apply for each instance.
Using a combination of single- and multi-threading, and obeying the rules described in this blog post, your backup application can check Exchange databases more quickly than it could in a purely single-threaded manner.
Thom Randolph
Documentation Manager
Exchange Developer Documentation
Microsoft Corporation