Who said that transactions go from the logs to the DB?

I've heard a lot of discussion about how Exchange ESE works, including the claim that transactions go from the logs to the DB based on the checkpoint depth. I've collected some information from different sources to explain how ESE actually writes transactions to the DB.

The following five subcomponents of ESE work together to move the data into the database and to its static form:

  • Log buffers   When ESE first receives a transaction, it stores it in log buffers. These log buffers hold information in memory before it is written to the transaction logs. By default, each buffer unit is the size of a disk sector, which means 512 bytes. JET sanitizes the configured number of buffers so that it is at least 128 sectors, at most 10,240 sectors, and aligned down to the largest 64 KB boundary. So, for Exchange 2000 Server (and all service packs) the default number of log buffers is 84, which JET sanitizes to 128, so the actual buffer area is 64 KB. For Exchange Server 2003, the default number of log buffers is 500, which JET sanitizes to 384, so the actual buffer area is 192 KB.
  • Log writer   As the buffers fill up, ESE moves the data from the buffers onto disk and into the log files. In this operation, the transactions are committed to disk to the logs in a synchronous fashion. This process is fast, because it is crucial to move the data from memory and into the transaction logs quickly in case of a system failure.
  • IS buffers   The IS or cache buffers are the first step toward turning a transaction into actual data. The IS buffers are a group of 4-kilobyte (KB) pages allocated from memory by Exchange for the purpose of caching the database pages before they are written to disk. When first created, these pages are clean, because no transactions have been written to them yet. ESE then plays the transactions from the logs into these empty pages in memory, thereby changing their status to dirty. The default maximum size these buffers can reach is 900 MB in Exchange 2000 Server SP3.
  • Version store   ESE writes multiple, different transactions to a single page in memory. The version store keeps track of and manages these transactions. It also structures the pages as the transactions occur.
  • Lazy writer   At this point, ESE must flush the dirty pages out of memory. The lazy writer is responsible for moving the pages from the cache buffers to disk. Because there are so many transactions coming in at once and so many pages getting dirtied, the job of the lazy writer is to prioritize them and subsequently handle the task of moving them there without overloading the disk I/O subsystem. This is the last phase and the point at which the transactions have officially become static data. It is also at this point that the dirty pages are cleaned and ready for use again.
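The log buffer sanitation described in the first bullet can be sketched in a few lines. This is only an illustration, not JET's code: the function names are made up here, and the round-down rule is inferred from the two examples in the text (84 becomes 128, 500 becomes 384).

```python
SECTOR_BYTES = 512                          # buffer unit size stated in the text
ALIGN_SECTORS = 64 * 1024 // SECTOR_BYTES   # 64 KB expressed in sectors (128)
MIN_SECTORS = 128
MAX_SECTORS = 10_240


def sanitize_log_buffers(requested: int) -> int:
    """Clamp the requested buffer count, then round down to a 64 KB multiple."""
    clamped = max(MIN_SECTORS, min(MAX_SECTORS, requested))
    return (clamped // ALIGN_SECTORS) * ALIGN_SECTORS


def buffer_area_kb(requested: int) -> int:
    """Size in KB of the log buffer area after sanitation."""
    return sanitize_log_buffers(requested) * SECTOR_BYTES // 1024


for name, default in (("Exchange 2000 Server", 84), ("Exchange Server 2003", 500)):
    print(name, sanitize_log_buffers(default), "buffers,", buffer_area_kb(default), "KB")
```

With the defaults quoted above, this reproduces 128 buffers (64 KB) for Exchange 2000 Server and 384 buffers (192 KB) for Exchange Server 2003.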

What about the checkpoint file and other components?

The Checkpoint File. The database engine maintains a checkpoint file called Edb.chk for every log file sequence in order to keep track of the data that has not yet been written to the database file on disk. The checkpoint file is a pointer in the log sequence that indicates where in the log file the information store needs to start the recovery in case of a failure. The checkpoint file is essential for efficient recovery. Without it, the information store would start from the beginning of the oldest log file on the disk and check every page in every log file to determine whether it had already been written to the database--a time-consuming process, especially if all you want to do is make the database consistent.

The Checkpoint depth. The checkpoint depth is a threshold that defines when ESE begins to aggressively flush dirty pages to the database.
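To make the relationship between the checkpoint and the checkpoint depth concrete, here is a rough sketch. The class, its field names, and the numbers in the usage lines are hypothetical and chosen only for illustration; the real engine tracks the checkpoint at a finer granularity than whole log generations.

```python
from dataclasses import dataclass


@dataclass
class CheckpointState:
    checkpoint_generation: int   # log generation where recovery would have to start
    current_generation: int      # log generation currently being written
    log_file_mb: int             # size of one log file in MB
    depth_mb: int                # checkpoint depth threshold (hypothetical value)

    def logs_needed_for_recovery(self) -> range:
        """Generations that recovery would replay, starting at the checkpoint."""
        return range(self.checkpoint_generation, self.current_generation + 1)

    def outstanding_mb(self) -> int:
        """Approximate amount of logged data not yet flushed to the database."""
        return (self.current_generation - self.checkpoint_generation) * self.log_file_mb

    def should_flush_aggressively(self) -> bool:
        """True once the outstanding log data exceeds the checkpoint depth."""
        return self.outstanding_mb() > self.depth_mb


state = CheckpointState(checkpoint_generation=100, current_generation=130,
                        log_file_mb=1, depth_mb=20)
print(len(state.logs_needed_for_recovery()), state.should_flush_aggressively())
```

The point of the sketch is that the checkpoint depth only decides how hard the engine works at flushing dirty cache pages. It does not move anything from the logs into the database.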

ESE Cache. An area of memory reserved for the Information Store process (Store.exe) that is used to hold database pages in memory. Keeping pages in memory reduces read I/Os, especially when a page is accessed many times. The cache also reduces write I/Os in two ways: by holding a dirty page in memory, it allows multiple changes to be made to the page before it is flushed to disk, and the engine can write multiple pages together (up to 1 MB) using write coalescing.
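Write coalescing can be pictured as grouping neighbouring dirty pages into a single I/O. The following is a minimal sketch under stated assumptions: the function name is made up, pages are identified by simple integers, and a 4 KB page size is assumed as the default. It is not how the engine actually schedules writes.

```python
def coalesce_writes(dirty_pages, page_kb=4, max_write_kb=1024):
    """Group dirty page numbers into contiguous runs no larger than max_write_kb."""
    max_pages = max_write_kb // page_kb
    runs = []
    for page in sorted(set(dirty_pages)):
        last = runs[-1] if runs else None
        # Extend the current run only if the page is adjacent and the run has room.
        if last is not None and page == last[-1] + 1 and len(last) < max_pages:
            last.append(page)
        else:
            runs.append([page])
    return runs


# Six dirty pages become two writes instead of six.
print(coalesce_writes([7, 3, 4, 5, 20, 6]))
```

Each inner list stands for one write request, so pages 3 through 7 go to disk together while page 20 is written on its own.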

In Exchange 2000 and Exchange 2003, each Exchange database consisted of two main files:

The EDB file. The .edb file is the main repository for the mailbox data. The fundamental construct of the .edb file is the b-tree structure, which is only present in this file, and not in the .stm file. The b-tree is designed for quick access to many pages at once. The .edb file design permits a top level node and many child nodes. Tree depth has the greatest effect on performance. A uniform tree depth across the entire structure, where every leaf node or data page is equidistant from the root node, means database performance is consistent and predictable. In this way, the ESE 4 KB pages are arranged into tables that form a large database file containing Exchange data. The database is actually made up of multiple b-trees. These other ancillary trees hold indexing and views that work with the main tree. The .edb file is accessed by ESE directly.
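A quick calculation shows why tree depth matters so much. The sketch below assumes a perfectly balanced tree and a hypothetical fan-out of 100 child pointers per page; the real fan-out depends on key and page size and is not given here.

```python
def tree_depth(leaf_pages: int, fan_out: int) -> int:
    """Levels needed below the root so that fan_out**depth >= leaf_pages."""
    if leaf_pages < 1 or fan_out < 2:
        raise ValueError("need at least one leaf page and a fan-out of 2 or more")
    depth, reachable = 0, 1
    while reachable < leaf_pages:
        reachable *= fan_out
        depth += 1
    return depth


# With the assumed fan-out, one million leaf pages sit three levels below the root.
print(tree_depth(1_000_000, 100))
```

In a uniform tree every lookup touches the same number of pages (the depth plus the root), which is why performance stays consistent and predictable as described above.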

The STM file. The .stm or streaming media file is used in conjunction with the .edb file to comprise the Exchange database. The purpose of the .stm file is to store streamed native Internet content. Unlike the .edb file mentioned previously, the .stm file does not store data in a b-tree structure. When a message arrives through the Internet or Simple Mail Transfer Protocol (SMTP), it always arrives as a stream of bytes. In Exchange Server 2003 and Exchange 2000 Server, these messages are streamed directly to the .stm file, without conversion, where they are held until accessed by a MAPI client. That way, if the end user is consistently accessing mail through POP3, the mail items are pulled directly from the .stm file and are already in the proper state for delivery. If the message is accessed by a MAPI client, however, it is moved over to the .edb file and converted to Exchange native form, and is never moved back to the .stm file.

Note that we removed the STM file in Exchange 2007.

So how do the above components work together?

  1. An operation occurs against the database (e.g. a client sends a new message), and the page that requires updating is read from the file and placed into the ESE cache (if it is not already in memory), while the log buffer is notified and records the operation in memory.
  2. The changes are recorded by the database engine but are not immediately written to disk; instead, these changes are held in the ESE cache and are known as "dirty" pages because they have not been committed to the database. The version store is used to keep track of these changes, thus ensuring isolation and consistency are maintained.
  3. As the database pages are changed, the log buffer is notified to commit the change, and the transaction is recorded in a transaction log file (which may or may not require a log roll and a start of a new log generation).
  4. Eventually the dirty database pages are flushed to the database file.
  5. The checkpoint is advanced.
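The five steps above can be sketched as a toy model. Everything in it (the class name, the dictionaries standing in for the database file and the cache, the list standing in for the log file) is a simplification invented for illustration and is not ESE's real structure. The sketch also assumes that the log buffer is written out before any dirty page is flushed.

```python
class MiniEngine:
    """Toy model of the write path; not ESE's real data structures."""

    def __init__(self, database):
        self.database = dict(database)   # pages "on disk": page number -> content
        self.cache = {}                  # pages held in memory
        self.dirty = set()               # cached pages changed since the last flush
        self.log_buffer = []             # log records still in memory
        self.log_file = []               # log records written to disk
        self.checkpoint = 0              # index of the first log record still needed

    def update(self, page, content):
        # Step 1: bring the page into the cache if needed and record the operation.
        if page not in self.cache:
            self.cache[page] = self.database.get(page)
        self.log_buffer.append((page, content))
        # Step 2: change only the cached copy; the page is now dirty.
        self.cache[page] = content
        self.dirty.add(page)

    def commit(self):
        # Step 3: write the log buffer out to the log file.
        self.log_file.extend(self.log_buffer)
        self.log_buffer.clear()

    def lazy_flush(self):
        # Assumption of this sketch: log records reach disk before the pages do.
        self.commit()
        # Step 4: dirty pages go from the cache to the database file.
        for page in sorted(self.dirty):
            self.database[page] = self.cache[page]
        self.dirty.clear()
        # Step 5: everything logged so far is in the database; advance the checkpoint.
        self.checkpoint = len(self.log_file)


engine = MiniEngine({1: "old"})
engine.update(1, "new message")
engine.commit()
assert engine.database[1] == "old"          # logged on disk, database file unchanged
engine.lazy_flush()
assert engine.database[1] == "new message"  # the page came from the cache, not the log
```

Note that lazy_flush never reads the log file. In this model the logs are only a safety net for recovery, and the database file is always updated from the cache.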

In Exchange 2007 we changed the database architecture in four significant respects to make use of the 64-bit platform:

  1. The streaming database (.stm) file has been removed from Exchange 2007.
  2. Longer log file names are used, thereby enabling each storage group to generate as many as 2 billion log files before log file generation must be reset.
  3. Transaction log file size has been reduced from 5 MB to 1 MB to support the new continuous replication features in Exchange 2007.
  4. The database page size has increased from 4 KB to 8 KB.
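For a feel of what the longer log file names buy, here is a small calculation. Two details are assumptions not stated in this post: that the name carries an eight-digit hexadecimal generation number after a three-character prefix such as E00, and that "2 billion" corresponds to a signed 32-bit limit (2^31 - 1).

```python
MAX_GENERATIONS = 2**31 - 1   # assumption: "2 billion" read as a signed 32-bit limit
LOG_FILE_MB = 1               # Exchange 2007 log file size from the list above


def log_file_name(prefix: str, generation: int) -> str:
    """Format a log file name with an 8-digit hex generation (assumed layout)."""
    if not 1 <= generation <= MAX_GENERATIONS:
        raise ValueError("generation out of range")
    return f"{prefix}{generation:08X}.log"


def total_log_capacity_tb() -> float:
    """Total log data, in TB, that can be generated before the sequence must be reset."""
    return MAX_GENERATIONS * LOG_FILE_MB / (1024 * 1024)


print(log_file_name("E00", 1))
print(log_file_name("E00", MAX_GENERATIONS))
print(round(total_log_capacity_tb()), "TB")
```

Under these assumptions, a storage group could write roughly 2,048 TB of 1 MB logs before the log sequence has to be reset.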

To push the ESE engine another step forward in terms of performance, Exchange 2007 SP1 introduced two major updates and some enhancements to the online maintenance process. I recommend reading Part I at https://msexchangeteam.com/archive/2007/11/30/447640.aspx and Part II at https://msexchangeteam.com/archive/2007/12/06/447695.aspx
