다음을 통해 공유


SQL Server: Understanding and Dealing with Page Restores

 

1. Introduction

One of the important responsibilities of a Database Administrator is to not only ensure database backups are taken properly but also be able to restore them when it becomes necessary. The time to restore a database and bring it back online to a point-in-time depends on several factors such as the frequency of backups, backups’ size, network speed, I/O. The larger the size of the database, the challenging it becomes to restore it. Fortunately, there are ways to minimize the restore/recovery window because of the several granularities at which a database can either be fully or partially restored. File or filegroup level backups allow restoring specific files or filegroups piece-by-piece without needing to perform a full restore thus leaving the rest of the database intact. While such a piecemeal restore approach takes less time than it takes to restore a full backup, there is still a chance of it taking longer if the file or filegroup size is huge. Imagine needing to restore a terabyte-sized full backup because of a single data page corruption. Not only would it be tedious but also incurs longer downtime until the database is back up. 

Thankfully, starting SQL Server 2005, individual pages can be restored within the data files thus making the restores more granular. Instead of restoring the full or piecemeal backups, we can restore specific pages from a backup in case they get corrupted. The SQL Server edition determines whether we can restore a page while the database remains online or offline. While offline restore incurs downtime, it would still be minimal when compared to the full-fledged offline restores. 

Note: Piecemeal restore is beyond the scope of this article and thus not discussed. If you are interested in learning more about it, I would suggest reading the following TechNet Wiki article that talks about it in greater detail. 
https://social.technet.microsoft.com/wiki/contents/articles/52112.sql-server-how-to-perform-piecemeal-restore-and-partial-backup-of-a-database.aspx

In this article, we will see how a page can be corrupted and recovered without data loss using page-level restore. Although SQL Server allows restoring up to a maximum of one thousand pages in a single restore statement, for simplicity, we will do a single page restore. 

2. Requirements

In order for page restores to work successfully, the following requirements should be met.

  1.  The database has to be in either the full or bulk-logged recovery model. There are some gotchas when working with page restores under the bulk-logged recovery model. It doesn’t work in simple recovery model.
  2. There must be a free-of-damage chain of log backups from the latest full or differential backup up to the time the page restore started. A switch to the simple recovery model during the period that the log backups are required breaks the log backup chain.
  3. For SQL versions older than 2008, the log must not have been truncated using BACKUP LOG with TRUNCATE_ONLY. 

3. Limitations

Page restore comes with some limitations:

  1. For corrupt data pages that stored minimally logged operations under the bulk-logged recovery model, page restore won’t work if those pages are not covered by a log backup. Not that page restores don’t work in the bulk-logged recovery model. They do, however, there will be problems if the corrupt pages had data written by a minimally logged transaction which is not captured in the log backup. 
  2. While page restores work in every edition of SQL Server, we can execute them online (I.e. keeping the database online) only in Enterprise edition. 
  3. We cannot use page restores to restore all page types. Only data, index, and text pages can be restored. GAM, SGAM, PFS, ML, and DIFF pages cannot be restored. IAM page can be restored. 
  4. The database boot page (page number 9 in file 1) and file header pages (page number 0 in each file) cannot be restored. These are critical pages. Even CHECKDB repair cannot fix corruption on them.
  5. If a page belongs to a critical system table, it can be restored, however, it would be an offline operation (regardless of edition) where the database remains unavailable to users. Note that it has to be single-page per restore with critical system table pages. 

4. Setting up the Environment 

Let us set the stage to test page restores. We will begin by creating a new database called 'PageRestore' and immediately switch it to Full Recovery model so we can run log backups later. We then create a table called 'TblPageRestore' with a single 'ID' column and insert 65536 rows in it. 

CREATE DATABASE PAGERESTORE
GO
USE [MASTER]
GO
ALTER DATABASE [PAGERESTORE] SET RECOVERY FULL WITH NO_WAIT
GO
USE PAGERESTORE
GO
IF EXISTS(SELECT * FROM SYS.TABLES T JOIN SYS.SCHEMAS S ON T.SCHEMA_ID = S.SCHEMA_ID
WHERE S.NAME = 'DBO' AND T.NAME = 'TBLPAGERESTORE') DROP TABLE DBO.TBLPAGERESTORE;
GO
 
CREATE TABLE DBO.TBLPAGERESTORE
(
    ID VARCHAR(10) NOT NULL,
);
 
;WITH N1(C) AS (SELECT 0 UNION ALL SELECT 0)
,N2(C) AS (SELECT 0 FROM N1 AS T1 CROSS JOIN N1 AS T2)
,N3(C) AS (SELECT 0 FROM N2 AS T1 CROSS JOIN N2 AS T2)
,N4(C) AS (SELECT 0 FROM N3 AS T1 CROSS JOIN N3 AS T2)
,N5(C) AS (SELECT 0 FROM N4 AS T1 CROSS JOIN N4 AS T2)
,IDS(ID) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM N5)
INSERT INTO DBO.TBLPAGERESTORE(ID)
    SELECT CONVERT(VARCHAR(10),ID)
    FROM IDS;
GO

Below is the output. 

Next, let's take a full backup while the database is free of corruption. It's important that this full backup is clean since the restore sequence starts off with this backup followed by the sequence of log backups up to the time the page restore is started. We can also incorporate the latest differential backup in the restore sequence (if one is available), and start the log backup sequence after the differential backup. 

Tip: If the full backup is taken with CHECKSUM, you can run RESTORE WITH VERIFYONLY to ensure the backup file is corruption-free prior to restoring it. Otherwise, it would be mandatory to go back to an earlier clean full backup and start restoring the log backups in sequence up to the time the page restore started (unless the chain is broken, in which case the log sequence cannot be restored).

The restore can also be started with a full filegroup or file backup, as long as the backup consists of all the pages that are being restored. Either way, let's take a full backup and a log backup before corrupting a page. 

--TAKE A FULL BACKUP BEFORE CORRUPTING THE RECORD
BACKUP DATABASE [PAGERESTORE] TO  DISK = N'D:\SQL_BACKUPS\PAGERESTORE_FULL_BEFORECORRUPTION_2.BAK'
WITH NOFORMAT, NOINIT,  NAME = N'PAGERESTORE-FULL DATABASE BACKUP', SKIP, NOREWIND, NOUNLOAD,  STATS = 10, CHECKSUM
GO
--TAKE THE FIRST LOG BACKUP BEFORE CORRUPTING THE RECORD
BACKUP LOG [PAGERESTORE] TO  DISK = N'D:\SQL_BACKUPS\PAGERESTORE_TLOG_BEFORECORRUPTION.TRN' WITH NOFORMAT, NOINIT,  NAME = N'PAGERESTORE-TLOG DATABASE BACKUP', SKIP, NOREWIND, NOUNLOAD,  STATS = 10, CHECKSUM
GO

5. Corrupting a Page

Now that the database is ready and the backups have been taken, let us examine the pages in the table 'TblPageRestore'  and pick a page as a candidate for corruption. We will use the undocumented DBCC IND command to analyze the table's internal structure. 

Let's pick the last page 2256 highlighted in the partial screenshot above, and see what rows are covered on that page. To examine the rows and page structure, we will use the undocumented command DBCC PAGE and supply the page number 2256 and file ID 1. Traceflag 3604 allows DBCC PAGE to print the output. 

Notice the yellow highlighted text above in the rectangular red box. It contains a single row (ID = 65284). Note the slot number is 0 and offset is 0x60 (in hex). Our goal here is to corrupt the row (ID = 65284) on page 2256. The rest of the rows on the page remain intact. The page header also indicates that the data recorded on the page is not the result of a minimally logged operation. 

Let's calculate the column ID's offset so we have the actual address of the row ID 65284. This is done by adding the row offset 0x60 and the column offset (5) and convert into int which gives the offset 101 in decimal (see the code below). 

Next, we will use an extremely powerful undocumented command DBCC WRITEPAGE to corrupt the record on the page 2256 by supplying an invalid value and other appropriate parameters (shown in the code below). 

IMPORTANT: DBCC WRITEPAGE is undocumented and not supported, and can potentially corrupt the database, so use it at your own risk. DO NOT run it in production or you run the risk of losing data. 

All statements are given below. The parameters are explained in the comments. 

USE PAGERESTORE
GO
DBCC IND (N'PAGERESTORE', N'TBLPAGERESTORE', -1);
GO
 
-- PICK THE LAST PAGE FROM THE BOTTOM = 2256
-- CHECK SLOT 0 IS ID 65284
DBCC TRACEON (3604)
DBCC PAGE (N'PAGERESTORE', 1, 2256, 3);
GO
 
 
-- TAKE THE SLOT 0 OFFSET, CONVERT TO AN INT = 101
SELECT CONVERT (INT, 0X60 + 5);
GO
 
-- CORRUPT SLOT 0'S RECORD (ID=65284)
-- 2256 IS THE PAGE ID
-- 101 IS THE CONVERTED INT FROM ABOVE
DBCC WRITEPAGE (N'PAGERESTORE', 1, 2256, 101, 8, 0XFFFFFFFFFFFFFFFF);
GO
 
-- CHECK IF CHECKDB RETURNS ERRORS
DBCC CHECKDB (N'PAGERESTORE') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
 
--SELECT THE CORRUPTED ID VALUE FROM THE TABLE
SELECT * FROM TBLPAGERESTORE WHERE ID = 65284
 
--SEE WHAT THE VALUE FOR THE CORRUPT ID IS
SELECT * FROM TBLPAGERESTORE

Now that the record has been corrupted, let's run DBCC CHECKDB and see what it returns.

Clearly, we successfully corrupted the record. There are multiple ways of identifying corrupt pages. Of course, DBCC CHECKDB is the most preferred since it is thorough, but the other ways to check corrupt pages are the suspect_pages table in msdb, and SQL error log for high-security errors. Though suspect_pages contains a list of damaged pages, it may still miss some if SQL hasn't encountered them. 

Let's run a select query against the table for the corrupt value and see what it returns. 

As can be seen above, it errored out on the corrupted record. Let's try to retrieve all records and see what it has for the corrupted record. 

It has 'NULL' for the corrupt record (ID = 65284). In the next section, we will see how we can restore and recover the page without data loss. 

6. Recovering the Corrupted Page

Before we start the restore sequence, let's run a log backup to back up any changes that may have occurred in the database. Since ours is a sample database, there is nothing to back up, but in a real-world environment, we need to run the log backup at this point. Remember we already took one log backup after the full backup (before corruption) so this one is the second log backup in sequence. 

--TAKE THE SECOND LOG BACKUP POST CORRUPTION BUT BEFORE STARTING THE PAGE RESTORE
BACKUP LOG [PAGERESTORE] TO  DISK = N'D:\SQL_BACKUPS\PAGERESTORE_TLOG_AFTERCORRUPTION.TRN' WITH NOFORMAT, NOINIT,  NAME = N'PAGERESTORE-TLOG DATABASE BACKUP', SKIP, NOREWIND, NOUNLOAD,  STATS = 10, CHECKSUM
GO

Next, restore the corrupt page from the full backup with no recovery. Notice the parameter "PAGE = '1:2256' indicates we are restoring a single page and not the entire full backup.
 

-- RESTORE PAGE 2256 FROM THE UNCORRUPTED FULL BACKUP
RESTORE DATABASE PAGERESTORE
PAGE = '1:2256'
FROM DISK = 'D:\SQL_BACKUPS\PAGERESTORE_FULL_BEFORECORRUPTION_2.BAK'
WITH NORECOVERY

As can be seen in the output above, we have successfully restored the corrupt page from the full backup. However, it hasn't been recovered yet. If we try querying the table 'TblPageRestore', it returns all rows until it hits the pages being restored, and returns the following error indicating that the page is marked Restore Pending. 

At this point, we have only restored the full backup. Let's continue to restore the two log backups, again, with no recovery.

RESTORE LOG PAGERESTORE
FROM DISK = 'D:\SQL_BACKUPS\PAGERESTORE_TLOG_BEFORECORRUPTION.TRN' WITH NORECOVERY
 
RESTORE LOG PAGERESTORE
FROM DISK = 'D:\SQL_BACKUPS\PAGERESTORE_TLOG_AFTERCORRUPTION.TRN' WITH NORECOVERY

Alright, we have now restored all backups and are ready to recover the page. Note that the database is online and operational throughout the process. Let's try to recover now. 

--THIS WILL RETURN WARNING
RESTORE DATABASE PAGERESTORE WITH RECOVERY

It returned the following warning and the recovery failed. The reason is since the database remained online, the restored page is not consistent with the rest of the online part of the database. 

We have to bring the page up-to-date by performing one last log backup. Note this not the tail-log backup. It's just another log backup that is taken after the page restores started and should have the LSN higher than the LSN for a transaction that affected the page. 

Note: Had this page restore been an offline operation (as in Standard edition), we would have taken the tail-log backup before the page restore started which would have switched the database into "recovering" mode. Hence, in that case, no additional log backup would be required. But because ours is an Enterprise edition and online restore, we have to take a log backup after the restore started. The log backup can be taken with the COPY_ONLY option so as to not affect the regular log chain. 

Let's run the third log backup.

--TAKE THE LAST LOG BACKUP POST CORRUPTION AFTER STARTING THE PAGE RESTORE
BACKUP LOG [PAGERESTORE] TO  DISK = N'D:\SQL_BACKUPS\PAGERESTORE_TLOG_AFTERPAGERESTORE.TRN' WITH NOFORMAT, NOINIT,  NAME = N'PAGERESTORE-TLOG DATABASE BACKUP', SKIP, NOREWIND, NOUNLOAD,  STATS = 10, CHECKSUM
GO

In a real prod environment, and depending on when the page gets corrupted, there may be a long log backup chain to restore, and it would be difficult to write individual restore statements for all backups. An easier way is to generate the restore commands based on the backup times in the MSDB database backup history tables. Once the statements are generated, we simply need to run them in sequence starting with the earliest. The script to generate the restore statements is given below. 

DECLARE @DBNAME VARCHAR(100) = 'PAGERESTORE'
 
SELECT DATABASE_NAME, BACKUP_START_DATE,
        CASE TYPE
            WHEN 'D' THEN 'FULL DATABASE BACKUP'
            WHEN 'I' THEN 'DIFFERENTIAL BACKUP'
  END AS BACKUPTYPE
    FROM MSDB..BACKUPSET BSET
        INNER JOIN MSDB..BACKUPMEDIAFAMILY BMFAMILY
          ON BSET.MEDIA_SET_ID = BMFAMILY.MEDIA_SET_ID
        WHERE DATABASE_NAME = @DBNAME
            AND TYPE IN ('D','I')
 
-- REPLACE THE @BACKUPSTART VALUE WITH THE ONE RETURNED FROM ABOVE
DECLARE @BACKUPSTART DATETIME = '2019-08-01 05:45:26.000'
 
SELECT 'RESTORE LOG [' + @DBNAME + '] FROM DISK = ''' + BMFAMILY.PHYSICAL_DEVICE_NAME + '''
    WITH NORECOVERY' AS 'RESTORE LOG STATEMENT'
FROM MSDB..BACKUPSET BSET
    INNER JOIN MSDB..BACKUPMEDIAFAMILY BMFAMILY ON BSET.MEDIA_SET_ID = BMFAMILY.MEDIA_SET_ID
    WHERE DATABASE_NAME = @DBNAME
        AND BACKUP_START_DATE >= @BACKUPSTART AND TYPE = 'L'
    ORDER BY BACKUP_START_DATE

In our setup, the above script returned the three log backups. 

Let's start restoring the last one since we have already restored the first two backups. 

RESTORE LOG PAGERESTORE
FROM DISK = 'D:\SQL_BACKUPS\PAGERESTORE_TLOG_AFTERPAGERESTORE.TRN' WITH NORECOVERY

Let's try to run recovery.

As you can see, the recovery ran successfully. Let's see if the corrupt record was successfully recovered or not. 

The record (ID = 65284) was successfully recovered with no data loss. 

Let's now run DBCC CHECKDB and ensure there is no corruption. 

It is evident from the CHECKDB outcome above that the corruption has been successfully removed from the database 'PageRestore'

7. Page Restores Using GUI

Pages can also be restored using the SSMS GUI. We have to specify the pages we want to restore followed by verifying the backup set. There is a button to run CHECKDB with PHYSICAL_ONLY to identify the corrupt pages from within the restore wizard. 

The backup sets are read from the backup history in MSDB and are auto-populated in the wizard. we just need to ensure that the backup sets are correct, and proceed to restore. 

8. Conclusion

Page-level restores provide a more granular level of recovery and greater flexibility by allowing individual pages to be restored without having to restore the entire database. This feature is invaluable when only a subset of pages are corrupted in a mission-critical database that cannot afford longer downtime.  We can run page-level restore online in Enterprise edition of SQL Server whereas, in other editions, it is an offline operation. Even with offline, the downtime is minimal because of restoring the smallest unit of storage (i.e. page) as opposed to an entire database. For page restores to work, the database has to be in full or bulk-logged recovery model, both of which allow log backups. Since log backups are not possible under the simple recovery model, page restores are not allowed there. Also, not all pages are candidates for page-level restores.

9. See Also