Slow recovery times and slow performance due to large numbers of Virtual Log Files
In my previous post about database mirroring configuration problems caused by large numbers of virtual log files, I mentioned that such a scenario can have other unwanted side effects, with or without database mirroring enabled. Here’s an example that demonstrates one of them: it shows how recovery times after a disaster can be significantly extended by allowing your log file to grow in small increments, thereby creating thousands of VLFs. The same example also demonstrates the performance impact of doing online file growth during periods of heavy activity. This scenario can affect you with or without mirroring enabled; it’s just quicker to demonstrate when configuring mirroring.
I mentioned before that to fully understand this topic you need to be familiar with Transaction Log Physical Architecture and also Factors That Can Delay Log Truncation.
There is one simple lesson to be learnt here, so I’ll write it early on in the article:
Pre-grow your log files (and your data files) in advance. Avoid repeated online auto-growth if at all possible.
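For example, rather than relying on many small automatic growths, you can take the log to its expected working size in one step. Here is a minimal sketch (assuming a hypothetical database named MyDB whose log file has the logical name MyDB_log):
--grow the log once, in advance, to its expected working size
alter database MyDB
modify file (name = MyDB_log, size = 5GB)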
Consider the following example scripts which are based upon the same routines as used in the previous post, but with some small changes.
These scripts perform the following actions, once for the good scenario and once for the bad scenario:
1. Create a database named VLF
2. Create a starting backup point
3. Open a transaction in this database to ensure that the transaction log cannot be truncated
4. Fill up the log with thousands of transactions until the log reaches the space limit set for it – in this case we test with a 5GB log. Running the log out of space is used simply to guarantee log files of equivalent sizes for the recovery time comparison; the same behaviour holds true for logs that haven’t run out of space.
5. Restart the server without taking a backup or dealing with the log space. This forces SQL Server to recover the entire contents of the active portion of the log, which is, in effect, all of it.
We then start the SQL Server service again and monitor the recovery times in the SQL Server error logs.
We notice a number of interesting points here:
1. The runtime required to fill the 5GB log is significantly shorter when using the pre-grown log file: less than 2 minutes, compared to over 9 minutes for the incrementally auto-grown file. This demonstrates why auto-growths can be bad: they are expensive to manage and can harm your performance. The fact that the log fills quicker shows that transactions are being processed at a higher rate.
2. There is a delay of about 2 minutes in the bad scenario during which nothing is logged about the recovery phases of the VLF database. This is the period between the database starting up and the analysis phase beginning. In reality there is a large amount of work going on here: SQL Server is processing the 20,480 VLFs that are present in the log file. This delay grows in proportion to the number of VLFs that you allow your log file to hold. I have seen real-world examples where hundreds of thousands of VLFs significantly delayed recovery.
3. The actual recovery times are similar (about 2 minutes) once recovery actually begins. Recovery takes this long in both cases because the old open transaction created in the test forces SQL Server to recover the entirety of the log.
4. As well as filling the log significantly faster in terms of runtime, which again indicates faster transaction processing, a closer look at the logs shows that about 6,700 more transactions were processed in the shorter period (33,216 versus 26,533). This is further evidence of how much better performance the pre-grown log file gives you: roughly 25% more transactions in about a fifth of the time!
The scripts and error log output are shown below. The scripts use the undocumented DBCC LOGINFO command to examine the number of virtual log files in the physical log. As with all undocumented commands, its use is unsupported and at your own risk. The number of rows that this command outputs represents the number of VLFs currently in the physical log.
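If you want a simple count rather than the raw output, you can capture the rows into a temporary table. A minimal sketch (the column list below matches the DBCC LOGINFO output of SQL Server 2005; later versions add columns, so adjust accordingly):
create table #loginfo
(
FileId int,
FileSize bigint,
StartOffset bigint,
FSeqNo int,
Status int,
Parity tinyint,
CreateLSN numeric(25,0)
)
--capture the DBCC output; the row count is the number of VLFs
insert #loginfo exec ('DBCC LOGINFO (vlf)')
select count(*) as vlf_count from #loginfo
drop table #loginfo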
SCRIPT 1 – Bad scenario
--CONNECT to the test server instance
use master
go
if exists (select name from sys.databases where name = 'vlf')
begin
drop database vlf
end
--create a database on the root of C with a small starting log size
--which then grows in small increments
create database vlf
on (name=vlf, filename='c:\vlf.mdf')
log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1MB,maxsize=5GB,filegrowth=1MB)
go
use vlf
go
--create a dummy table to fill up
create table vlf_test
(
a int identity(1,1) not null,
b char(8000) not null,
c int not null
)
go
insert vlf_test (b, c) values ('vlftest',0)
go
--create a starting point of backups
backup database vlf to disk = 'c:\vlf.bak' with stats = 1
go
backup log vlf to disk = 'c:\vlf1.bak'
go
--open a 2nd new connection and open an explicit uncommitted transaction
--to ensure that the transaction log cannot truncate
use vlf
go
begin tran
create table opentran
(
opentran int
)
Go
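--optional extra check, not part of the original routine: confirm what is
--preventing log truncation (ACTIVE_TRANSACTION here indicates the open
--transaction is holding the log; sys.databases exposes this in SQL Server 2005 and later)
select name, log_reuse_wait_desc from sys.databases where name = 'vlf'
go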
--switch back to the first connection
--now run the following to grow the log to create lots of VLFs
--run until the log runs out of space - a convenient stopping point
use vlf
go
set nocount on
go
declare @x int
declare @a char(8000)
declare @b char(8000)
set @a = replicate('a',8000)
set @b = replicate('b',8000)
set @x = 1
while 1=1
begin
update vlf_test set b = @b, c = @x where a = 1
update vlf_test set b = @a where a = 1
set @x = @x + 2
end
go
--let this run until the log runs out of space
--on the 2nd connection, use dbcc loginfo to check how many VLFs you have
DBCC LOGINFO (vlf)
--when you have reached 5GB of log file with several thousand VLFs
--simulate a server disaster by shutting the instance down without a graceful checkpoint
--alternatively you could just power off your server ungracefully
shutdown with nowait
--now restart the SQL service or power on the server
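--once the instance is back up you can filter the error log for messages about
--the vlf database (sp_readerrorlog is undocumented, like DBCC LOGINFO; the
--parameters are log number, log type (1 = SQL Server error log) and a search string)
exec sp_readerrorlog 0, 1, 'vlf'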
I get the following results from the error log when I run this on my test server.
Runtime of filling the log: 9 minutes 21 seconds
Number of VLFs when log full: 20480
Time for VLF to recover fully: 3 minutes 40 seconds
Excerpts from the SQL Server log:
Server starts:
2008-05-16 13:22:40.85 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)
VLF db starts up
2008-05-16 13:22:47.97 spid18s Starting up database 'vlf'.
2 minute pause whilst nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the large number of VLFs.
2008-05-16 13:24:34.28 spid18s Analysis of database 'vlf' (19) is 3% complete (approximately 8 seconds remain). This is an informational message only. No user action is required.
Redo has started
2008-05-16 13:24:45.80 spid18s Recovery of database 'vlf' (19) is 3% complete (approximately 108 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
Note how many transactions have been rolled forward – this is how many we did in the test
2008-05-16 13:26:21.36 spid18s 26533 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.
Undo has started
2008-05-16 13:26:21.39 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
Recovery complete
2008-05-16 13:26:21.49 spid5s Recovery is complete. This is an informational message only. No user action is required.
SCRIPT 2 – Good scenario
The script is identical to the bad scenario shown above, apart from the simple fact that we create the log at a pre-expanded size of 5GB, using this command:
create database vlf
on (name=vlf, filename='c:\vlf.mdf')
log on (name=vlf_log, filename='c:\vlf_log.ldf',size=5GB,maxsize=5GB,filegrowth=1MB)
This gives the following results on the identical server.
Runtime of filling the log: 1 minute 54 seconds
Number of VLFs when log full: 16
Time for VLF to recover fully: 2 minutes 2 seconds
Excerpts from the SQL Server log:
Server starts:
2008-05-16 13:40:16.10 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)
VLF db starts up
2008-05-16 13:40:22.64 spid18s Starting up database 'vlf'.
6 second pause whilst nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the VLFs, but in this case, because there are only 16, it completes much more quickly. This is the fundamental difference between the examples.
2008-05-16 13:40:28.99 spid18s Analysis of database 'vlf' (19) is 2% complete (approximately 157 seconds remain). This is an informational message only. No user action is required.
Redo has started
2008-05-16 13:40:44.22 spid18s Recovery of database 'vlf' (19) is 4% complete (approximately 106 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
Note how many transactions have been rolled forward – this is how many we did in the test
2008-05-16 13:42:18.69 spid18s 33216 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.
Undo has started
2008-05-16 13:42:18.69 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
Recovery complete
2008-05-16 13:42:18.89 spid5s Recovery is complete. This is an informational message only. No user action is required.
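As an aside, the two VLF counts line up with the algorithm this version of SQL Server uses when creating or growing a log file: a chunk of 64MB or less is divided into 4 VLFs, a chunk over 64MB and up to 1GB into 8, and a chunk over 1GB into 16. (This is the commonly documented behaviour for SQL Server 2005; treat the exact figures as version-specific.) A quick sanity check of the two scenarios:
--bad scenario: 1MB initial size plus 5,119 x 1MB growths = 5,120 chunks of 4 VLFs each
--good scenario: a single 5GB chunk at creation time = 16 VLFs
select 5120 * 4 as bad_scenario_vlfs, 16 as good_scenario_vlfs
--returns 20480 and 16, matching the DBCC LOGINFO counts above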
I hope this demo gives further weight to the case for pre-growing your SQL Server files in advance. Feel free to post a comment if you have a question.