How to slowdown a SQL Server Database Backup
Yes, you read correctly. We are not talking about accelerating a SQL Server database backup. In our customer case we encountered the challenge to slow down the full database backup execution of a SQL Server database underneath the customer's SAP ERP system. How did we get into this situation? In short the story reads:
- We migrated a whole customer SAP landscape from AIX/DB2 to Windows and SQL Server 2014 SP1 CU4.
- Larger systems with BW and ERP around 9TB-10TB on DB2 originally.
- Thanks to SQL Server database compression and especially SQL Server columnstore implemented in BW, we got the databases to around 7TB for ERP and 3.7TB for BW.
- In order to get great performance in principle and especially for the backup, we had 64 data files per database that were located on 32 distinct volumes. Backup destinations were on 4 distinct volumes where we backed up against 16 backup files.
- Databases were all encrypted by SQL Server TDE (https://msdn.microsoft.com/en-us/library/bb934049(v=sql.120).aspx ).
- Backups were taken on the AlwaysOn secondary replica that was supplied in a synchronous manner.
- The backups were done without SQL Server backup compression (https://msdn.microsoft.com/en-us/library/bb964719(v=sql.120).aspx ) since backup compression is not efficient in compressing with databases that are TDE encrypted.
- Storage bandwidth was around 2GB/sec to the storage from each of those servers. Also from the storage backend our tests showed that the I/O bandwidth could be sustained.
- In tests we were pretty successful in pushing those bandwidth limits in performing backups.
So far so good. However, as all went into production the customer realized that the great backup performance did have impacts on the infrastructure. This again affected the production workload. So the task we were facing was to slowdown the IOPS the SQL Server backup activity was generating by a factor of 2. However, without changes in the disk configuration or changes to the number of backup files.
SQL Server backup can be throttled or tuned a bit by e.g. defining the parameter BUFFERCOUNT in the backup statement. However, impact is not that extreme to get to the factor 2 less IOPS as we needed. You also can't force SQL Server to a buffercount=1 since SQL Server will, as a minimum, allocate 2 buffers per media stream. Since we have at least a stream per disk volume, we will end up with a number of buffers. So something else needed to be found.
So we looked into SQL Server Resource Governor (https://msdn.microsoft.com/en-us/library/bb933866(v=sql.120).aspx ). SQL Server Resource Governor introduced the capability to create resource pools that limit the I/O activity per volume (MAX_IOPS_PER_VOLUME option - https://msdn.microsoft.com/en-us/library/bb895329.aspx ). It sounded like an awesome idea to exactly limit the IOPS the backup would generate. The idea was to use performance monitor to determine the IOPS per volume and to throttle. Since we wrote against four volumes and read from 32, it certainly would be the write IOPS that we would throttle. We did not use backup compression. Hence the accumulated number of reads and writes should be the same. So we did a test and did not see any impact of anything governed. Means the backup blasted away as before. So first check, is the session we used to perform the backup even assigned to the resource group? You can find that out in the SQL Server DMV sys.dm_exec_sessions (https://msdn.microsoft.com/en-us/library/ms176013.aspx ). Check for one of the last columns called 'group_id'. The default resource group that all client sessions go through has the value of '2'. It is expected that using our defined resource group, there should be a 3 digit number of >255. That was the case. Nevertheless, the I/O activity of the backup execution did not get throttled and the IOPS in writing were way beyond the limit we set. Why is that?
Answer: Before SQL Server 2016, the I/O activity of the backup was not able to be limited with the MAX_IOPS_PER_VOLUME option of a resource pool. This only became possible now with SQL Server 2016. However, the customer was running SQL Server 2014. So a miss here.
Another colleague in SQL Server development recommended to take a look into this article: https://msdn.microsoft.com/en-us/library/cc280384(v=sql.120).aspx . That is where we formed an idea:
- So far we were not using backup compression since it did not make any sense to use it for a TDE encrypted database.
- But since we are taking the backups from an AlwaysOn secondary replica, there was ample of CPU resources available.
- So let's use compression for the SQL Server database backup and limit the CPU that compression can leverage as described in this article. And with that indirectly slowdown the backup activity and with that the amount of I/O generated against the storage infrastructure.
Following step-by-step what this article https://msdn.microsoft.com/en-us/library/cc280384(v=sql.120).aspx described, we created the resource pool and group, created our classifier function also based on a SQL Server login and tried again. And, it did not work. Despite the fact that we set the option MAX_CPU_PERCENT to half the CPU resource consumption the backup compression took in the exercise before, we saw no effect. Backup compression continued to take as much CPU resources as before and clearly drastically more than we had set in the MAX_CPU_PERCENT for the resource pool. And the resource group/pool was used by the session that executed the backup. Why was that?
Answer: We did not read the article closely. Already in the first paragraph it states: 'Therefore, you might want to create a low-priority compressed backup in a session whose CPU usage is limited by Resource Governor when CPU contention occurs'. Looking closer into the description of this article: https://msdn.microsoft.com/en-us/library/bb895329.aspx , it states about the option we were using only coming into play when the overall CPU consumption of the server or VM is high (contention). However, in our case, we were, besides some redo thread of AlwaysOn, the only consumer and even with backup compression had ample of CPU resources left. So no CPU contention and hence no throttling.
Solution: There is another option available when defining a resource pool. This option is named 'CAP_CPU_PERCENT'. This option allows to cap the CPU usage of a task assigned to a resource pool to a certain percentage independent of the overall availability of CPU resources. So we gave that a try. And see it worked. Now we had it. We had a lever with which we, indirectly though, could influence the IOPS that got generated from SQL Server backup. Not exactly the direct way. But a way that can work up to the point where either the infrastructure issues get resolved or SQL Server 2016 comes in use with this customer.
Needless to state that the user context (Windows Authentication) chosen to classify the session with executing the full database backups is a user context that only gets used for the full database backups and for nothing else. With that we avoid any other task using the resource pool.
That was it. The strange task to slow down a SQL Server backup with SQL Server 2014. But it worked.
Comments
- Anonymous
August 10, 2016
Many customers have this problem. Backup is designed to max out the IO system. This design goal implies that other workloads suffer.Please address this severe issue that pretty much every customer has.For example, add a throttle percentage option to backup that causes the backup operation to leave a certain percentage of resources unused. - Anonymous
August 22, 2016
Brilliant story, Jürgen!Thanks for sharing!Kind regards from LinzRudi- Anonymous
August 22, 2016
BTW: I recently fell over a metapher for this kind of problems: Noisy neighbor.
- Anonymous
- Anonymous
October 17, 2016
The comment has been removed - Anonymous
October 21, 2016
From the link .. https://msdn.microsoft.com/en-us/library/bb895329.aspxMAX_IOPS_PER_VOLUME is available from SQL 2014 thru SQL 2016... Any reasons why this setting will not work on SQL 2014, please? - Anonymous
February 28, 2017
Today was the day ...Due to an error in the job schedules we had a full backup of the KIS database during prime time.8 vCPUs went into the 90s and the system felt really slow.I should have a deeper look into the abovementioned instructions ... - Anonymous
June 07, 2017
We did it the other way round ...We saw that the full backup made reads of approx. 1 GB/s on two DB volumes and a write rate of approx. 350 MB/s using 6 backup files. An this made a CPU load of 95 % for 13 minutes, what made a noticable impact on the workload.We asked the storage guys to set a bandwidth limit on the LUN for the backups of 200 MB/s.This capped the CPU load to 60 % and raised the run time of the full backup from 13 to 23 minutes.I am aware that this is a very specific solution in a specific landscape.But maybe it can help somebody out there.Kind regards from Linz, Rudi