SQL Server 2012 AlwaysOn – Part 11 – Performance Aspects and Performance Monitoring I
[Edit] You can find the summary of the complete series here.
In this part of the series we want to look a bit more into performance aspects in deploying AlwaysOn and the way how one can monitor the requirements and also the impact of AlwaysOn.
What are potential impacts?
Independent on the Availability Mode being synchronous or asynchronous, AlwaysOn will add network load to the involved systems and infrastructure. As AlwaysOn works, the moment the transaction log buffer in the primary replica is flushed and persisted the same content needs to be moved to the replicas. The transaction log buffer on the primary usually gets flushed with the commit of a transaction, a checkpoint or when the buffer is full. With a workload performing small amount of changes and immediately committing, we are seeing a lot of small buffers persisted and with that a lot network packets of small volume to the secondary replica(s). The complete opposite would be cases of workload modifying a lot of data before committing where the transaction log buffer often is flushed because its maximum size of around 60K is reached. In such cases we usually are seeing less network I/O to the secondary replicas. But each network I/O has a larger volume.
It also needs to be noted that the more replicas are involved the more network traffic will be generated since every secondary replica is supplied separately by the primary replica.
SAP usually is configured in 3-Tiers with the application layer often sending thousands of batches per second to the DBMS server. In larger systems one already needs to care about the configuration of multiple network cards and the fact that scalability in handling network I/O on the database server is in place. Hence using AlwaysOn, will even make the configuration and the capabilities of the network configuration on the DBMS server even more important.
In case of using synchronous Availability Mode added latency could impact the performance of the SAP system. In the asynchronous case, the primary is not waiting on an acknowledge of the I/O to the transaction log from the secondary replica. However in the synchronous mode, the commit towards the application is delayed until an acknowledge from the synchronous secondary replicas return.
When do the secondary replicas return the acknowledge? In the synchronous case, the acknowledge is sent when the content sent to the replica is persisted in the transaction log of secondary replica. Means as latency in the synchronous case we roughly need to add:
- The time the package is spending ‘on the wire’ to the secondary replicas
- The time it takes to persist the package in the transaction log of the secondaries
- The time it takes to send the acknowledge back to the primary replica
Without doubt, the latency ‘on the wire’ is dependent on the distance and often as well the number of switches.
Where do we see the impact of synchronous Availability mode in SAP? When we usually look at a SAP system and want to find out what part of the overall response time the database causes, we usually look into SAP transaction ST03 which gives us possibilities of slicing through the performance data in different granularities and different ways. In the example below we chose to categorize the data by the different SAP request types.
In the past we usually saw impact of AlwaysOn or Database Mirroring mostly around the Avg. DB Time of the update requests and the Background requests. To a lesser degree we experienced the impact on dialog response time and RFCs. But let’s not get mistaken. With too high latencies between the primary and secondary replicas, interactive users whose requests are handled in ‘dialog’ workprocesses will be affected as well and if only by second or third order effect of locks being held longer on the database.
Step 1: Some things to think about before starting with AlwaysOn in non-shared disk configuration
The goal of AlwaysOn in a non-shared disk configuration is to get the transaction log buffers transferred to the secondary as fast as possible in order to have the content persisted there. That is essential for minimizing or avoiding data loss. Doesn't matter whether the data is immediately redone or not on the secondary replicas, once it is persisted in the transaction log of the secondary replicas, an essential task of AlwaysOn is fulfilled. Hence it is not acceptable to argue that it doesn't matter that data to be sent to the replicas is piling up in the transaction log of the primary replica because the network infrastructure can’t handle the load. The first underlying goal of when using AlwaysOn is to move the data describing the changes on the primary replica as soon as possible to the secondary replicas to minimize eventual loss of committed data.
We also discourage from using hardware setups and configurations which are vastly different in their performance characteristics. As mentioned earlier, especially in the synchronous case, the I/O latency on the synchronous secondary replicas does count and plays an important role
Adding automatic failover on top of synchronous availability mode, one also should have good processor performance and good I/O latency for the data files since we don't want to pile up huge volumes of data to be redone on the secondary replicas. In the case of failover this would delay one of the secondary replicas moving into the primary role [CG2] .
Step 2: Let’s get familiar with the Performance Counters for AlwaysOn
Comparing to Database Mirroring (DBM), AlwaysOn required some serious rework in the counters and in the structure of counters, despite the fact that the transfer of the data is more or less following the very same path as in DBM. Reason for this is:
- The fact that we can handle more than one secondary replica
- That an Availability Group can contain more than one database
As a result, we look at two groups of counters for AlwaysOn:
- SQL Server:Availability Replica – On the primary replica a set of counters will exist for every secondary replica. Every secondary replica is represented as an ‘instance’. In case of looking into this Counter group on a secondary replica, just the primary replica is shown as an instance. Means the primary does record the data sent and received from each of the secondary replicas and the individual secondaries just record what they received and sent from/to the primary
- SQL Server:Database Replica – needed to be introduced in order to get the specific data for each of the different databases within an Availability Group. The big difference to DBM is that the failover group now can have multiple databases. Though replication of data between the different databases within an AG is not synchronized, it is very important to be able to identify which of the databases eventually piles up lock content to be transmitted or which database is causing most of the network traffic.
Another few new counters got added to SQL Server:Databases. These counters give information about the SQL Server Log Cache of each individual database which is playing an instrumental role when transferring data to the secondary replicas or on the secondary for redo operations (also check out the graphics at the end of this article to see the role of the Log Pool: https://blogs.msdn.com/b/saponsqlserver/archive/2012/02/07/sql-server-2012-alwayson-what-is-it.aspx )
The counters are named:
- Log Pool Cache Misses/sec – showing the number of log blocks which could not be found in the Log Cache
- Log Pool Disk Reads/sec – Number of Reads issued by the Log Cache to read log blocks from disks
- Log Pool Requests/sec – Number of log block reads per sec
In all the next steps we will also relate to a list of other counters which we will name in order to investigate the health of the system.
Step 3: Analyze and monitor Data transfer and network volume between replicas
As a first step when introducing AlwaysOn in a non-shared disk configuration, we need to make sure that the network infrastructure can handle the additional network volume and the additional network I/Os on the primary DBMS server.
The best tool to measure and log data around AlwaysOn and network is the SQL Server Performance Monitor plus checking some of the wait types SQL Server is exposing in its DMVs.
The counters we want to look at for investigating for this case:
Of the Network Interface Performance Counters we want to collect the counters:
- Bytes Received/sec
- Bytes sent/sec
- Packets Received/sec
- Packets Sent/sec
- Current Bandwidth
For every NIC or NIC Team instance which is used.
From the ‘Processor Information’ Collection (or ‘Processor’ Collection if Windows Server 2008 is used) we want to record the counters:
- %DPC Time
- % Privileged Time
For every single processor instance
We want to record all the counters of the SQL Server:Availability Group Replica separately for each of the secondary instances.
We also want to collect all the counters of the SQL Server:Database Replica separately for every database instance within the AG.
Another counter to be collected is: SQL Server:Memory Manager - Log Pool Memory (KB)
Of the ‘Logical Disk’ Counter Group add the following counters for the volumes the SQL Server transaction log and data files are stored on:
- Avg. Disk Bytes/Write
- Avg. Disk sec/Write
- Disk Write Bytes/sec
Of the SQL Server:SQL Statistics counter group, please select the following counter:
- Batch Request/sec
Having these counters set up for collection on the primary and the secondary instances, we run a rather representative workload having our asynchronous AlwaysOn setup working. After the workload runs we now investigate what we recorded on performance counter data in order to answer a few questions.
Is there an indication that the network develops into a bottleneck?
First check for the primary replica which had to handle the workload and the replication of the data would be to investigate the Counter of the ‘Current network Bandwidth’ for each instance and compare that value with the sum of ‘Bytes Received/sec’ and ‘Bytes Sent/sec’ for the very same NIC or NIC teaming instance. If the sum of the two values is not reaching the value of current bandwidth or better has quite some headroom, the network capacity in terms of data volume transmitted seems to keep up. However please keep in mind that AlwaysOn can contribute to significant data volume transferred when e.g. performing index builds/rebuilds in a parallel manner or having massive loads going into the database. It easily can happen that such activity creates around 50-100MB/sec on data volume which is written into the log and which needs to be moved to the secondary replicas.
However knowing that the theoretical network bandwidth is not good enough to know that the network can keep up. The question is whether we could use the network bandwidth. In order to investigate that we want to check whether none of the CPUs of the primary server is mostly busy handling DPCs resulting out of network traffic. For that same reason we want to check the performance counter ‘% DPC Time’ for each of the CPUs. Situations we DON’T want to see is one CPU being mainly busy or completely busy with handling DPCs. This would indicate that Windows RSS is not working. In such cases the ability of the whole networking could be severely capped, besides having enough bandwidth available. Please refer to this article to take further steps to get around this bottleneck: https://blogs.msdn.com/b/saponsqlserver/archive/2012/01/12/network-settings-network-teaming-receive-side-scaling-rss-amp-unbalanced-cpu-load.aspx
Are we piling up Log Content in the primary replica ?
One of the most important questions to be answered since even the best network capabilities are useless if we still are generating more log content than we can somehow transfer over to the other side. Not being able to move the transaction log data to a replica has two effects:
- The asynchronous secondary replica which can’t receive the volume of data in time as it gets generated is falling back and in case of a manual failover forced by an outage of the primary replica will encounter data loss.
- The transaction log on the primary replica is forced to keep the data which has not been sent yet to a replica physically in the transaction log. Even transaction log backups will not be able to truncate that data. In extreme cases where a secondary replica falls back more and more, situations of full transaction logs can occur. This also is true for the case that the data synchronization to a replica is suspended. For the time of the suspension the data which, due to the suspension has not been sent yet to the suspended replica will remain in the transaction log of the primary (and with that in the transaction log of all the active (non-suspended replicas). Transaction log backups will not free up the space in the transaction from all the log content that hasn't been yet sent to the secondary replicas.
- If one would switch a configuration where a secondary replica is not able to receive the data volume as it is generated to a synchronous relationship, the processing on the primary replica would slow down significantly since the network throughput between the primary and the secondary replica will determine when the primary replica can acknowledge a transaction to the application.
Hence the fact of falling behind in getting the log data sent to a secondary with an asynchronous AlwaysOn configuration would be a clear sign to NOT even attempt a synchronous configuration.
There are two ways in order to recognize/monitor this fact. An easy one is to run this query on the primary or the secondary replicas:
SELECT ag.name AS ag_name, ar.replica_server_name AS ag_replica_server, dr_state.database_id as database_id,
dr_state.log_send_queue_size, is_ag_replica_local = CASE
WHEN ar_state.is_local = 1 THEN N'LOCAL'
ELSE 'REMOTE'
END ,
ag_replica_role = CASE
WHEN ar_state.role_desc IS NULL THEN N'DISCONNECTED'
ELSE ar_state.role_desc
END
FROM (( sys.availability_groups AS ag JOIN sys.availability_replicas AS ar ON ag.group_id = ar.group_id )
JOIN sys.dm_hadr_availability_replica_states AS ar_state ON ar.replica_id = ar_state.replica_id)
JOIN sys.dm_hadr_database_replica_states dr_state on
ag.group_id = dr_state.group_id and dr_state.replica_id = ar_state.replica_id;
On the primary replica it will give back a result set like:
As one can see the log_send_queue_size is named with 60 KB (unit of data shown in this column). This is the normal value which is shown all the time and the value expected here. For the primary replica there is no log send queue. Hence the NULL in the result set. If a replica really falls behind, the result could look as well like:
In this case one of the replicas had a log send queue of over 550MB accumulated on the primary side. If this number is growing under the workload, it becomes clear that there is an issue to supply this replica with the data moved. If the same query is executed on the secondary which causes the pile up of the log queue, the result would look like:
As expected we only can see the ‘local’ data when querying on one of the secondary replicas.
Note: Be aware that the log_send_queue_size value is not available for the case that data movement to replicas is suspended manually or the primary loses connection to the secondary replica. This value only can be populated when the secondary replica is alive and active in the AG. Means executing the query shown, the log_send_queue_size would show a value of 0 in case the data movement to this secondary replica is suspended.
Second possibility is analyzing our Performance Counter collection we ran while the workload(s) were running. On the secondary replicas, we want to look at the following counters for each of the databases in the AG:
- SQL Server:Database Replica - Log Send Queue
This is the counter which tells how much kilobytes of transaction log content is remaining on the primary to be transferred to the specific replica.
What we want to see ideally is the value of the value of 60KB throughout in a straight line and eventual burst which can go a bit higher. But what we don't want to see are situations where the log send queue is increasing for minutes only to come back to normal over minutes when the workload eases off a bit. Or extreme cases where the log send queue increases steadily through the hours only to subside during hours of low workload level. In such cases we highly discourage attempting to use synchronous AlwaysOn configurations between the primary and the secondary replica
Important: You also want add some index build/rebuild activity during your workload and monitor this counter.
Note: Despite the fact that you can see the log send queue of secondary replicas on the primary replica through DMVs, the Performance Counters on the primary replica will not offer a counter to check these values of the secondaries.
However looking on the Primary, we will eventually see one additional side effect of the fact that a secondary cannot receive the transaction log data in time. Looking at the Performance counter SQL Server:Memory Manager - Log Pool Memory (KB) on the primary replica we will realize that the Log Pool (referenced as Log Cache in earlier parts) will grow when a secondary falls back. In an attempt to avoid reading the data to be transferred to the secondary replicas directly back from disk, SQL Server will try to cache as much of the data that has not yet been transferred to one of the replicas. Hence the cache can grow and take valuable space out of the buffer pool on the primary.
In the case we showed above where one of the secondary replicas is keeping up with the stream of data and the other one is not, the case is clear. One would need to check the network bandwidth and network handling to and on the secondary replica. If it is a general problem, with all replicas, then again one would need to check the network to those servers and how those handle the network requests. On the primary one should check as well whether there is some network handling (like no Receive Side Scaling causing the problem). A view into those counters:
- SQL Server:Databases - Log Pool Requests/sec
- SQL Server:Databases - Log Pool Disk Reads/sec
- SQL Server:Databases - Log Pool Cache Misses/sec
- SQL Server:Memory Manager - Log Pool Memory (KB)
Ideally we do not get any Log Pool Disk Reads/sec and hardly any Log Pool Cache Misses/sec. If both counters are high it means that the log pool doesn't contain the data to be sent to the secondaries. Instead the data needs to be read from the physical disk which again could have a negative impact on the workload running on the primary replica.
Third possibility is to look at it with the AlwaysOn Dashboard in SQL Server Management Studio. You can configure the columns shown in the dashboard by right-clicking on the grey bars and then selecting the columns you want to see. In our case mainly the Log Send Queue
Are we able to apply all the data to the secondary in time?
Getting the data persisted to the secondary is the most important part for AlwaysOn in order to make sure that data loss is minimized (asynchronous) or not even happening (synchronous). However in order to provide a fast automatic failover or a great readable secondary story, we also rely on the fact that the changes, as persisted in the transaction log of the secondary replicas, are getting applied to the data. Please remember that the secondary replica will acknowledge to the primary when the data sent has been persisted in the local log and NOT when the changes have been applied. Therefore what we want to monitor/analyze as a third step is whether we are able to apply all those changes in reasonable time.
Again we can use our AlwaysOn DMVs to analyze this one during the workload. Use this query:
SELECT ag.name AS ag_name, ar.replica_server_name AS ag_replica_server, dr_state.database_id as database_id,
dr_state.redo_queue_size, is_ag_replica_local = CASE
WHEN ar_state.is_local = 1 THEN N'LOCAL'
ELSE 'REMOTE'
END ,
ag_replica_role = CASE
WHEN ar_state.role_desc IS NULL THEN N'DISCONNECTED'
ELSE ar_state.role_desc
END
FROM (( sys.availability_groups AS ag JOIN sys.availability_replicas AS ar ON ag.group_id = ar.group_id )
JOIN sys.dm_hadr_availability_replica_states AS ar_state ON ar.replica_id = ar_state.replica_id)
JOIN sys.dm_hadr_database_replica_states dr_state on ag.group_id = dr_state.group_id and
dr_state.replica_id = ar_state.replica_id;
The query result could look like:
The column ‘redo_queue_size will give you the volume of log records which still need to be applied to the data. The unit used is kilobytes. As we see in the result set, the one secondary has around 1.3 MB to be applied whereas the other one has around 260MB to be still applied. As it was the case with the query before, issuing the query on a specific secondary will only show the data for this one secondary. There also will be no accurate data (just the last value) if the replica is suspended or disconnected from the AG.
As with the Log Send Queue, the value of the actual ‘Redo Queue Size’ can also be visualized in the AlwaysOn Dashboard within SQL Server Management Studio.
Analyzing the data of our Performance counter recording we focus on the following Performance Counters on the secondary replica(s):
- SQL Server:Database Replica - Recovery Queue
- SQL Server:Database Replica - ReDone Bytes/sec
- SQL Server:Database Replica - Redo blocked/sec
There is no need to look at the counter SQL Server:Database Replica - Redo Bytes Remaining. Despite the close proximity in terminology, this counter has some other purpose. The first counter (Recovery Queue) will be the equivalent to what you are seeing in the DMV sys.dm_hadr_database_replica_states as column redo_queue_size (used in our query above). As in the DMV the value is in kilobytes. The second counter (Redone Bytes/sec) gives an indication on how fast the redo is performing. Independent of the number of cores or CPU threads of the server, there only is one redo thread per database in an AG. Hence getting a feel for how much this thread can redo is important and necessary to set it in correlation with the Recovery Queue.
In our example where our query against the DMVs gave us a backlog of 1.3MB to still be redone on a secondary, this alone might sound dramatic. But when we now check the volume which is redone per second, the whole picture starts to look way better. As an example the screen shot of performance showing the same situation on that secondary replica
The blue line is display the Recovery Queue in kilobytes. The red line does display the bytes redone per second and adjusted in the scale to display in kilobytes. As we can see at all point in time the redo thread was fast enough to immediately apply the changes. Based on what we see in this screen shot one can state that at any point in time, the data changes got applied in less than a second after those were persisted in the transaction log of the secondary.
The DMV sys.dm_hadr_database_replica_states also has a column called redo_rate. Be aware the calculation of the number displayed there does not match the calculations the performance counter is showing. The performance counter value is more up to date so to speak.
The third counter (Redo blocked/sec) would give an indication how often redo operations were blocked by read-only operations when using the readable secondary feature. Since we are using snapshot isolation level to read against the data, such blocking situations are not occurring against the usual data. However there might a possibility where metadata within the read-only database might need to be updated, but other operations on the read-only secondary accessing that metadata might have a short lock.
The screen shot shown of the performance counters above show kind of an ideal picture. What are the more concerning cases? One could list them like:
- The Recovery Queue is steadily increasing under workload and would in the scenario above cross the line of redone bytes/sec further and further
- As an effect of the Recovery Queue getting larger and larger
At the beginning of running the workload the recovery queue (bright blue line) did increase and to a larger degree did exceed the Redone bytes/sec (red line). As a side effect we also can see that the Log Cache (Pool) on the secondary is increasing and with that taking memory away from the buffer pool. To a degree that can have adverse effects since for the redo more dirty pages might have to be pushed out to disk in order to reload other pages which need to be changed. In the last part, as the workload was going down, one sees how the recovery queue winded down. As this took place, the Log Cache/Pool also adjusted. Periodically the Log Cache/Pool will check which transaction Tlog buffers already got applied. Flush those and as a result adapt the Log Cache/Pool
Let’s assume, all our tests showed that we are fine in regards to transferring data via the network over the secondary replicas, that we are no queuing up log records and that we also seem not to create redo queues on the secondary which all still run in asynchronous availability mode, we now can discuss the synchronous availability mode. Something where we look at monitoring in the next part of the series.
Comments
Anonymous
December 26, 2013
Amazing posts. Thank you very much. Just so you know, you missed the "Always on" tag on this post.Anonymous
January 05, 2015
Very informative. Regards Sufian www.sqlship.wordpress.comAnonymous
November 18, 2015
ThanksAnonymous
March 07, 2017
Good article and lots of learning