Every single day I meet a number of customers who want to implement SQL Server AlwaysOn as part of their migration to SQL Server 2012. Though most of the discussions and architecture are easy to understand, most of the DBA’s ask a simple yet important question of what is the latency for Synchronous replica’s? What is the RTO and RPO for secondary servers? We have explained the same in Chapter 1 of our book. Let me bring out some of these concepts that I discuss in this blog for your reference.
Recovery Point Objective (RPO)
Recovery Point Objective (RPO) is defined as the amount of acceptable data loss or the point in time up to which the data can be recovered. Important metrics that can be used to monitor RPO when working with SQL Server AlwaysOn are:
The log send queue size at any point will give us an indication approximately how much log has not been sent in KB. This is the amount of log secondary doesn’t have at the time of failover and the amount of data loss customers will experience.
sys.dm_hadr_database_replica_states.last_commit_time column in DMV on both primary and secondary and look at the time difference. On the secondary the time will be the same as on the primary. The secondary rows on the primary will show the time reported back from each secondary. The difference of the time between the primary row and secondary row approximately represents the RPO assuming the redo is caught up and the progress is reported back.
Note: If there are more than one secondary replica, when making the decision to failover, analyze these values for all secondary replicas, as failing over to a secondary replica with more favorable last commit time may minimize data loss.
Recovery Time Objective (RTO)
Recovery Time Objective (RTO) is defined as the time taken to restore normal operations after a failure or planned failover. Important metrics for determining RTO when working with SQL Server AlwaysOn is:
The redo queue size divided by the redo rate should give the DBA some understanding of what this overhead time is before the database will be online and available for the primary role. So, Redo Catch up time = redo_queue_size(KB)\redo_rate (KB\sec). The values are reported in sys.dm_hadr_database_replica_states.
Note: If there are more than one secondary replica, when making the decision to failover, analyze these values for all secondary replicas, as failing over to a secondary replica with more favorable last commit time may minimize downtime.
This blog is inspired from Book content: SQL Server 2012 AlwaysOn Joes 2 Pros®: A Tutorial for Implementing High Availability and Disaster Recovery using AlwaysOn Availability Groups” (Paperback, Kindle).
Would love to also learn from you how you design and work on your RTO and RPO in your production environments with SQL Server AlwaysOn.