SQL Server AlwaysOn – RPO and RTO discussion September 25th, 2013

Vinod Kumar

Every single day I meet a number of customers who want to implement SQL Server AlwaysOn as part of their migration to SQL Server 2012. Though most of the discussions and architecture are easy to understand, most of the DBA’s ask a simple yet important question of what is the latency for Synchronous replica’s? What is the RTO and RPO for secondary servers? We have explained the same in Chapter 1 of our book. Let me bring out some of these concepts that I discuss in this blog for your reference.

Recovery Point Objective (RPO)

Recovery Point Objective (RPO) is defined as the amount of acceptable data loss or the point in time up to which the data can be recovered. Important metrics that can be used to monitor RPO when working with SQL Server AlwaysOn are:

The log send queue size at any point will give us an indication approximately how much log has not been sent in KB. This is the amount of log secondary doesn’t have at the time of failover and the amount of data loss customers will experience.

sys.dm_hadr_database_replica_states.last_commit_time column in DMV on both primary and secondary and look at the time difference. On the secondary the time will be the same as on the primary. The secondary rows on the primary will show the time reported back from each secondary. The difference of the time between the primary row and secondary row approximately represents the RPO assuming the redo is caught up and the progress is reported back.

Note: If there are more than one secondary replica, when making the decision to failover, analyze these values for all secondary replicas, as failing over to a secondary replica with more favorable last commit time may minimize data loss.

Recovery Time Objective (RTO)

Recovery Time Objective (RTO) is defined as the time taken to restore normal operations after a failure or planned failover. Important metrics for determining RTO when working with SQL Server AlwaysOn is:

The redo queue size divided by the redo rate should give the DBA some understanding of what this overhead time is before the database will be online and available for the primary role. So, Redo Catch up time = redo_queue_size(KB)\redo_rate (KB\sec). The values are reported in sys.dm_hadr_database_replica_states.

Note: If there are more than one secondary replica, when making the decision to failover, analyze these values for all secondary replicas, as failing over to a secondary replica with more favorable last commit time may minimize downtime.

This blog is inspired from Book content: SQL Server 2012 AlwaysOn Joes 2 Pros®: A Tutorial for Implementing High Availability and Disaster Recovery using AlwaysOn Availability Groups” (Paperback, Kindle).

Would love to also learn from you how you design and work on your RTO and RPO in your production environments with SQL Server AlwaysOn.

Share this article

Tags: , , , , , , , ,

This entry was posted on Wednesday, September 25th, 2013 at 10:22 and is filed under Technology. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.


4 Responses to “SQL Server AlwaysOn – RPO and RTO discussion”

  1. Nilmov says:

    Good to know the RTO and RPO but when i am designing , how can i design the RTO and RPO??Isnt that figures up in the air and the answer is it depends….

    Thanks. That was certainly an useful info.

    • Vinod Kumar says:

      You are correct in a way. There are other factors and features of SQL Server like “Indirect CheckPoint” that can also influence your RTO, RPO as a single instance. This article was fundamental from an SQL Server AlwaysOn discussion that I get from customers. So thought to write them down :).

  2. Girijesh says:

    Informative!

Leave a Reply



 

Email
Print