ScaleArc

Architecting for Uptime – Avoiding App Interruptions during Database Failover

We got a lot of great questions during yesterday’s webinar, Achieving Zero Downtime during Unplanned Outages. We also discussed the same topic in depth at the Silicon Valley SQL Server User Group last night, answering more questions along the way. This synopsis should help anyone looking to architect for increased uptime and app availability.

What’s the difference between what AlwaysOn does on its own vs. what happens with ScaleArc added to the mix?

ScaleArc enhances AlwaysOn with these capabilities:

AlwaysOn provides replicas, but it doesn’t offer any geographical awareness in the load balancing amongst the replicas. When you’re building for resiliency across data centers, you want to serve reads locally. Why send a user’s read request to New Jersey if he’s sitting in California and there’s a data center in Arizona? But the AlwaysOn Availability Group just serves up a list of available read servers, and apps go to them randomly. You’ll get faster app performance with geo-aware load balancing.

For cross-data center failover, AlwaysOn has no notion of a universal IP address. DNS name failover also doesn’t work across data centers. With ScaleArc, the failover is seamless, with the app not even aware the destination server has changed. Most importantly, apps don’t hang during failover – see the next question on what happens to connections.

AlwaysOn supports synchronous and asynchronous replication. Between data centers, you need to use asynchronous replication, because waiting for every write to commit synchronously on a remote replica would take too long and frustrate your users. With asynchronous replication, you have to know how far out of sync a replica’s data is and decide whether it’s safe to serve. ScaleArc takes care of this automatically, relying on a user-set threshold for how much replication lag is acceptable. No changes to the app – we just automatically monitor for stale data and won’t serve it up.
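
To make the geo-awareness and lag-threshold ideas concrete, here is a minimal Python sketch of how a load balancer might choose a replica for a read. It illustrates the concept only and is not ScaleArc’s actual implementation: the `Replica` class, the `pick_read_replica` function, the region names, and the lag values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    region: str          # data center location (hypothetical labels)
    lag_seconds: float   # how far this replica trails the primary


def pick_read_replica(replicas, nearest_region, max_lag_seconds):
    """Return the best replica for a read, or None if all are too stale.

    Replicas whose lag exceeds the threshold are never served. Among the
    rest, a replica in the client's nearest region wins; ties are broken
    by the lowest lag.
    """
    fresh = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    if not fresh:
        return None  # a caller would fall back to the primary here
    return min(fresh, key=lambda r: (r.region != nearest_region, r.lag_seconds))


replicas = [
    Replica("nj-1", "new-jersey", 0.2),
    Replica("az-1", "arizona", 1.5),
    Replica("az-2", "arizona", 12.0),  # too far behind to serve
]
chosen = pick_read_replica(replicas, nearest_region="arizona", max_lag_seconds=5.0)
print(chosen.name)  # az-1
```

The nearby replica wins even though the New Jersey replica is slightly fresher, and the second Arizona replica is skipped because it exceeds the five-second threshold.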

What happens to app connections during failover?

In a typical database failover, all connections between the app and the primary server fail. At the same time, when a secondary is being promoted, all connections to that secondary also fail. Because all of these connections fail, users see a large number of app errors. In many cases, the apps can’t reestablish a connection until the secondary is fully operational as the new primary, and the timeout is so long that the apps hang. Then the apps have to be restarted.

With ScaleArc, we don’t drop the connections – only those queries that were past ScaleArc and not yet completed at the database will fail. ScaleArc is able to do two things to reduce app errors. First, ScaleArc maintains the connections to the secondary that is being promoted to the primary, so apps don’t error out. Second, ScaleArc maintains a queue, so queries that were in flight to ScaleArc and not yet over to the database are held and then sent to the new primary.

With this approach, ScaleArc is able to dramatically reduce the app errors – those delays just look like simple TCP delay to the app instead of sending back errors or causing the app to hang.
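
The queueing behavior can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not ScaleArc’s implementation: the `FailoverQueue` class and its method names are hypothetical, the "servers" are plain callables, and a real proxy would keep the client connection open and block until the answer arrives rather than return `None`.

```python
from collections import deque


class FailoverQueue:
    """Hold queries while no primary is available, then replay them in order."""

    def __init__(self, primary):
        self.primary = primary   # callable that runs a query; None during failover
        self.pending = deque()

    def submit(self, query):
        if self.primary is None:
            self.pending.append(query)  # held, so the app only sees a delay
            return None
        return self.primary(query)

    def begin_failover(self):
        self.primary = None

    def complete_failover(self, new_primary):
        """Point at the new primary and flush held queries in arrival order."""
        self.primary = new_primary
        results = []
        while self.pending:
            results.append(new_primary(self.pending.popleft()))
        return results


proxy = FailoverQueue(primary=lambda q: f"old:{q}")
proxy.submit("UPDATE a")   # runs on the old primary
proxy.begin_failover()
proxy.submit("UPDATE b")   # held in the queue
proxy.submit("UPDATE c")   # held in the queue
results = proxy.complete_failover(lambda q: f"new:{q}")
print(results)  # ['new:UPDATE b', 'new:UPDATE c']
```

Queries that arrive mid-failover are delayed rather than rejected, which is why the app sees latency instead of errors.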

How do you size the number of servers in a given replication set up? Is there a minimum? Is there a point of diminishing returns?

For many customers looking to upgrade to SQL Server 2012 or SQL Server 2014, replication is a new architecture, so we get lots of “best practices for replication” types of questions. For guidance on server count, look at your percentage of reads vs. writes. If your app is 95% reads, adding replicas all the way up to the max in SQL Server 2014 – 1 primary with 8 secondaries – will add read capacity linearly, improving performance with each new server. If on the other hand you’re at 70% reads, with 30% writes – which is unusually high – then by the time you’ve added 3 replicas, you’ve maxed out the benefit. In that architecture, you’ve got 75% of your database capacity (3 out of 4 total servers) for reads, so you’ve matched resources to traffic patterns.
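
The sizing arithmetic above can be checked with a short Python sketch. It rests on simplifying assumptions that are ours, not a sizing rule from SQL Server or ScaleArc: every server has equal capacity, the primary handles only writes, and the secondaries handle only reads. The function names are hypothetical.

```python
MAX_SECONDARIES = 8  # SQL Server 2014 limit cited in the text


def read_share(secondaries):
    """Fraction of total servers devoted to reads (1 primary + n secondaries)."""
    return secondaries / (secondaries + 1)


def secondaries_needed(read_fraction, max_secondaries=MAX_SECONDARIES):
    """Smallest secondary count whose read share covers the read fraction.

    Returns the maximum if even the largest allowed set falls short.
    """
    for n in range(1, max_secondaries + 1):
        if read_share(n) >= read_fraction:
            return n
    return max_secondaries


print(secondaries_needed(0.70))  # 3
print(secondaries_needed(0.95))  # 8
print(read_share(3))             # 0.75
```

At 70% reads, three secondaries already give reads 75% of the servers, so more replicas add little. At 95% reads, even eight secondaries leave the read share below the read fraction, so each added replica still helps.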

With ScaleArc, this support for read/write split is automatic – you don’t have to go back in and reprogram your app to add read-intent strings to direct reads to your secondaries. Plus, you’re offloading your primary server, so now your write performance will also improve.
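
For a rough idea of what a transparent read/write split involves, here is a deliberately naive Python sketch. It is not how ScaleArc classifies queries: the `route` function is hypothetical, and it looks only at the first keyword, so statements such as `SELECT ... INTO` (which write) would be misrouted. Without a layer like this, the app itself has to request read-only routing, for example by setting `ApplicationIntent=ReadOnly` in its connection string.

```python
def route(query, in_transaction=False):
    """Return "secondary" for plain reads and "primary" for everything else.

    Reads inside an open transaction stay on the primary so they see the
    transaction's own writes.
    """
    words = query.split(None, 1)
    first_word = words[0].upper() if words else ""
    if first_word == "SELECT" and not in_transaction:
        return "secondary"
    return "primary"


print(route("SELECT * FROM orders"))             # secondary
print(route("UPDATE orders SET paid = 1"))       # primary
print(route("SELECT 1", in_transaction=True))    # primary
```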

Lots to Learn

With new replication and failover techniques coming in SQL Server 2012 and SQL Server 2014, it’s the right time to be learning how your apps can benefit and how you can design for reduced downtime – both planned and unplanned. To find out how to avoid downtime when patching database servers and performing other maintenance tasks, join us for part two of our zero downtime webinar series.
