ScaleArc

Architecting for Zero Downtime at zulily

I recently had an opportunity to hear Don Allen discuss best practices for architecting zero-downtime database environments with other IT peers. Don’s company, zulily, is an eCommerce retailer that focuses on delivering special finds to customers every day, so ensuring that customers have fast and constant access to the zulily site is essential.

As the head of technical operations at zulily, Don has a small team of IT professionals handling everything from the phones to the data center infrastructure. To power their website content, they run a three-node MySQL cluster: one primary, one readable secondary, and a third that functions as a DR node and lags two hours behind the primary, giving the team a window to recover in the unlikely case of data corruption.

Because of the nature of zulily’s business, they need a database infrastructure that is fault tolerant and highly available. MySQL alone does not meet that requirement. On its own, it left Don’s team hard-coding connection strings at the application layer to ensure that specific database nodes received traffic. They could change DNS to redirect a connection string to a different database server, but it takes only one mistake to break MySQL replication and bring everything crashing down. Don went searching for a better way.

The path to zero downtime

Before zulily, Don worked in the Seattle area for companies like Costco and Drugstore.com. He was accustomed to enterprise-grade, high-availability databases like Oracle RAC and SQL Server. When he joined zulily, he no longer had access to these highly available database solutions. About two years ago, Don was working with an infrastructure advisor when he learned about ScaleArc’s database load balancing software. It sits between their applications and their MySQL cluster. Instead of hard-coding connection strings to a specific database node, they just point the apps to ScaleArc. This architecture lets them offload database engineering to ScaleArc, removing the need for zulily’s developers to code this functionality. Now they can scale database nodes on the fly, with no app changes, and they also get deep metrics on their SQL traffic. These SQL logs help with capacity planning and provide other insights that help them continue to improve operations.

During the lunch discussion, another IT staffer asked how zulily handles maintenance windows, since they are a 24/7/365 operation. Don’s answer was simple: They don’t have maintenance windows anymore. ScaleArc lets them upgrade or patch without interrupting the app. In the ScaleArc GUI, they mark the database server that they want to patch or upgrade as offline. The ScaleArc software then gracefully bleeds that server of all connections and workloads, shifting them to the other available servers. When they are done patching, they bring the server back online, and ScaleArc adds it back into the cluster rotation and starts serving traffic to it.
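
For readers who want a feel for what "gracefully bleeding" a server involves, here is a minimal Python sketch of the general connection-draining idea. It is not ScaleArc's implementation; the `Pool` and `Server` classes, their method names, and the round-robin policy are all hypothetical and exist only for illustration. A server marked offline stops receiving new connections, its in-flight connections are allowed to finish, and it is safe to patch once none remain.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Server:
    name: str
    online: bool = True   # eligible to receive new connections
    active: int = 0       # connections currently in flight


class Pool:
    """Toy round-robin pool that can drain a server before maintenance.

    Hypothetical illustration only; not ScaleArc's API or behavior.
    """

    def __init__(self, names):
        self.servers = [Server(n) for n in names]
        self._turn = count()

    def acquire(self):
        # Only servers marked online receive new connections.
        candidates = [s for s in self.servers if s.online]
        if not candidates:
            raise RuntimeError("no servers online")
        server = candidates[next(self._turn) % len(candidates)]
        server.active += 1
        return server

    def release(self, server):
        # In-flight work finishes normally, even on a draining server.
        server.active -= 1

    def mark_offline(self, name):
        self._get(name).online = False

    def mark_online(self, name):
        self._get(name).online = True

    def is_drained(self, name):
        # Safe to patch once the server is offline and idle.
        server = self._get(name)
        return not server.online and server.active == 0

    def _get(self, name):
        return next(s for s in self.servers if s.name == name)
```

In this sketch, an operator would call `mark_offline("db1")`, wait until `is_drained("db1")` returns true, patch the server, and then call `mark_online("db1")` to put it back into rotation.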

ScaleArc has really changed how zulily does operations, and it has dramatically improved their uptime. They are looking at whether they can take advantage of ScaleArc’s caching to boost performance. Don’s team is also looking at moving some components to the cloud. Luckily, ScaleArc can follow them there.

To learn more about ScaleArc’s database load balancing software, click on this link.
