There are plenty of lessons to be learned from the latest Amazon outage, so this seems like an opportune moment to address how Xeround was affected (or not so affected) by it, and how we handle availability for MySQL apps running in the EC2 cloud.
First things first: during the Amazon outage, Xeround customers did not suffer any disruption to their live DB instances! The recent outage was due to a malfunction in the Elastic Block Store (EBS) service. Since Xeround’s database is in-memory and we only use EBS for checkpoints and redo logs, the outage did not interrupt running instances, even though we were in the same availability zone.
With our users’ database service uninterrupted, let’s discuss what we did experience:
- A handful of newly registered users encountered problems when trying to create a new DB instance in our EC2-US-East datacenter and were directed to our other datacenters, on EC2-EU-West and on Rackspace. This happened because we could not use Amazon’s API to purchase additional machines, and the space on our pre-provisioned servers had been exhausted. (BTW – besides Xeround, it seems that the only ones that survived the Amazon outage were those that massively over-provisioned, such as Netflix. However, for many of our users the entire point of moving to a cloud database is to save cost, so they should not have to over-provision to ensure availability.)
- Instance backups were unavailable, because we provision EBS volumes for our backups and EBS itself was unavailable.
Hopefully we won’t experience another large-scale cloud outage anytime soon, but we all have to face the facts: in the dynamic environment of the cloud, server crashes, hardware malfunctions and other blips in the “availability” skies are all part of the territory. The trick is to expect them and to be able to address them in a way that is:
- Transparent – So that the application is unaware of it
- Immediate – So it won’t affect the availability of the service
- Painless – So you won’t lose sleep over availability issues or send your developer/DBA home with a headache every day trying to keep up with maintaining the service.
How does Xeround handle high availability?
Our native cloud service was designed with a deep understanding that maintaining high availability in the cloud is inherently different from maintaining high availability in your traditional on-premise datacenter. In the cloud, high availability isn’t just about hardware resiliency anymore. You can’t just plug in an extra power supply or network card, or swap hard drives, etc.
High availability in the cloud depends on:
- The availability of ‘more of the same’ resources
- The ability to dynamically provision them across any and all datacenters/configurations – be it within the same datacenter, across regions, across availability zones and even across cloud providers.
In the cloud, there isn’t much that can be done to lower the chances of a machine failing, but in case it does fail – or more accurately: WHEN it fails – you want to be able to seamlessly provision a new one on-the-go and maintain service.
You start the “resiliency-chain” by launching a new machine in the same availability zone to replace the one that has failed. If that fails, you’ll want to expand to another zone or region, and sometimes you’ll even want to have another cloud provider altogether on stand-by for failover purposes.
Another point worth mentioning is that high availability in the cloud works best for stateless applications, where the ability to spawn resources at will is very helpful and powerful. Thus, your architecture should keep the statefulness in one layer: the database. So, despite its stateful nature, the DB itself should be able to handle high availability.
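To make the chain concrete, here is a minimal sketch of that fallback order. It is an illustration only, not Xeround's actual code: `Location`, `ProvisioningError`, `provision_replacement` and the `fake_launch` stand-in are hypothetical names, and a real launcher would call each cloud provider's own API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


class ProvisioningError(Exception):
    """Raised by a launcher when a location cannot supply a machine."""


@dataclass(frozen=True)
class Location:
    # Hypothetical description of where a machine could be launched.
    provider: str
    region: str
    zone: str


def provision_replacement(
    chain: Sequence[Location],
    launch: Callable[[Location], str],
) -> Optional[Tuple[Location, str]]:
    """Try each location in order and return the first successful launch."""
    for location in chain:
        try:
            return location, launch(location)
        except ProvisioningError:
            continue  # this link of the chain is down, try the next one
    return None  # every link failed


def fake_launch(location: Location) -> str:
    # Stand-in launcher that simulates an outage of the whole us-east region.
    if location.region == "us-east":
        raise ProvisioningError(f"cannot launch in {location.zone}")
    return f"machine-in-{location.zone}"


# Ordered from closest to furthest: same zone, another zone,
# another region, another provider.
chain = [
    Location("ec2", "us-east", "us-east-1a"),
    Location("ec2", "us-east", "us-east-1b"),
    Location("ec2", "eu-west", "eu-west-1a"),
    Location("rackspace", "us", "dc-1"),
]
result = provision_replacement(chain, fake_launch)
print(result)
```

With the simulated regional outage, the first two links fail and the replacement lands in the EU region. The ordering of the list is the whole policy, so adding a stand-by provider is just one more entry at the end.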
What goes on in Xeround’s backend to enable high availability?
Xeround’s distributed architecture spreads the data across virtual partitions, with each single partition replicated across multiple servers. If a server fails, all the data is still available in the surviving replicas and the DB is operational. As the self-healing process kicks in, replacement servers are acquired on-the-fly and re-sync is executed from the remaining replicas.
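The idea can be illustrated with a toy model. This is a simplified sketch under our own assumptions (round-robin placement, one replacement server per failure, the hypothetical `Cluster` class), not a description of the real backend:

```python
class Cluster:
    """Toy model of virtual partitions replicated across servers."""

    def __init__(self, servers, num_partitions, replicas=2):
        self.replicas = replicas
        self.servers = list(servers)
        # partition id -> servers holding a replica (round-robin placement,
        # which assumes there are at least `replicas` servers)
        self.placement = {
            p: [self.servers[(p + i) % len(self.servers)] for i in range(replicas)]
            for p in range(num_partitions)
        }

    def fail(self, server):
        """Drop a server and every replica it was holding."""
        self.servers.remove(server)
        for holders in self.placement.values():
            if server in holders:
                holders.remove(server)

    def available(self):
        """The DB is operational while every partition has a surviving replica."""
        return all(holders for holders in self.placement.values())

    def heal(self, new_server):
        """Add a replacement server and re-sync under-replicated partitions to it."""
        self.servers.append(new_server)
        resynced = []
        for partition, holders in self.placement.items():
            if len(holders) < self.replicas:
                # In a real system the data would be copied from a survivor here.
                holders.append(new_server)
                resynced.append(partition)
        return resynced


cluster = Cluster(["s1", "s2", "s3"], num_partitions=6)
cluster.fail("s2")
print(cluster.available())  # every partition still has one replica
print(cluster.heal("s4"))   # partitions re-synced onto the replacement
```

Losing `s2` removes one replica from four of the six partitions, yet each partition keeps a survivor, so the database stays up while the replacement server is brought back in sync.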
Xeround’s backend is extremely flexible and supports various options for high availability. Our technology allows us to arbitrarily set the number of data replicas that we will manage as well as set the role each plays and the location each resides in.
Presently, our service is offered with two active-active replicas, both residing in the same datacenter (and availability zone), as a means of protection against server failure.
Down the road, we plan to offer a multitude of high availability options, most of which we have already deployed in on-premise installations:
- Geo-distribution for read-intensive applications (where write performance can endure latency) – we have successfully deployed a setup across 3 remote DCs in the US, where writes are performed synchronously (hence the high latency) and reads are served from the local copies. In the cloud, this model translates to the availability zones of a single region, with replicas managed in every zone offered within the region. Alternatively, the same can be applied to achieve high availability across regions.
- Active/Passive or Active/Active with automatic DRP – using this model, we manage 5 replicas of the database: 3 are in the primary datacenter and 2 are in a secondary datacenter. This allows us to withstand both the failure of single machines at each site and the failure of an entire datacenter. Under normal conditions, writes are replicated synchronously within the primary site and asynchronously to the backup site. In cloud terms, the different sites can be in different regions/datacenters or even at different cloud providers.
- Another, and perhaps simpler, approach we’re considering is having the automated backups we provide to our DB instances copied to other cloud providers, which would allow users to restore an instance on another cloud provider. Because Xeround is vendor-agnostic, we can run on any cloud provider, allowing you to seamlessly migrate between clouds with no need for code or architectural changes.
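The write path of the five-replica Active/Passive model above can be sketched as follows. This is a simplified illustration with hypothetical names (`Replica`, `ReplicatedDB`, `ship_backlog`); a real system would ship the backlog in the background and handle acknowledgements, ordering and failures:

```python
from collections import deque


class Replica:
    def __init__(self, name, site):
        self.name = name
        self.site = site
        self.data = {}


class ReplicatedDB:
    """3 replicas in the primary site (synchronous), 2 in the secondary (asynchronous)."""

    def __init__(self):
        self.primary = [Replica(f"p{i}", "primary") for i in range(3)]
        self.secondary = [Replica(f"s{i}", "secondary") for i in range(2)]
        self.backlog = deque()  # writes not yet shipped to the secondary site

    def write(self, key, value):
        # Synchronous part: every primary replica applies the write
        # before the call returns to the application.
        for replica in self.primary:
            replica.data[key] = value
        # Asynchronous part: the secondary site catches up later.
        self.backlog.append((key, value))

    def ship_backlog(self):
        """Apply queued writes to the secondary site; returns how many were shipped."""
        shipped = 0
        while self.backlog:
            key, value = self.backlog.popleft()
            for replica in self.secondary:
                replica.data[key] = value
            shipped += 1
        return shipped


db = ReplicatedDB()
db.write("a", 1)
db.ship_backlog()
db.write("b", 2)  # acknowledged by the primary site, not yet at the secondary
print(db.primary[0].data, db.secondary[0].data, len(db.backlog))
```

The sketch also shows the trade-off of this model: writes stay fast because only the primary site is in the synchronous path, but if the whole primary datacenter fails, the secondary site may be missing whatever is still in the backlog.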
There are many options for deploying high availability DB architectures, and we believe the cloud provides many opportunities to deliver them. This allows the application layer to remain stateless and highly available on its own.
With Xeround’s self-healing mechanisms, failover is handled automatically, sparing you the headaches and configuration updates otherwise needed to keep your DB always on. I believe this is what the cloud is supposed to provide: freedom from tedious IT operations, which lets you focus on what really matters – the application itself (and your Spring break plans)!