Scale for Speed and Availability

In this post I’ll go over various options for scaling your business web platform. We’ll take a look at five different approaches. There is no right or wrong approach; it is a matter of which aspects you want to emphasize and what your real-world needs are. I’ll use the Amazon stack in the examples because it is my preferred stack, but the strategies shown here apply to competing stacks as well.

First let’s go over some concepts: region and availability zone.

[Image: regions and Availability Zones. Source: Amazon]

Amazon Availability Zones are distinct physical locations that have low-latency network connectivity between them, are located inside the same region, and are engineered to be insulated from failures in other AZs. Each Availability Zone runs on its own physically distinct, independent infrastructure and is engineered to be highly reliable, with independent power, cooling, network and security. Common points of failure like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, so that even extremely uncommon disasters like fires, tornadoes or flooding would only affect a single Availability Zone. [1]

If your platform mostly serves one area of the world, it makes sense to put your servers in that region. Each region has multiple “Availability Zones”, so you can put redundant servers in different zones within the same region and get better availability. The important twist is that network latency within a region is minimal, so we get separate facilities with good interconnectedness. Here is an image of the available regions along with the number of availability zones in each on AWS. The two green circles are new regions that are opening soon (in Paris and Ningxia).

[Image: AWS global infrastructure map, 12-15-2016. Source: Amazon]

(#1) Scale Up

Things are simple here: you have one machine that serves all your traffic. When you notice that the server can’t handle the traffic, you simply shut down the machine, upgrade the CPU, RAM and storage, and start it again. This approach is the cheapest and is ideal for the MVP stage. Don’t be fooled, though: it can still get you very far, and I would most definitely always begin with this approach.

Cons:

  • single point of failure (all eggs in one basket)
  • downtime when upgrading
  • you can’t adjust dynamically (for spike traffic)

Pros:

  • simplest
  • cheapest (up to the point where a setup that can scale up for traffic spikes and scale down at other times becomes cheaper)

(#2) Scale Out – Single Availability Zone

This is similar to having a single machine in the sense that if our availability zone goes down, production goes down too. Our servers live in a single region, inside a single availability zone. We are merely adding an Elastic Load Balancer that distributes traffic across multiple servers within that availability zone.
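To make the idea of distributing traffic concrete, here is a minimal sketch of a round-robin balancer that skips unhealthy servers. It is a toy model, not how Elastic Load Balancer is implemented; the class name `RoundRobinBalancer` and the server names are made up for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: hands out healthy servers in rotation."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        # Called when a (hypothetical) health check fails.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Only servers that were registered can become healthy again.
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


# Hypothetical server names, all in the same availability zone.
balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
first_round = [balancer.next_server() for _ in range(3)]

balancer.mark_down("web-2")  # simulate one server failing
after_failure = [balancer.next_server() for _ in range(4)]

print(first_round)
print(after_failure)
```

Note that the balancer only protects against a single server failing. If the whole zone goes away, every server in the list goes with it.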

Cons:

  • single point of failure (the availability zone)

Pros:

  • possible to upgrade without downtime (multiple servers)
  • possible to adjust dynamically (for spike traffic)

Approach #3, with multiple zones, is much better. This one is useful as a stepping stone in the right direction when the load is so low that it requires only one server (which has to live in one availability zone).

(#3) Scale Out – Multiple Availability Zones

Amazon EC2/RDS instances have an uptime guarantee of 99.95% on a monthly basis. The maximum permissible downtime equates to roughly 22 minutes per month (assuming 30 days per month). [2]
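The 22-minute figure is easy to check. The short sketch below converts an uptime percentage into permitted downtime; the function name `max_downtime_minutes` is my own.

```python
def max_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime allowed over `days` at the given uptime percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - uptime_percent) / 100


monthly = max_downtime_minutes(99.95)
print(f"{monthly:.1f} minutes per month")  # prints "21.6 minutes per month"
```

So 0.05% of a 30-day month is 21.6 minutes, which rounds to the 22 minutes quoted above.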

Combining multiple availability zones makes an outage very unlikely. The Elastic Load Balancer can detect problems in each zone and redirect traffic to healthy instances.
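To get a feel for why multiple zones help so much, here is a back-of-the-envelope sketch. It rests on two assumptions that are mine, not Amazon's: that zones fail independently of each other, and that the 99.95% figure can be treated as the availability of a single zone. Real failures are not perfectly independent, so read the result as an optimistic upper bound.

```python
def combined_availability(single_zone_availability, zones):
    """Probability that at least one zone is up, assuming independent failures."""
    unavailable = 1 - single_zone_availability
    # All zones must be down at the same time for the platform to be down.
    return 1 - unavailable ** zones


for zones in (1, 2, 3):
    print(zones, f"{combined_availability(0.9995, zones):.10f}")
```

Under these assumptions, going from one zone to two shrinks the chance of a full outage from 0.05% to 0.000025%.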

Cons:

  • affected by whole region going down

Pros:

  • possible to upgrade without downtime (multiple servers)
  • possible to adjust dynamically (for spike traffic)
  • possible to survive one or more availability zones going down

This combination is the sweet spot between reasonable reliability and cost.

(#4) Scale Out – Multiple Regions – Active/Passive Failover

Although it is very rare for an entire AWS region to go down, it does happen. Many enterprises want to replicate their databases across regions, so that when a catastrophe does occur and the primary region goes down, infrastructure can be quickly set up in another region. [3]

Such a setup requires the database to be synced across regions. The total time from endpoint failure to DNS failover is about 3 minutes, so a backup server can take over quickly and prevent a long outage.
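Where does a number like "about 3 minutes" come from? A rough model is: the time for the health checker to notice the failure, plus the time for cached DNS answers to expire. The sketch below uses illustrative values that I am assuming (a check every 30 seconds, 3 consecutive failures before failing over, a 60-second DNS TTL); your actual settings may differ, and the names `FailoverMonitor` and `failover_seconds` are hypothetical.

```python
class FailoverMonitor:
    """Triggers failover after N consecutive failed health checks."""

    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one health check; return True when failover should trigger."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def failover_seconds(check_interval, failure_threshold, dns_ttl):
    """Rough worst-case time from endpoint failure until clients see the new DNS answer."""
    detection = check_interval * failure_threshold
    return detection + dns_ttl


# Assumed settings, for illustration only.
estimate = failover_seconds(check_interval=30, failure_threshold=3, dns_ttl=60)
print(estimate / 60, "minutes")

monitor = FailoverMonitor(failure_threshold=3)
checks = [True, False, False, True, False, False, False]
triggered = [monitor.record(ok) for ok in checks]
print(triggered)
```

With these assumed values the estimate is 150 seconds, in the same ballpark as the figure above. A single successful check in the middle resets the counter, which is why a brief blip does not cause a failover.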

One way to cut costs is to use the passive setup as a staging area for testing prior to production rollout.

Cons:

  • partially affected by whole region going down
  • we need read replicas in a different region for disaster scenarios

Pros:

  • possible to upgrade without downtime (multiple servers)
  • possible to adjust dynamically (for spike traffic)
  • possible to survive whole region going down with little to no downtime

(#5) Scale Out – Multiple Regions – Active/Active Failover

When your servers handle lots of customers across multiple regions, it makes sense to keep both regions active. Under normal circumstances you might use Amazon Route 53 Latency Based Routing (LBR) or Weighted Round Robin (WRR) to distribute the load. In an emergency, when an entire region goes down, you shift the traffic over to a working region. Some users get slower responses, but that certainly beats complete downtime. The configuration is exactly the same as in #4 Active/Passive Failover, except that both regions are in use and the load is distributed between them at all times, not only when one region goes down.
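The two routing ideas can be sketched in a few lines. This is a simplified model of the decision only, not the Route 53 API; the function names, latency numbers and weights are all made up for illustration.

```python
import random


def pick_lowest_latency(latencies_ms, healthy):
    """Latency-based routing: choose the healthy region with the lowest latency."""
    candidates = {region: ms for region, ms in latencies_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


def pick_weighted(weights, healthy, rng=random):
    """Weighted routing: choose a healthy region at random, proportionally to its weight."""
    regions = [r for r in weights if r in healthy and weights[r] > 0]
    if not regions:
        raise RuntimeError("no healthy region available")
    return rng.choices(regions, weights=[weights[r] for r in regions], k=1)[0]


# Hypothetical measurements for one user, in milliseconds.
latencies = {"eu-west-1": 40, "us-east-1": 120}
weights = {"eu-west-1": 3, "us-east-1": 1}

both_up = {"eu-west-1", "us-east-1"}
print(pick_lowest_latency(latencies, both_up))      # the closer region wins

# Region failure: all traffic moves to the surviving region.
only_us = {"us-east-1"}
print(pick_lowest_latency(latencies, only_us))
print(pick_weighted(weights, only_us))
```

In both policies the failover behaviour falls out of the same rule: an unhealthy region is simply removed from the candidates, so the remaining region absorbs the traffic.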

Cons:

  • we need read replicas in different regions
  • we probably need a database master in each region

Pros:

  • possible to upgrade without downtime (multiple servers)
  • possible to adjust dynamically (for spike traffic)
  • should survive whole region going down without major issues
  • allows region-by-region rollout to test new releases in production

Common Concerns

For a big system, the database is always a major problem, so you do everything you can to take load off it:

  • Read Replicas
  • Caching of static and dynamic content
  • Splitting data based on regions (multiple masters depending on region)
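As an illustration of the caching item, here is a minimal cache-aside sketch: look in the cache first, and only hit the database on a miss or after the entry has expired. The `CacheAside` class and the `fake_db_lookup` function are hypothetical stand-ins; in a real setup the cache would typically be an external service rather than an in-process dictionary.

```python
import time


class CacheAside:
    """Serve reads from memory and fall back to the database on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        value = self._load(key)  # cache miss or expired entry
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call this after a write so readers do not see stale data.
        self._entries.pop(key, None)


db_calls = []


def fake_db_lookup(key):
    # Stand-in for a real (slow) database query.
    db_calls.append(key)
    return f"row-for-{key}"


cache = CacheAside(fake_db_lookup, ttl_seconds=60)
cache.get("user:1")
cache.get("user:1")
print(len(db_calls))  # prints 1: the second read was served from the cache
```

The trade-off is staleness: a longer TTL removes more load from the database but lets readers see older data unless writes invalidate the affected keys.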

Another good tip is to protect web servers from unnecessary load by using a CDN for static content delivery or streaming.

DDoS protection is another valid concern.

Conclusion

Congratulations on making it all the way here. If you just jumped here, shame on you, otherwise I hope you found this useful 🙂

If you are in search of an awesome RoR team, or you need help with setting up your project you can ping us here.

Sources used: