How Bizo survived the Great AWS Outage of 2011 relatively unscathed...

Donnie - 21 Apr 2011

The Twittersphere, tech blogs and even some business sites are abuzz with the news that the US East region of AWS has been experiencing a major outage. This outage has taken down some of the best-known names on the web. Bizo's infrastructure is 100% AWS and we support 1000s of publisher sites (including some very well-known business sites) doing billions of impressions a month. Sure, we had a few bruises early yesterday morning when the outage first began, but since then we've been operating our core, high-volume services on top of AWS without the East region.

Here is how we have remained up despite not having a single ops person on our engineering team:

Our services are well monitored

We rely on Pingdom for external verification of site availability on a worldwide basis. Additionally, we have our own internal alarms and dashboards that give us up-to-the-minute metrics such as request rate, CPU utilization, etc. Most of this data comes from AWS CloudWatch monitoring, but we also track error rates and have alarms set up to alert us when these rates change or go over a certain threshold.
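
To make the error-rate alarm idea concrete, here is a minimal sketch in Python. It is not our production code: the class name, the window size and the 5% threshold are all assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateAlarm:
    """Tracks the error rate over the most recent `window` requests.

    Hypothetical helper for illustration; window and threshold are assumed values.
    """

    def __init__(self, window=1000, threshold=0.05):
        # Each entry is True if the request failed, False if it succeeded.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, failed):
        """Record the outcome of one request."""
        self.outcomes.append(bool(failed))

    def error_rate(self):
        """Fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """True when the error rate is above the configured threshold."""
        return self.error_rate() > self.threshold
```

A request handler would call `record()` once per request, and a periodic job would check `should_alert()` and notify whoever is on call.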

Our services have circuit breakers between remote services that trip when other services become unavailable and we heavily cache data

When building our services, we always assume that remote services will fail at some point. We've spent a good deal of time minimizing the domino effect of a failing remote service. When a remote service becomes unavailable, the caller detects this and goes into tripped mode, occasionally retrying with backoffs. Of course, we also rely heavily on caching read-only data and are able to take advantage of the fact that the data needed for most of our services does not change very often.
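
The pattern described above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not our actual implementation: the `CircuitBreaker` class, its parameters and the exponential backoff schedule are all hypothetical, and `fallback` stands in for "serve the cached copy".

```python
import time


class CircuitBreaker:
    """Trips after repeated failures, then retries occasionally with backoff.

    Hypothetical sketch; the parameter names and defaults are assumptions.
    """

    def __init__(self, max_failures=3, base_delay=1.0, max_delay=60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0           # consecutive failures seen so far
        self.retry_at = 0.0         # earliest time a trial call is allowed

    @property
    def tripped(self):
        return self.failures >= self.max_failures

    def call(self, remote, fallback):
        """Call `remote()`; use `fallback()` (e.g. cached data) when it is down."""
        if self.tripped and self.clock() < self.retry_at:
            return fallback()       # fail fast, do not touch the remote service
        try:
            result = remote()
        except Exception:           # broad on purpose: any failure counts
            self.failures += 1
            if self.tripped:
                # Double the wait after every failed trial, up to max_delay.
                trips = min(self.failures - self.max_failures, 30)
                delay = min(self.max_delay, self.base_delay * 2 ** trips)
                self.retry_at = self.clock() + delay
            return fallback()
        self.failures = 0           # success closes the breaker again
        return result
```

A caller would wrap each remote lookup as `breaker.call(fetch_from_service, read_from_cache)`, so a dead dependency costs one cache read instead of a hung request.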

We utilize autoscaling

One of the promises of AWS is the ability to start and stop servers based on traffic and load. We've been using autoscaling since it was launched and it has worked like a charm. You can see the instances starting up based on the new load in the US West region as traffic was diverted over from US East.

(all times UTC)
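
For readers new to autoscaling, the core idea of sizing a fleet from load can be sketched as a simple proportional rule. This Python snippet is only an illustration: it is neither our scaling configuration nor the algorithm AWS uses, and the function name, the 60% CPU target and the size limits are assumptions.

```python
import math


def desired_instances(current, avg_cpu, target_cpu=60.0, min_size=2, max_size=20):
    """Return how many instances would bring average CPU back to the target.

    Hypothetical rule of thumb: scale the fleet in proportion to how far the
    observed average CPU (in percent) is from the target, then clamp the
    result to the allowed group size.
    """
    wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(min_size, min(max_size, wanted))
```

With these assumed defaults, a group of 4 instances running at 90% average CPU would grow to 6, which is the kind of step-up that happens when a second region's traffic suddenly arrives.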

Our architecture is designed to let us funnel traffic around an entire region if necessary

We utilize Global Load Balancing to direct traffic to the closest region based on the end user's location. For instance, if a user is in California, we direct their traffic to the US West region. This was extremely valuable in keeping us fully functioning in the face of a regional outage. When we finally decided that the US East region was going to cause major issues, switching all traffic to US West was as easy as clicking a few buttons. You can see how the requests transitioned over quickly after we made the decision. (By the way, a quick shout-out to Dynect, our GSLB service provider. Thanks!)

(all times UTC)
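
The routing decision itself is simple to sketch. The Python below is a toy model of "closest healthy region wins", not Dynect's API or our actual configuration: the location keys, region labels and preference lists are hypothetical.

```python
# Hypothetical preference lists: closest region first, then fallbacks.
REGION_PREFERENCE = {
    "california": ["us-west", "us-east"],
    "new-york": ["us-east", "us-west"],
}


def pick_region(user_location, healthy_regions, preferences=REGION_PREFERENCE):
    """Return the closest region that is currently marked healthy."""
    for region in preferences[user_location]:
        if region in healthy_regions:
            return region
    raise LookupError("no healthy region for " + user_location)
```

Taking a region out of rotation then amounts to removing it from `healthy_regions`: `pick_region("new-york", {"us-west"})` returns `"us-west"`, which mirrors what happened when we moved all traffic off US East.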

Bumps and Bruises

Of course, we didn't escape without some issues. We'll do another blog post on the issues we did run into, but they were relatively minor.

Conclusion

After 3 years running full time on AWS across 4 regions and 8 availability zones, we design our systems with the assumption that failure will happen, and that helped us come through this outage relatively unscathed.
