Here is how we have remained up despite not having a single ops person on our engineering team:
1) Our services are well monitored
We rely on Pingdom for external verification of site availability on a worldwide basis. Additionally, we have our own internal alarms and dashboards that give us up-to-the-minute metrics such as request rate and CPU utilization. Most of this data comes from AWS CloudWatch monitoring, but we also track error rates and have alarms set up to alert us when those rates change or exceed a certain threshold.
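To make the error-rate alerting concrete, here is a minimal sketch of a sliding-window alarm. The class name, window size, and threshold are all illustrative assumptions, not our actual alarm configuration (which lives in CloudWatch):

```python
from collections import deque

class ErrorRateAlarm:
    """Sliding-window error-rate alarm (illustrative sketch only;
    names and thresholds are assumptions, not our production config)."""

    def __init__(self, window=100, threshold=0.05):
        # Keep only the most recent `window` request outcomes.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        """Record one request outcome: True for success, False for error."""
        self.outcomes.append(ok)

    def triggered(self):
        """Fire when the error ratio over the window exceeds the threshold."""
        if not self.outcomes:
            return False
        errors = sum(1 for ok in self.outcomes if not ok)
        return errors / len(self.outcomes) > self.threshold
```

An alarm like this is evaluated continuously; when `triggered()` flips to true, a notification goes out to the on-call engineer.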
2) Our services have circuit breakers between remote services that trip when other services become unavailable and we heavily cache data
When building our services, we always assume that remote services will fail at some point. We've spent a good deal of time minimizing the domino effect of a failing remote service: when a remote service becomes unavailable, the caller detects this and goes into tripped mode, occasionally retrying with backoffs. Of course, we also rely heavily on caching read-only data, taking advantage of the fact that the data needed for most of our services does not change very often.
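The trip-and-retry behavior described above can be sketched as a simple circuit breaker. This is a hypothetical minimal version, not our actual implementation; the failure count and reset timeout are made-up parameters:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (illustrative, not our real code).

    After `max_failures` consecutive failures the breaker trips, and
    calls fail fast until `reset_timeout` seconds pass, at which point
    one trial call is allowed through (the "half-open" state)."""

    def __init__(self, call, max_failures=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.call = call
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Tripped: fail fast instead of hammering a dead service.
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: allow one trial call through.
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        # Success: close the breaker and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping every remote call this way means one dead dependency costs a fast local exception rather than a pile-up of hung threads.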
3) We utilize autoscaling
One of the promises of AWS is the ability to start and stop servers based on traffic and load. We've been using autoscaling since it launched, and it has worked like a charm. You can see the instances starting up in the US West region as traffic was diverted over from US East.
(all times UTC)
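The scaling behavior in the chart above can be sketched as a simple target-tracking rule: size the fleet so average CPU moves back toward a target. The target, minimum, and maximum below are illustrative assumptions, not our production policy:

```python
import math

def desired_capacity(current, avg_cpu, target_cpu=50.0,
                     min_size=2, max_size=20):
    """Target-tracking scaling sketch (parameters are illustrative).

    Returns the fleet size that would bring average CPU back to
    target_cpu, clamped to [min_size, max_size]."""
    if avg_cpu <= 0:
        # No load signal: keep the current size within bounds.
        return max(min_size, min(current, max_size))
    desired = math.ceil(current * avg_cpu / target_cpu)
    return max(min_size, min(desired, max_size))
```

When US East traffic landed on US West, average CPU spiked and a rule like this scaled the fleet out; as load normalized, it scaled back in.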
4) Our architecture is designed to let us funnel traffic around an entire region if necessary
We utilize Global Load Balancing to direct traffic to the closest region based on the end-user's location. For instance, if a user is in California, we direct their traffic to the US West region. This was extremely valuable in keeping us fully functioning in the face of a regional outage. When we finally decided that the US East region was going to cause major issues, switching all traffic to US West was as easy as clicking a few buttons. You can see how quickly the requests transitioned over after we made the decision. (By the way, quick shout-out to Dynect, our GSLB service provider. Thanks!)
(all times UTC)
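Conceptually, geo-based routing with failover works like the sketch below: each user location has an ordered preference of regions, and traffic goes to the first healthy one. The region names, health table, and function are all hypothetical, a model of the behavior rather than anything resembling Dynect's actual API:

```python
# Illustrative health table: US East marked unhealthy, as during the outage.
REGIONS = {
    "us-east": {"healthy": False},
    "us-west": {"healthy": True},
}

# Each location's regions, ordered by proximity (assumed for this sketch).
PREFERRED = {
    "california": ["us-west", "us-east"],
    "virginia": ["us-east", "us-west"],
}

def route(user_location):
    """Return the first healthy region in the user's preference order,
    mimicking geo-based GSLB with failover."""
    for region in PREFERRED.get(user_location, ["us-east", "us-west"]):
        if REGIONS[region]["healthy"]:
            return region
    raise RuntimeError("no healthy region available")
```

Flipping all traffic away from a region amounts to marking it unhealthy: every lookup then falls through to the next region on the list.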
Bumps and Bruises
Of course, we didn't escape without sustaining some issues. We'll do another blog post on the issues we did run into, but they were relatively minor.
After three years running full time on AWS across 4 regions and 8 availability zones, we design our systems with the assumption that failure will happen, and that mindset helped us come through this outage relatively unscathed.