This Is Why You Use Amazon
Amazon’s S3 storage service had an outage on February 15, 2008. The outage lasted about 2.5 hours. This is bad news. Any time you rely on a service provider for a key component of your website, having that component go down means that you are out of business. However, let’s look at Amazon’s description of what caused the problem:
“Here’s some additional detail about the problem we experienced earlier today.
Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.
Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.
As we said earlier today, though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
Sincerely,
The Amazon Web Services Team”
Let’s dissect this so you can understand why I am saying that you should still use Amazon. First, the outage was just 2.5 hours. Considering that it impacted a large portion of the S3 service, this is a very quick response time. It shows that Amazon has the manpower and resources in place to address serious problems. Second, they saw the problem coming and were already bringing additional capacity online before it impacted customers. Unfortunately, the capacity could not be brought online fast enough to keep customers from seeing the problem. Third, they posted notices for their customers quickly. Unfortunately, the notices were on the Developer forum, which is not a prominent enough place for customers to quickly check the status. Finally, they posted the details of their follow-up plan to keep this from happening again. It included improving monitoring so that they can discover problems even faster, increasing capacity in the authentication service that failed, improving defensive measures around the affected area (rate limiting?), and setting up a dashboard so customers can more easily see what is going on.
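To make the monitoring point concrete, here is a minimal sketch of the kind of check Amazon says it was missing: watching the share of authenticated requests rather than only the total volume. This is my own illustration, not Amazon’s implementation. The class name `AuthShareMonitor`, the window size, and the 50% threshold are all assumptions chosen for the example.

```python
from collections import deque


class AuthShareMonitor:
    """Track the share of authenticated requests over the last `window` requests.

    Hypothetical helper for illustration only; the window and threshold
    are made-up values, not anything Amazon has published.
    """

    def __init__(self, window=1000, max_share=0.5):
        self.recent = deque(maxlen=window)  # oldest entries drop off automatically
        self.max_share = max_share

    def record(self, authenticated):
        """Record one request; pass True if it was an authenticated call."""
        self.recent.append(bool(authenticated))

    def share(self):
        """Fraction of recent requests that were authenticated (0.0 if none seen)."""
        if not self.recent:
            return 0.0
        return sum(self.recent) / len(self.recent)

    def should_alert(self):
        """True when the authenticated share exceeds the configured threshold."""
        return self.share() > self.max_share


monitor = AuthShareMonitor(window=10, max_share=0.5)

# Ten unauthenticated requests: the window is full and the share is zero.
for _ in range(10):
    monitor.record(False)
quiet = monitor.should_alert()

# Eight authenticated requests replace the oldest entries. The window still
# holds ten requests, so a volume-only check would see nothing unusual.
for _ in range(8):
    monitor.record(True)
busy = monitor.should_alert()
```

In this toy run the number of requests in the window never changes, yet the alert flips once the mix shifts toward authenticated calls. That is the gap Amazon describes: the overall request volume looked normal while the expensive requests were crowding out everything else.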
Now let’s think of this in terms of you. How fast could your company have solved the problem, brought new capacity online, communicated with your customers, and come up with a follow-up plan? I can honestly say that we would have had a hard time matching this performance in-house. Just bringing unexpected capacity online quickly is a thorny problem. Amazon is doing what they are being paid for. I don’t mean just providing you with storage services and CPU capacity. I mean providing you with first-class support around those services. If you think you can do a better job, then leave Amazon. If not, as I expect the majority of customers will realize, then stay. If nothing else, this issue has brought to people’s attention that they need to think about disaster recovery and business continuity when doing their planning. Then again, they should have been thinking about this already. In the end, Amazon’s problem will make the smart companies stronger, because the smart companies will analyze where they could have done a better job and then execute.
- Doug Kersten