This Is Why You Use Amazon

February 20, 2008

Amazon's S3 storage service had an outage on February 15, 2008. The outage lasted about 2.5 hours. This is bad news. Any time you rely on a service provider for a key component of your website, having that component go down means that you are out of business. However, let’s look at Amazon’s description of what caused the problem:

“Here’s some additional detail about the problem we experienced earlier today.

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations.  While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests.  Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls.  The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place.  In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles.  This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST.  By 6:48am PST, we had moved enough capacity online to resolve the issue.

As we said earlier today, though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable.  As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements.  We are taking immediate action on the following:  (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls.  Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.

Sincerely,
The Amazon Web Services Team”

Link to Amazon’s response

Let’s dissect this so you can understand why I am saying that you should still use Amazon. First, the outage was just 2.5 hours. Considering that it took down all of S3 in one of Amazon's locations, that is a very quick recovery. It shows that Amazon has the manpower and resources in place to address serious problems. Second, they saw the problem coming and were already bringing additional capacity online before it impacted customers. Unfortunately, the capacity could not be brought online fast enough to keep customers from seeing the problem. Third, they posted notices for their customers quickly. Unfortunately, the notices went up on the Developer forum, which is not a prominent enough place for customers to quickly get a status. Finally, they posted the details of their follow-up plan to keep this from happening again. It included improving monitoring so that they can discover problems even faster, increasing capacity where the authentication service failed, improving defensive measures around the affected area (rate limiting?), and setting up a dashboard so customers can more easily see what is going on.
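To make the monitoring gap concrete: Amazon says it watched overall request volume but not the proportion of authenticated requests. Here is a rough Python sketch of what tracking that proportion could look like. It is only my illustration, not Amazon's implementation; the class name `AuthRatioMonitor`, the window size, and the alert threshold are all made up for the example.

```python
from collections import deque


class AuthRatioMonitor:
    """Track the share of authenticated requests over the last `window` requests.

    Hypothetical sketch: the name, window size, and threshold are assumptions.
    """

    def __init__(self, window=1000, threshold=0.5):
        self.window = window
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # True means an authenticated request
        self.auth_count = 0

    def record(self, authenticated):
        # If the window is full, the oldest entry is about to be evicted,
        # so remove it from the running count first.
        if len(self.recent) == self.window and self.recent[0]:
            self.auth_count -= 1
        self.recent.append(bool(authenticated))
        if authenticated:
            self.auth_count += 1

    def ratio(self):
        # Fraction of authenticated requests in the current window.
        return self.auth_count / len(self.recent) if self.recent else 0.0

    def should_alert(self):
        # Alert when the authenticated share exceeds the threshold,
        # even if total request volume looks normal.
        return self.ratio() > self.threshold
```

The point of the sketch is that total volume never appears in the alert condition. A traffic mix that shifts toward the more expensive authenticated calls would trip the alarm even when the overall request count stays in its normal range, which is the situation Amazon describes.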

Now let’s think of this in terms of you. How fast could your company have solved the problem, brought new capacity online, communicated with your customers, and come up with a follow-up plan? I can honestly say that we would have had a hard time matching this performance in-house. Just bringing unexpected capacity online quickly is a thorny problem. Amazon is doing what they are being paid for. I don’t mean just providing you storage services and CPU capacity. I mean providing you first-class support around those services. If you think you can do a better job, then leave Amazon. If not, as I expect the majority of customers will realize, then stay. This issue has brought to people’s attention that they need to think about disaster recovery and business continuity when doing their planning. Then again, they should have been thinking about this already. In the end, Amazon’s problem will make the smart companies stronger, because the smart companies will analyze where they could have done a better job and execute.

- Doug Kersten


How to make a customer angry!

January 15, 2008

Today I went to Network Solutions to look up some new domain names to register. I searched a bit and found one I liked. I then went to my hosting provider to register the name, since they have completely automated the process and all I have to do is click a button to add a domain. I have been doing this since the late '90s. Same process, and it always worked. This time it failed. The domain I wanted had been taken by someone else within a few seconds. Oh well, what can you do, I thought. Until I found out who took the domain name. It was Network Solutions. They had registered the name and now wanted me to register it with them, at over twice the price! It was a holdup! I have not been so angry at a company for a very long time. It stung even more because Network Solutions is such a big force in the domain name business. I could feel the cold, hard steel muzzle of the gun pressing into my back.

I called Network Solutions to express my dissatisfaction, and all they had for me was a lame excuse: someone else out there might take my name, so instead of letting the other person take it, Network Solutions took it. What a horrible argument! We don’t want someone else to hold you up, so we are going to do it instead. I have been going to Network Solutions for years. Even when I wasn’t buying anything, I felt that at least I was sending traffic their way. I have recommended to others that they go there too, and I am sure they bought domain names from Network Solutions.

In the end, Network Solutions agreed to release the name in four to six hours. Unfortunately, I no longer trust them, and if the name is immediately re-registered by someone, my belief is that that someone is Network Solutions or someone associated with them. The customer service rep on the phone did an excellent job with the company spin (she deserves a raise for spewing it so diligently), but even she actually said, “Don’t come back to Network Solutions and look for the name again after we release it or it will be registered again by us.”

Imagine that, a customer service rep telling a customer not to come back…amazing!

As long as this policy is in place, I won’t be going back to Network Solutions, nor will I be recommending the site to others when they ask me how to register a domain name. Network Solutions, can you spell the name of the company that will be taking your place in the near future? G O  D A D D Y!

- Doug K.