The Dueling Models of Cloud Computing
Until this past week, there's been a mostly silent war raging out there between two dueling architectural models of cloud computing applications: "design for failure" and traditional. This battle is about how we ultimately handle availability in the context of cloud computing.
The Amazon model is the "design for failure" model. Under the "design for failure" model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.
Most cloud providers follow some variant of the "design for failure" model. A handful of providers, however, follow the traditional model, in which the underlying infrastructure takes ultimate responsibility for availability. It doesn't matter how dumb your application is; the infrastructure will provide the redundancy necessary to keep it running in the face of failure. The clouds that tend to follow this model are vCloud-based clouds that leverage the capabilities of VMware to provide this level of infrastructural support.
The advantage of the traditional model is that any application can be deployed into it and assigned the level of redundancy appropriate to its function. The downside is that the traditional model is heavily constrained by geography. It would not have helped you survive this level of cloud provider (public or private) outage.
The advantage of the "design for failure" model is that the application developer has total control of their availability with only their data model and volume imposing geographical limitations. The downside of the "design for failure" model is that you must "design for failure" up front.
The rest of George Reese's post is as succinct on the AWS matter as anything I've read, and I'd advise you to read it. To make it even simpler: if it used to be on a physical system and you're installing it or P2Ving it into a VM, you're probably traditional. If you're developing and deploying it with Cloud Foundry, you're probably designing for failure.
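To make "the application takes responsibility for availability" a little more concrete, here is a minimal sketch of what that can look like at the smallest scale: the application itself walks a list of regions and fails over when one is down. Everything in it is an assumption for illustration. The region names, the `RegionDown` exception and the `east`/`west` functions are hypothetical stand-ins, not anything from Reese's post or from a real AWS client library, and a real design would also have to deal with data replication, which this ignores entirely.

```python
from typing import Callable, Sequence, Tuple


class RegionDown(Exception):
    """Hypothetical error raised when a whole region is unreachable."""


def call_with_failover(regions: Sequence[Tuple[str, Callable[[], str]]]) -> str:
    """Try each region in order and return the first successful response."""
    errors = []
    for name, fetch in regions:
        try:
            return fetch()
        except RegionDown as exc:
            # The application, not the infrastructure, decides what happens next.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(errors))


def east() -> str:
    # Stand-in for a region suffering a data-center-wide outage.
    raise RegionDown("simulated outage")


def west() -> str:
    # Stand-in for a healthy region.
    return "ok from us-west"


result = call_with_failover([("us-east", east), ("us-west", west)])
print(result)
```

In the traditional model none of this logic lives in your code; you are counting on the infrastructure to keep the one copy of your application running.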
I'd also ask that you gaze in horror at the following post on the AWS Forums, where a cardiac monitoring system was deployed into AWS and then blown offline when the outage occurred.
They've been unable to monitor the status of their at-risk cardiac patients for the past two days.
This is clearly a case where the service in question either assumed, or was told, that AWS was the traditional model at a cheaper price, and only found out otherwise on the 21st. That, or they were blinded by the race-to-the-bottom pricing they were getting and reasoned that if AWS was good enough for the big guns, who go from industry show to industry show prognosticating about how great it is and how they've "reinvented IT", then they wanted in.
While the criticality of the app is one thing, the fact that they had to go begging on the support forums is terrifying. There needs to be some upfront education here: if you don't have the engineering resources, development time and funding to design for failure, you need to understand that with AWS you're on your own.
It is clear there are no individual customers; there is only the system. And with the system we're back to what I said about the nines in my last post: you can either do the extra work for them or you can't.
And if you can't, expect downtime.
