Friday, October 26, 2012

A Case Study in Cloud Computing Dangers

"A cloud computing approach could save 50 to 67 percent of the lifecycle cost for a 1,000-server deployment." Kevin Jackson - Forbes.

It's not hard to understand why business executives are completely intoxicated by cloud computing. For the uninitiated, cloud computing essentially allows organizations to outsource just about any IT processing to a third party. If you need new servers, you can go to Amazon to quickly and cheaply procure server capacity that's available immediately. Sick of managing your internal email system? Go to Microsoft to get Exchange email, calendaring, instant messaging, and SharePoint with the click of a button. Want to gain access to enterprise-class back-office accounting and support systems? Check out Google Apps for Business and all of the add-ons that it makes available. An organization can get instant satisfaction by moving to the cloud while paying a small fraction of what it would cost to procure the equipment, software, and people to do it all internally.

Sounds great, right?  Look closer and you may not be so convinced.

This case study examines an outage of Microsoft's Office 365 cloud service in May 2012. I was one of the first people to detect the nature of the problem and was part of my company's response to it. My business partner, Jason Shropshire, wrote a good, brief synopsis in a blog posting during our response, but I believe that it's important to present a broader analysis: what we believed was happening as events unfolded, how we responded, the obstacles and challenges that we faced, and my understanding of how the problem was resolved. Perhaps it will aid business executives, especially those in a more vulnerable small business context, in making more informed decisions about their cloud deployments.

You won't find any news coverage of what we considered a Denial-of-Service condition, and I doubt that Microsoft would consider it to be an outage. But this study offers a cautionary tale to small business owners (and probably large business owners as well) about the dangers of using cloud services.

What would you do if you could no longer email your clients?


One of the inefficiencies of modern consulting is the need to communicate across multiple email platforms. At the time of the incident, I was serving as an IT program manager under a Department of Veterans Affairs (VA) subcontract. In what is an all-too-common operating environment, I maintained email communications across three different email system types: corporate, VA, and other corporate (multiple contractors). While we were required to follow the standard government practice of using only VA email for project communications, system constraints and the challenges of using government email systems while working remotely led us to generally use corporate email systems for "internal" team communications and then to copy our VA addresses when we wanted to send email to our VA clients.

As the incident unfolded, we interfaced with a variety of different platforms, email applications, and services throughout our troubleshooting and response activities. They included the following:
  • Email Applications: VA workstations typically had Microsoft Outlook 2007 on Windows XP; corporate workstations generally used Outlook 2007/2010 on Windows 7, though I used Outlook 2011 on Mac OS X 10.6.8 (Snow Leopard).
  • Email Systems: Microsoft Exchange, Microsoft Office 365, Google Gmail, Network Solutions nsMail Pro.
Beginning at around midday on Wednesday, May 9, the Denial-of-Service event discussed in this case study spanned nearly seven days, most of which we spent trying to convince the providers with direct control over the root causes to accept responsibility for correcting the related defects. This case study will span several blog postings, each roughly corresponding to a day of incident response. Each posting is meant to educate as well as inform, providing detail on our response activities and our evolving analysis of the technology and process defects that caused the incident.

Jump to Part 2: Incident Detection