Wednesday, January 16, 2013

Cloud Computing Dangers: Pointing the Finger

This posting is Part 4 of the Case Study in Cloud Computing Dangers.

All businesses face significant IT challenges, but they loom far larger for small businesses with limited resources with which to tackle them. Cloud computing in any form, be it Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), Software-as-a-Service (SaaS), or WhateverYouImagine-as-a-Service (WYIaaS), promises to level the playing field by providing small businesses a level of enterprise support that they couldn't possibly retain individually, all at a "low" regular subscription fee (at least lower than the alternative CapEx/OpEx values). With the level of support that a small business receives from a large organization such as Microsoft, the business should reasonably expect a much more available and resilient resource than it could provide for itself. Most business executives can easily see the benefits and are generally eager to sign up.

As someone who has run an IT operations group, I can tell you that IT people immediately blame the user when the user reports a problem. Perhaps it's driven by pride in the environment that they maintain or by some sense of self-preservation. For whatever reason, the user is wrong until proven right. You can see the results of this in large business help desks that immediately try to pass you off to an online "knowledge base" or threaten to "take away your computer" to examine the problem more deeply. If the problem is an outlier, then it is more likely related to the user than to the system or application. That culture of denial is reinforced in a cloud environment, where the service provider knows how to run the system far better than any individual user; if the provider doesn't detect a problem, then there is no problem.

But what happens when a small business user does detect a problem? To the customer support arm of a popular cloud service like Office 365, the small business user amounts to little more than a gnat; merely a minor annoyance to dispatch as quickly as possible. At best, since the services to which they subscribe are so complex, customer support will find some easy but unrelated problem to fix, tell the user that the problem has been corrected, and then request that they wait an hour or so for the fix to take hold. Support then closes the ticket and pretends the user never called, frequently ignoring any follow-up questions or requests. At worst, small business users can only submit help requests through an online forum and are threatened with premium charges for requesting assistance by phone.

It's only when the users begin to swarm that the organization will begin to take notice of the problem. That marks a major paradigm shift for most business owners and executives. In a traditional IT management scenario where the organization controls its systems and applications, when a user identifies a problem, a quick call to "the" IT guy or help desk line can usually get things moving. If that doesn't work, then a call from the line executive will usually do the trick. Small businesses are accustomed to being nimble in their responses to issues, depending on individual drive and motivation to push for solutions. Depending on others to make the argument that a problem is important is not in the small business DNA and represents a significant conflict point.

When I had the privilege of being responsible for the IT infrastructure of a growing late-stage financial services startup, other members of the executive leadership team expected me to respond quickly to any issues, often with frequent reports to the CFO, to whom I directly reported, and the CEO as to the progress made to resolve those issues. Following an email outage one Saturday night, my IT director and I were in the office overnight, personally struggling with the servers to diagnose and ultimately correct the cause of the outage. In the cloud era, that role would have been reduced to that of a mere observer rather than a direct participant.

In a recent meeting, I characterized the challenge for a small business infrastructure manager in this way.  When the CEO comes for a progress report on an internally-managed infrastructure issue, the manager may respond, "We've identified the root cause and are actively working with the vendor to determine how to correct the problem." The alternative response to a cloud infrastructure issue may be, "I've posted the incident on the vendor forum and the initial response didn't solve the problem, so I'm waiting for another response and trying to see if anyone else is having the same problem." Whereas the former scenario represents an active response that demonstrates the manager's ability to affect progress, the latter illustrates how helpless the manager is and potentially how irrelevant the position becomes in the cloud era.

While we were sure that there was a problem with Office 365, and we knew that we were not alone, Jason and I then faced the challenge of getting someone to listen to us. That proved to be a significant obstacle because of our inability to have a direct impact on the solution. By late morning on Thursday, May 10, Microsoft had yet to accept ownership of our problem as our service provider, claiming instead that it had been "solved." Senderbase.org was unresponsive to our requests to investigate the reputation problem, probably because we did not own the domain in question. And the owners of the recipient email domains rejected our help desk tickets as not their problem, pointing fingers back to Microsoft as our service provider. We realized a painful fact about cloud computing: as a small business, we had almost no standing to motivate anyone to fix anything.
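As an aside for readers facing a similar reputation problem: even without cooperation from a reputation service, a small business can independently check whether its outbound mail IP appears on a public DNS blocklist. The sketch below is a hypothetical helper, not something we used at the time; it assumes the `zen.spamhaus.org` zone purely as an example DNSBL and relies only on the Python standard library.

```python
import socket

def reverse_ip(ip: str) -> str:
    """Reverse the octets of an IPv4 address for a DNSBL query,
    e.g. '192.0.2.1' -> '1.2.0.192'."""
    return ".".join(reversed(ip.split(".")))

def dnsbl_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """Return True if the IP resolves inside the blocklist zone.

    DNSBLs answer with an A record when the IP is listed and
    NXDOMAIN (a socket.gaierror here) when it is not.
    """
    query = f"{reverse_ip(ip)}.{zone}"
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False
```

A listed sending IP would explain recipient domains rejecting mail regardless of anything the sender changes on its own servers, which is why ownership of the problem matters so much.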

Faced with a potentially maddening iterative cycle of pestering support personnel at four organizations, we decided to look at the problem from a different perspective. We needed to think about it from a "cloud" angle and leverage the cloud to promote a more effective response. Jason thought to crowd-source the problem via Twitter. By leveraging strategically sourced handles and hashtags, Jason began a public shaming campaign to get the attention of key players and identify new channels for raising the issue. I mined the Office 365 forums to find more folks who seemed to be having similar issues and asked them to move over to our thread to build mass. Then, together with the OP (original poster) of the forum thread that we had discovered earlier, we used the thread as a primary conversation area to increase response numbers and push the thread to the top of the forum's active discussion list.

All of those efforts took time, and Thursday passed without any positive response from any of the organizations directly involved. We were all frustrated by the lack of interaction from Microsoft as we entered Day 2 of being unable to communicate with our largest customers. Would Microsoft have missed us if we had left over its failure to deliver a service? Probably not. But late Thursday night, we received reports from one of the problem recipient domains that it was also having trouble communicating with another large government agency. We were on to something, but we still needed to find a way to build an argument for getting Microsoft to accept responsibility for resolving the issue.

Our efforts started to bear fruit on Friday, May 11. By late morning, we had successfully gathered five small businesses on the Office 365 forum that were suffering from similar outages, cross-referencing all of our help desk tickets in each of our individual requests to encourage Microsoft to begin treating it as a joint issue with the Office 365 email service. At 2:37 PM, the OP wrote a posting stating, "Microsoft is now acknowledging the problem" and that we should expect resolution within a few hours.

As we entered Day 3 of our outgoing mail Denial-of-Service, we finally had an indication that Microsoft had accepted ownership of the problem. It was one small step forward after several steps backward, but we tried to be confident that we had turned the corner.

Jump to Part 5: Seeking Problem Clarity