Friday, February 8, 2013

Cloud Computing Dangers: Blame When Things Go Wrong

This posting is Part 7 of the Case Study in Cloud Computing Dangers.

When technology problems occur, IT folks will typically focus first on finding a technical solution. It's in our nature because solving technical problems is what we've been trained to do. Waking up on Sunday, May 13 to find ourselves still suffering from an outgoing mail Denial-of-Service on our Office 365 business platform, we were in disbelief that the technical problem still had not been solved. Our challenge was to move past our confidence in understanding the problem's technical nature and to recognize that we were falling victim to a broader issue of being unable to assign responsibility in a massively distributed communications system.

Prior to subscribing to a cloud service, organizations need to understand that quality of service is dependent on a variety of factors that are outside of the organization's and the cloud service provider's direct control. In this case, those factors include network services such as the Domain Name System (DNS), which translates a domain (e.g., example.com) to a numeric address (e.g., 192.168.XXX.XXX), mail exchange (MX) records for email communications, and the gateways of other organizations that receive email. If you were to examine the problem from a Microsoft perspective, you would likely find that the Office 365 outgoing mail gateway had no technical defect that prevented it from carrying out its assigned function. It correctly distributed email as the mail server directed, and the recipient email gateway received the email as expected.

Unless there was indeed an abnormal spike in traffic on the outlook.com mail gateway that transfers outgoing mail for Office 365 customers, Microsoft had actually done everything right. From a technical perspective, the problem rested in the hands of the recipient gateways that were rejecting the email messages. But they, too, were functioning normally, simply adhering to predefined policies that allowed the recipient to receive our email messages one day and rejected them the next. Their policy enforcement mechanisms were based, at least in part, on the reputation score received from senderbase.org. Since senderbase.org likely generated the outlook.com reputation score using an industry-accepted automated algorithm, it was probably functioning normally as well. I have the sense that if M.C. Escher were to redraw his famous "Relativity" lithograph today, he could base it on modern IT customer support, where following each successive troubleshooting step simply leads you back to your initial examination point.
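To make that last point concrete, here is a minimal sketch of how a recipient gateway might apply a reputation-based policy. Everything in it is an assumption made for illustration: the function name, the score scale, the threshold, and the host name are hypothetical and are not taken from any real gateway product or from senderbase.org itself. The point is only that a gateway can behave exactly as configured and still accept a sender one day and reject it the next when an externally supplied score changes.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical cutoff for illustration only; real gateways and reputation
# services define their own scales and thresholds.
REJECT_BELOW = -3.0


@dataclass
class Verdict:
    accepted: bool
    reason: str


def evaluate_sender(sending_host: str,
                    reputation: Mapping[str, float],
                    reject_below: float = REJECT_BELOW) -> Verdict:
    """Accept or reject based solely on the sending host's reputation score."""
    score = reputation.get(sending_host)
    if score is None:
        # Assumed default: hosts without a score are let through.
        return Verdict(True, "no score on file; accepted by default")
    if score < reject_below:
        return Verdict(False, f"score {score} is below threshold {reject_below}")
    return Verdict(True, f"score {score} meets threshold {reject_below}")


# Same sender, same gateway policy, two consecutive days. Only the score
# supplied by the (hypothetical) reputation feed has changed.
day_one = {"mail.example.com": 1.5}
day_two = {"mail.example.com": -4.0}

for label, feed in (("day one", day_one), ("day two", day_two)):
    verdict = evaluate_sender("mail.example.com", feed)
    status = "accepted" if verdict.accepted else "rejected"
    print(f"{label}: {status} ({verdict.reason})")
```

Nothing in this sketch is broken on any single hop: the sender transmits correctly, the policy runs as written, and the score is computed by whatever algorithm feeds it. The rejection comes from how those pieces interact.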

Like a child who goes silent when accused of wrongdoing, a natural IT response to any technical problem is to disavow the problem completely. If all of the technology was working normally, then there's nothing left to do but point the finger at someone else. Several years ago, when working on a very large back office system integration effort, my security engineering team was tasked with developing a single sign-on solution that would span eight different commercial software installations. While the solution was one of very few successes on that project, it also became a target any time a developer found a problem with their code. I lost count of how many times my team was told to "turn off that security stuff because it's breaking my application." Of course, turning off the security didn't correct the problem, but the example illustrates how IT-oriented people will first look for blame outside of their own areas of influence before looking internally for an answer.

In the information security space, we recognize that one of the most vulnerable points in any system is where one component interacts with another. The same is true in massively distributed networks, the space in which cloud computing services operate. When a problem occurs at an intersection point, the finger-pointing comes from both sides of the intersection, with each side blaming the other. In our case, the problem resided at one of those service intersection points, specifically where an email message transmitted by one party was received by the other party. If it was transmitted correctly, then the problem must be in receipt. If it was received properly, then the problem must be in transmission. There are no winners in this dispute, only losers in the form of the parties subscribing to the opposing services.

The fact that no single technical failure occurred should probably serve to distribute blame rather than point it in any one direction. The most direct cause, the reputation score assigned by senderbase.org, needed to be corrected. When senderbase.org wouldn't respond to our inquiry, we discovered through more digging that the organization is really just a front for Cisco Systems, one that feeds the reputation scoring for all Cisco email appliances across the Internet. By throwing yet another big company into the mix, we simply managed to dig a deeper hole, one that a small business like ours lacks the connections to dig out of.

When dealing with cloud services, it's important for organizations to understand that even when everything is working the way it's supposed to, it doesn't necessarily mean that everyone is on the same page. This incident wasn't caused by a technology failure or an outage that makes it easy to assign blame, but by the fact that no one had either the motivation or the desire to work together to address the intersection problem. Even after you've correctly determined whom to blame, you still have to convince that organization to accept it.

My argument is that when subscribing to any cloud service, the service provider should accept responsibility for fixing any problem that occurs behind the service boundary. For example, if a service provider were to leverage a third-party single sign-on function that enabled an organization to access subscribed services and that function were to become unavailable, the provider should be responsible for engaging with the function owner to fix the problem, not the organization subscribing to the cloud services. Similarly, since Microsoft takes control of all email functions once it receives an outgoing email from a subscriber, it should accept responsibility for working with Cisco to solve our outgoing mail denial-of-service condition. Not only did our organization, like the others suffering from the outage, lack any standing to engage Cisco directly, we also had no ownership over the part of the email distribution process that was failing. As I stated on Sunday afternoon, May 13, "Our business is with Microsoft and the service that Microsoft is providing…is currently at risk…It is Microsoft's responsibility as our service provider to represent our interests in dealing with [Cisco]."

We moved into Day 5 of this outgoing mail Denial-of-Service with word only that Microsoft was "all hands on deck" with visibility at least to the VP level. My contact at Office 365 echoed that statement and added that even mentioning the issue to someone involved caused a ruckus, which led him to back off. Despite feeling a little bad that my pal may have gotten a little burned, I suppose that I took some solace in the fact that we weren't the only people with a fire lit under us.

Update: I originally attributed the reputation scoring to sendmail.org due to an extended inability to correctly transcribe my notes to the keyboard. The references have since been corrected to senderbase.org as the reputation scoring authority that Cisco devices use by default.

Jump to Part 8: A Case of the Mondays.