Sunday, January 13, 2013

Cloud Computing Dangers: Establishing Responsibility

This posting is Part 3 of the Case Study in Cloud Computing Dangers.

At around 4:30 PM on Wednesday, May 9, I was preparing to make the trek from my VA site location near DC's Union Station to my home in Fairfax City, VA. For anyone who isn't well versed in the journey, understand that it is something that you really need to psych yourself up for. It wasn't uncommon for me to lose 90 minutes of my life making just the one-way trip over the course of just 17 miles. Doing the math, I could travel at just a little over 11 miles per hour, covering a mile in perhaps 5 minutes. Knowing that you will never get that time back, that most of the time you'll be staring at dozens or hundreds of taillights, that you could probably cover the distance faster by bike if you didn't have to wear a suit, is an excruciating fall from innocence that I would promote as the contemporary definition of madness. You have to develop a dissonant optimism to keep from just barreling through a crowded street in a moment of temporary relief. "Maybe it won't be that bad today." "My kids will thank me some day for working so hard." "I'll be able to make soccer practice…no problem."

Jason and I both knew how critical our email communications were for maintaining business continuity. As a small business with fewer than a dozen revenue-producing employees, our position was tenuous and depended on the perception of always being present, available, and responsive. This problem had cut off our communications with our two largest revenue generators, representing over half of our active business, and with a contractor with which we were working on several proposals. We had to solve the problem, and fast. It seemed obvious to me that I should just break out my iPad and troubleshoot while navigating DC/Northern VA traffic. When Jason realized what I was doing, he simply cautioned, "Please don't kill yourself over this." At least I was able to justify not riding a bike to the office for another day.

Using my iPad, I was able to remotely access my VA email account over a Cisco remote access app while Jason proceeded to try different variations of email message delivery tests. We ran ping tests, traceroute routines, back-and-forth email submissions, all to no avail. Nothing seemed to work, and nothing beyond my now multiple email delivery delay messages seemed to give us any indication of what was wrong. I supported all of the tests with nary a horn blown in my general direction.

At 5:43 PM, Jason reported back to the enthusiastic Microsoft customer support representative that the issue was persisting into the evening, pleading with him to help us identify and resolve the delivery failure.

Anyone can be a good IT engineer, but only a select few have the experience, general knowledge, and imagination to diagnose really difficult problems. I've met two people who excel in all of those areas. Jason is one of them. So, the fact that he was struggling with this problem increased my concern.

At 10:05 PM, a new discussion thread popped up over at the Office 365 forums regarding delivery failures (the title has since been modified by the original poster (OP)). Jason discovered it the next day. Microsoft provided absolutely no tangible support to the OP, but at least it was clear that the problem was not isolated to our company's email domain.

By Thursday morning, Jason was only making limited progress with Microsoft customer support. He reported by late morning that Microsoft had been conducting email traces and could confirm that the email messages were getting properly delivered to both email domains that we had reported issues with. While the customer support rep concurred with our assertion that our messages were likely being blocked by the recipient email gateways as SPAM, he pointed back to us as being ultimately responsible for getting the problem fixed.

Let's briefly examine how a hosted email service solution functions before discussing the impact of the Microsoft response. When a company contracts with a cloud email service provider, be it Microsoft, Google, Network Solutions, or anyone else, it is basically contracting to allow the service provider to intercept all email destined for the company. MX records redirect all email traffic to the service provider's email gateways and the company essentially loses all control over the email traffic flow. So, if the company receives email at youraddress@companyname.com, then the provider accepts possession over @companyname.com network traffic. The only element of control that the company retains (if it knows enough to retain it) is the MX record. That's it, nothing else.
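To make the role of the MX record concrete, here is a minimal sketch of how a sending mail server decides where to deliver a message once it has looked up a domain's MX records. The hostnames and preference values are hypothetical, made up for illustration; they are not Microsoft's actual gateways. The only rule assumed is the standard one: the record with the lowest preference number is tried first.

```python
from typing import List, NamedTuple


class MXRecord(NamedTuple):
    preference: int  # lower value means "try this gateway first"
    exchange: str    # hostname of the mail gateway that accepts the mail


def delivery_order(records: List[MXRecord]) -> List[str]:
    """Return gateway hostnames in the order a sender would try them."""
    # sorted() is stable, so records with equal preference keep their
    # listed order here; real senders typically randomize among ties.
    return [r.exchange for r in sorted(records, key=lambda r: r.preference)]


# Hypothetical MX records for companyname.com after moving to a hosted provider.
records = [
    MXRecord(20, "backup.mail.provider.example"),
    MXRecord(10, "primary.mail.provider.example"),
]

for host in delivery_order(records):
    print(host)
```

Every host in that list belongs to the provider, which is the point: once the MX records are set, nothing addressed to @companyname.com touches a server the company operates.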

The Microsoft representative's response demonstrated one of our primary lessons learned: As a small company procuring a popular cloud service, any problem is our problem until we can prove otherwise. In this case, he assigned our company the responsibility to reach out to the owners of the recipient email gateways and ask them to "whitelist" our company's email domain.

Unless you understand modern IT management, you may not completely appreciate the impact of Microsoft's response. What the customer support representative expected us to do was call up the organizations and explain that their systems are broken and we would like them to make a custom change to the system, just for us. It would only entail a simple operation of adding our email domain to a text file on each of the target email servers. But, with the wide disparity in email domains with which an organization communicates on an hourly basis, no one whitelists. No one.

Missing from the discussion was why we needed to request a whitelisting. Some condition on the Internet had to have changed in a moment to suddenly kill email delivery from our company to two separate email domains. There was just too much coincidence here to accept that both systems had the same problem at the same time. Jason and I were also skeptical that those two recipient gateways were the only ones we had problems with, thinking instead that they just happened to be the only gateways that we were aware of. 

We continued to convince ourselves that our company wasn't alone in dealing with this problem and started scouring the Internet for any evidence that this wasn't just isolated to our company email. At around 12:45 PM, Jason had come across the Office 365 forum posting and established a private conversation with the OP. Finding each other was like throwing gasoline on a fire: we were all charged up to know that we weren't alone and frustrated at Microsoft's response to our joint problems. We started sharing the details of what Microsoft was telling us and what we were finding out on our own.

It was the forum OP who had discovered an important piece of information using a Message Trace function available to Office 365 administrators. After determining that the messages were indeed getting received by his target recipient (one that differed from our two problem recipients), he contacted the recipient owner to get assistance and whitelist his domain (following the same advice that Microsoft provided to us). The recipient refused and directed the OP to an organization called senderbase.org. It was that knowledge that enabled me to start developing the systems architecture that would then guide us in our response. I live for this stuff and the OP had just given me the drug I needed to be hopelessly hooked.

It's here that we really begin to dig into the underbelly of email communications. Sending an email can involve many steps to be successful. Once it is sent to the recipient from the sender's email server, the recipient may channel each message through a variety of checks to filter the message for malicious software and SPAM. Whereas malicious software detection generally depends on the recipient server being able to scan the message for a "signature" that marks something about the message as bad, SPAM detection depends almost entirely on some scoring algorithm. That algorithm is typically driven by some external organization, with the server owner determining what threshold of message to allow. In the case of senderbase.org, the higher the independent score on a -7 to 7 point scale, the more trustworthy the sender is determined to be and the more likely the recipient server will allow the message to proceed. Senderbase.org also uses ranges of scores to develop a macro characterization of the sender's reputation, ranging from Low through Neutral to Good. Email server owners may use whatever setting they choose as the threshold for allowing email to pass as legitimate, but many filter out any message that has a score lower than "Good."
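The threshold decision described above can be sketched in a few lines. This is only an illustration of the idea, not senderbase.org's actual algorithm: the cutoffs separating Low, Neutral, and Good (here -1.0 and 1.0) are assumptions invented for the example, as are the function names.

```python
# Ordering of the reputation bands, from least to most trusted.
BAND_RANK = {"Low": 0, "Neutral": 1, "Good": 2}


def reputation_band(score: float, low_below: float = -1.0,
                    good_from: float = 1.0) -> str:
    """Map a score on the -7 to 7 scale to a band.

    The default cutoffs are hypothetical, chosen only for illustration.
    """
    if not -7.0 <= score <= 7.0:
        raise ValueError("score must be between -7 and 7")
    if score < low_below:
        return "Low"
    if score >= good_from:
        return "Good"
    return "Neutral"


def accepts(score: float, minimum_band: str = "Good") -> bool:
    """Decide whether a recipient gateway lets the message through."""
    return BAND_RANK[reputation_band(score)] >= BAND_RANK[minimum_band]


# A sending gateway whose score slips from Good into Neutral territory.
for score in (3.5, 0.2):
    print(score, reputation_band(score), accepts(score))
```

With the recipient's threshold set to "Good", a sender whose score slides into the Neutral band is rejected even though nothing about the individual message has changed.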

The OP had discovered through senderbase.org that the reputation of several Microsoft outgoing mail gateways had dropped into a "Neutral" state. Suddenly, everything made complete sense, at least to us. Something had occurred to cause a significant reputation shift on Wednesday that could have an adverse impact on every email server incorporating senderbase.org scoring into its SPAM filtering algorithm.

Jason and I immediately went into action. Jason contacted senderbase.org and I put in help desk tickets for the two recipient domains that we were unable to communicate with. Now that we knew what the problem was, surely we would be able to rally everyone to help fix the problem. But, perhaps, I shouldn't call you Shirley.

Jump to Part 4: Pointing the Finger