Friday, October 26, 2012

Cloud Computing Dangers: Incident Detection

This posting is Part 2 of the Case Study in Cloud Computing Dangers.

At around noon U.S. Eastern Daylight Time (EDT) on Wednesday, May 9, I forwarded a calendar invite from my corporate account to my VA address. The message included some important attachments that originated with a prime-contractor colleague. I also responded to several email messages from the same colleague, sending mail to both his corporate and his VA accounts. Everything that had seemed to work fine a few minutes earlier was about to blow up in my face.

After several minutes, I noticed that my first message had not arrived in my VA inbox. This wasn't a surprise, since the VA email gateway regularly sends email messages with attachments into the proverbial "black hole." I assumed that my first message had met that fate, but I also understood that there are many ways in which message delivery can fail: message size (the attachments were too large), message type (some filename extensions may be blocked), VA gateway rejection (something about the message caused the VA email system to block it), or a corporate email transfer issue (some point between my email application and our corporate server had a problem). In the last two cases, I would generally expect some feedback, either a "bounce" (an email message rejection notice) or an error shown in my email application, Outlook 2011 for Mac. Since I hadn't received any feedback, I decided to tackle the first two scenarios as the most likely. So, I distributed the attachments across multiple messages and sent each from my corporate account to my VA account individually. If any one made it through, I could rest easy that everything was working as expected and that I had simply sent a non-conforming message. Otherwise, I could be fairly sure that the problem was outside of my control.
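
To make the distinction concrete, here is a minimal sketch, in Python, of what submitting a test message looks like from the sending side. The host names, addresses, and credentials are placeholders I've invented for illustration; the point is that a clean return from the outgoing server only means the message was accepted for queuing, which is why a failure further down the line produces no immediate error in the email application.

    import smtplib
    from email.message import EmailMessage

    # Hypothetical sender, recipient, and submission host -- placeholders only.
    msg = EmailMessage()
    msg["From"] = "me@corporate.example"
    msg["To"] = "me@va.example"
    msg["Subject"] = "Delivery test (no attachments)"
    msg.set_content("If this arrives, the base path works; the attachments are the variable.")

    try:
        with smtplib.SMTP("smtp.corporate.example", 587) as smtp:
            smtp.starttls()
            smtp.login("me@corporate.example", "placeholder-password")
            smtp.send_message(msg)
            # Acceptance here means "queued by the outgoing server," not "delivered."
            print("Accepted by the outgoing server; delivery status still unknown.")
    except smtplib.SMTPRecipientsRefused as err:
        print("Rejected at submission time:", err.recipients)
    except smtplib.SMTPException as err:
        print("Submission failed:", err)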

On a hunch, I logged into the web application interface for our cloud email service, Office 365. My troubleshooting pointed to the root problem being somewhere between the email application layer and the recipient email system gateway. To send email from an application on any given computer, most users have to configure settings for both the incoming server (the computer system that stores your email) and the outgoing server (the computer system that accepts and transmits the email messages you send). Since my combination of Office 365, Outlook, and Mac OS X had frequent communications errors (go figure, Microsoft and Apple products/services not working well together), I hoped that I was having a communication problem between my computer and the server. I could access the web application to see whether my failed messages had landed in the "Sent" folder. I reasoned that if the folder did not contain the messages, then the communication between my computer and Office 365 was deficient, a defect that I would be able to repair. Alas, all of the messages that I had sent were sitting in the "Sent" mail folder, and I still had no feedback regarding a problem.
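
The same check can be scripted against the incoming server rather than the web interface. The sketch below assumes Office 365's standard IMAP endpoint and folder naming, along with placeholder credentials, and looks for the test message in the server-side "Sent Items" folder to confirm that the client-to-server leg worked.

    import imaplib

    # Office 365 commonly exposes IMAP at outlook.office365.com and names the
    # folder "Sent Items"; both are assumptions here, as are the credentials.
    with imaplib.IMAP4_SSL("outlook.office365.com") as imap:
        imap.login("me@corporate.example", "placeholder-password")
        imap.select('"Sent Items"', readonly=True)
        status, data = imap.search(None, 'SUBJECT', '"Delivery test"')
        if status == "OK" and data[0]:
            print("Message is in Sent Items: the client-to-server leg worked.")
        else:
            print("Message never reached the server: the problem is between Outlook and Office 365.")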

By 1:30 PM, my VA account was still not receiving any email messages that I sent, with or without attachments, but I was able to send and receive email between my corporate account and Gmail, and between my VA account and Gmail. That troubleshooting step confirmed that both email systems were functional. After having some conversations with a colleague also using our corporate email system, I confirmed that the problem extended to other users of our corporate system. I checked in with colleagues using other corporate email systems and found that they seemed to be communicating normally but were also not receiving email from our corporate system. After some more troubleshooting, we determined that their email messages came through normally into our system, so the problem was one-way: from our corporate system to some others.

At 2:40 PM, I reported the problem to our Office 365 leads. My business partner, Jason Shropshire, jumped on it and began checking into the problem himself.

At 3:29 PM, I received the first of what would be many email delivery error messages. The message was flagged as "delayed delivery" rather than an outright failure. That would explain why I hadn't received any feedback earlier. Email systems are usually configured to keep attempting to deliver a message when the first attempt fails, as long as the sending system appears to be functioning normally and the message isn't outright rejected by the recipient gateway. So, our system was continuing to attempt delivery but receiving no acknowledgement from the VA or other contractor email systems. While the event itself isn't uncommon (email systems are remarkably error prone), the fact that it was happening when communicating with at least two systems gave me pause.
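
That retry behavior works roughly like the sketch below, which is a conceptual illustration rather than any vendor's actual implementation: the sending server queues the message, retries on temporary failures, warns the sender only after a delay threshold, and bounces only after the retry window expires. The intervals and thresholds are assumed values.

    import time

    RETRY_INTERVALS = [15 * 60, 30 * 60, 60 * 60, 4 * 3600]  # assumed schedule, in seconds
    DELAY_WARNING_AFTER = 4 * 3600    # warn the sender after roughly four hours
    GIVE_UP_AFTER = 48 * 3600         # return a hard bounce after roughly two days

    def deliver_with_retries(attempt_delivery, notify_sender_of_delay, started_at):
        """attempt_delivery() returns True on success; temporary failures are retried."""
        for interval in RETRY_INTERVALS:
            if attempt_delivery():
                return "delivered"
            elapsed = time.time() - started_at
            if elapsed >= GIVE_UP_AFTER:
                return "bounced"              # permanent failure notice to the sender
            if elapsed >= DELAY_WARNING_AFTER:
                notify_sender_of_delay()      # the "delayed delivery" notice I received
            time.sleep(interval)              # wait before trying again
        return "still queued"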

I had two suspicions at this point. My first was that our email messages were being blocked at the recipient gateways as SPAM. Given that ours was a small company with limited resources, I had always feared that we could be subject to an attack that would drop our reputation score. But, after some checks online, I didn't see any issues with our reputation, so I moved on to my second suspicion: a problem with our email routing.
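
For reference, one common form of that check is to query a few public DNS blocklists (DNSBLs) for the outgoing server's address. The sketch below uses the dnspython package; the IP address is a placeholder and the blocklist zones are just well-known examples. A listing normally resolves to an address, while an NXDOMAIN answer means the address is not listed.

    import dns.resolver  # pip install dnspython

    MAIL_SERVER_IP = "203.0.113.25"                      # placeholder outgoing address
    BLOCKLISTS = ["zen.spamhaus.org", "bl.spamcop.net"]  # example DNSBL zones

    reversed_ip = ".".join(reversed(MAIL_SERVER_IP.split(".")))
    for zone in BLOCKLISTS:
        query_name = f"{reversed_ip}.{zone}"
        try:
            answer = dns.resolver.resolve(query_name, "A")  # dnspython >= 2.0; older versions use query()
            codes = ", ".join(rdata.to_text() for rdata in answer)
            print(f"{zone}: LISTED ({codes})")
        except dns.resolver.NXDOMAIN:
            print(f"{zone}: not listed")
        except dns.resolver.NoAnswer:
            print(f"{zone}: no answer (treat as not listed)")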

Email routing is a lot more complex than most people think. You have to have a pretty good idea of how the Internet works to really understand it. I would not have considered routing to be a problem had Jason and I not encountered a similar but unrelated client issue in 2011. In that instance, our client, a small business with very limited IT capabilities, started having critical email delivery issues on a Monday. After working through the standard troubleshooting steps, the client informed us that its telecommunications service provider had been in the office the prior Friday afternoon to perform "routine maintenance." That routine maintenance turned out to be a complete replacement of the network equipment and a change of the Internet address for the company's service.

While standard consumer Internet service is configured to regularly cycle Internet addresses, business service usually includes a dedicated set of unchanging (static) addresses. For any business, changing the Internet address can have dire consequences that should be prepared for in advance. In some cases, the business may have direct point-to-point communications configured with other offices or customers that depend on maintaining a permanent address.

Our client company was maintaining an email system on the local premises. To receive email, some agent of the company had properly registered the company's Internet address as part of the Mail Exchanger (MX) record for the company's domain. The MX record propagates around the Internet to tell any Internet-connected email system how to deliver email to the company. Once the service provider changed the company's Internet address, it effectively caused all email going to the company to hit a wall and bounce back to the original sender. That was a critical error on the part of the service provider.
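
The lookup that every sending server performs is easy to reproduce. The sketch below, again using dnspython and a placeholder domain, asks DNS for the domain's MX records, which name the hosts, in priority order, that should receive the domain's mail.

    import dns.resolver  # pip install dnspython

    domain = "client-company.example"              # placeholder domain
    answers = dns.resolver.resolve(domain, "MX")   # dnspython >= 2.0; older versions use query()
    for rdata in sorted(answers, key=lambda r: r.preference):
        print(f"priority {rdata.preference}: deliver mail for {domain} to {rdata.exchange}")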

Once we were able to correct the MX record, the company continued to have some problems with mail disappearing without a trace. After digging much deeper, we discovered that the email server had begun marking messages as SPAM that it had allowed in the past, and partners' systems were doing the same to messages the company sent to them. We suspected that changing the Internet address of the company's email server caused SPAM filters to drastically change how they scored messages originating from, and received by, our client's server. That condition remained until we migrated the client to Office 365.

Given the behavior that Jason and I were seeing in our mail message delivery, we keyed into a recent event that may have changed the way SPAM filters scored our company's email. Our company had been a customer of Microsoft's earlier hosted email solution, Business Productivity Online Standard Suite (BPOS), and Microsoft had forced us to migrate from BPOS to Office 365. That migration happened in stages and, from an interface perspective, had already occurred several months earlier. When I started to detect email issues, Jason and I both recalled that the final stage of the migration was to have happened within the prior week. That gave Jason an avenue for further technical analysis.

Jason reported the issue to the Microsoft help desk and determined that there had been some issues with our company's MX record. They weren't issues that should have caused a problem, but it seemed like a good idea to fix them. So, he did, and Microsoft explained that it would take an hour for the fix to propagate across the Internet. At 4:34 PM, a representative from Microsoft Customer Support cheerfully announced that the problem had been resolved. He was wrong.
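
Watching a change like that "propagate" can be done by asking several public resolvers for the record and comparing their answers and remaining cache lifetimes (TTLs). The sketch below uses dnspython, a placeholder domain, and two public resolver addresses (Google Public DNS and OpenDNS); while caches still hold the old record, the resolvers will disagree.

    import dns.resolver  # pip install dnspython

    domain = "ourcompany.example"   # placeholder domain
    public_resolvers = {"Google Public DNS": "8.8.8.8", "OpenDNS": "208.67.222.222"}

    for name, address in public_resolvers.items():
        resolver = dns.resolver.Resolver()
        resolver.nameservers = [address]
        try:
            answer = resolver.resolve(domain, "MX")   # dnspython >= 2.0; older versions use query()
            records = ", ".join(r.exchange.to_text() for r in answer)
            print(f"{name} ({address}): {records} (remaining TTL {answer.rrset.ttl}s)")
        except Exception as err:
            print(f"{name} ({address}): lookup failed: {err}")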

The problem would take the better part of the next week to resolve. During that time, Jason and I would ultimately dedicate much of our time to analyzing, examining, prodding, and pushing for a resolution to a problem that no one was willing to accept responsibility for. We, and the company that we represented, had become indirect victims of the cloud computing revolution, an event that marks a nexus between the ability to correct defects yourself and the need to depend entirely on organizations that have no incentive to correct them. It was a humbling experience.

Jump to Part 3: Establishing Responsibility.
