Friday, February 1, 2013

Cloud Computing Dangers: False Hope

This posting is Part 6 of the Case Study in Cloud Computing Dangers.

On Saturday, May 12, as my company continued to suffer from an Office 365 outgoing mail denial-of-service, I woke up to an email that a colleague had sent me from the primary contractor with which we had been unable to communicate. A test message that I had sent at 3:33 PM on Thursday, May 10 had been received at 2:24 AM on Saturday. Despite a transit time of nearly 35 hours, I was elated to discover that a message had gotten through. Perhaps Microsoft was true to its word and we could expect to have the problem resolved soon so that we could move on with our lives. Or perhaps it was just a fluke, since I hadn't seen any other messages get through.

As a couple more hours of unproductive waiting increased my desperation, I realized that the business process problem that we were falling victim to was likely formidable and had a high potential for being repeated. On the Office 365 forum that we were using to coordinate efforts with other small business customers, I posted my belief that spam reputation scoring may represent a single point of failure for general email distribution, one that may span email vendors. I stated, "If someone at a higher architecture level isn't made aware of the problem, I'm willing to bet that we'll continue to see it happen and small organizations such as ours will continue to suffer through multi-day denial of communications services situations."

Based on that assertion and my decreasing confidence that Microsoft's Office 365 customer support staff would be able to solve the problem without a little more of a nudge, I reached out to an old fraternity roommate who happened to have just moved over to Office 365 to support enterprise clients. I admit to it being a bit of a desperation move to call in a relationship favor, but since corporate executives do it, I figured that I would give it a shot. One of these days, I'll be happy to repay the favor in kind if I'm ever in a position to do so. My pal responded later in the day saying that he hadn't heard about the issue but would look into it for me.

I felt that I may have been operating from a bit of an exaggerated perspective, but nothing seemed to prove me wrong. My suspicion at the time was that the shift from BPOS to Office 365 that I discussed before had caused a problem with the reputation score. Based on my architectural reasoning about the shift from the old BPOS gateway to the Office 365 outgoing mail gateway, I figured that the new gateway may have seen a significant, sudden increase in outgoing mail traffic. A reputation scoring algorithm could conceivably attribute such an increase in traffic to a potentially active spam condition, which would result in a lowered reputation score. Perhaps the score would normalize at a higher level again over time, but I believed that the timeframe for a natural normalization would be much longer than we were willing to accept as we moved into Day 4 of our outgoing mail denial-of-service.
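
To make that suspicion concrete, here is a minimal sketch of the kind of scoring behavior I was imagining. It is purely hypothetical: I have no knowledge of how any real reputation service computes its scores, and the function name, thresholds, and penalty and recovery rates below are all assumptions chosen for illustration. The idea is only that a score can drop quickly when volume surges past a slowly moving baseline, and climb back far more slowly.

```python
def simulate_reputation(daily_volumes, baseline_alpha=0.1, spike_ratio=2.0,
                        penalty=0.5, recovery=0.05):
    """Hypothetical sender-reputation model (not any vendor's real algorithm).

    Returns one score in [0, 1] per day. All parameters are illustrative
    assumptions:
      baseline_alpha -- how quickly the "normal" volume baseline adapts
      spike_ratio    -- volume above spike_ratio * baseline counts as a surge
      penalty        -- multiplier applied to the score on a surge day
      recovery       -- amount added back to the score on a normal day
    """
    baseline = float(daily_volumes[0])
    score = 1.0
    scores = []
    for volume in daily_volumes:
        if volume > spike_ratio * baseline:
            # A sudden surge looks like a possible spam outbreak.
            score *= penalty
        else:
            # Trust is regained slowly, capped at a perfect score.
            score = min(1.0, score + recovery)
        # The baseline drifts toward the new volume (exponential moving average).
        baseline = (1 - baseline_alpha) * baseline + baseline_alpha * volume
        scores.append(score)
    return scores


# Hypothetical gateway: three normal days, then migrated traffic arrives.
volumes = [1000] * 3 + [5000] * 10
scores = simulate_reputation(volumes)
for day, (volume, score) in enumerate(zip(volumes, scores), start=1):
    print(f"day {day:2d}: volume={volume:5d} score={score:.3f}")
```

With these made-up numbers, the score collapses within a few days of the surge and has still not returned to its starting value by the end of the run, which is the shape of the problem I believed we were living through.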

The OP from the Office 365 discussion forum seemed to agree with my assertion and implied that we should expect to see more of these types of conflicts as more organizations move to serve customers in a cloud-oriented environment. As is the case with most security controls, it seemed that the algorithm that drove the reputation scores was ineffective when applied to mail received from a cloud-based mail service provider. For example, if Microsoft were to sign a new large corporate contract with all employees transitioning to an outgoing mail gateway, that increase in traffic could potentially trigger a similar condition for all Office 365 customers. That condition is novel when you consider that the same organization was likely managing its own servers before, where traffic patterns would typically change by only a small percentage each day based on employee turnover. Perhaps as more customers sign up for Office 365 services, any one new customer will represent less of a change in the overall outgoing mail traffic load. But it seemed reasonable for us to think that a new Office 365 service, or one from any other provider, could be subject to wild traffic swings as new customers migrated to the service. That a new cloud service could be so impacted by customer transition patterns should be food for thought for any corporate executive considering a move to the cloud.
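
The arithmetic behind that dilution argument is simple enough to sketch. The message counts below are invented for illustration and do not come from Microsoft or any other provider; the point is only that the same new customer is a huge relative jump for a young shared gateway and a rounding error for a mature one.

```python
def relative_increase(existing_daily_messages, new_customer_daily_messages):
    """Fractional jump in gateway traffic when one new customer migrates.

    Both arguments are hypothetical daily outgoing message counts.
    """
    return new_customer_daily_messages / existing_daily_messages


# Hypothetical new customer sending 50,000 messages per day.
new_customer = 50_000

# Hypothetical shared gateways at different stages of adoption.
for label, existing in [("young service", 10_000),
                        ("growing service", 500_000),
                        ("mature service", 10_000_000)]:
    jump = relative_increase(existing, new_customer)
    print(f"{label:15s}: traffic rises by {jump:.1%}")
```

A scoring algorithm that reacts to percentage changes in volume would see the first case as an alarming surge and would barely notice the last.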

As Sunday, May 13 dawned, it seemed that the few email messages that did get through the outgoing mail block were merely taunting me into a false sense of hope. Only three of the many messages we had sent actually got through, and those were all to the primary contractor. Our attempts to email government accounts confirmed that no global progress was being made. Confident that we thoroughly understood what the problem was, we began assigning blame.

Jump to Part 7: Blame When Things Go Wrong.
