Saturday, January 26, 2013

Cloud Computing Danger: Seeking Problem Clarity


This posting is Part 5 of the Case Study in Cloud Computing Dangers.

After establishing the legitimacy of our outgoing mail Denial-of-Service the morning of May 11, we expected Microsoft to resolve the issue by the end of the day. Since it was related to some SPAM condition associated with the Office 365 outgoing mail gateways, we assumed Microsoft would be able to rally its resources to quickly address the technical problems and enable us to re-establish communications with our largest customers. We were overly optimistic.

As we moved into Day 3, we started getting a much broader view of the problem's scope. At 4:10 PM, a project colleague employed by the prime contractor (the company to which mine was subcontracted) received an internal email entitled "Blocked External Email from Potentially Unsafe Senders." The advisory stated that the organization's "email security vendor" had issued new scores for many domains, lowering their email reputation to "Neutral" and causing messages from those domains to be dropped as SPAM under the organization's SPAM threshold settings. Blocked email domains included several very large government contracting organizations and four U.S. government agency domains.

The irony at this point was that I knew from past experience that the organization was leveraging the Microsoft Forefront email protection gateway for SPAM filtering. Despite the pain of still being unable to communicate with our customers, it amused us to think that one Microsoft service was blocking another Microsoft service. Yet, Microsoft was still suggesting that we pursue "whitelisting" as a corrective action. I felt like we were living in some bad sequel to Office Space.

Business executives need a good understanding of hosted and cloud architectures to see how this could happen. It took several days, but by this point I was confident that I understood how systemic failures in the cloud business architecture made this Denial-of-Service condition highly probable for any organization.

Cloud services are available to satisfy just about any IT need that an organization may have. With regard to email, an organization may outsource the entire system to another company such as Microsoft, Google, or Network Solutions, or may outsource just pieces of it. In many cases, organizations subscribe to services that check email for malicious software and SPAM characteristics before the email reaches their own systems. For one organization that I worked with in the past, that filtering reduced its email traffic by as much as 50%. Reductions at that level allow organizations to substantially reduce the expense of maintaining their email systems.

Whereas full email outsourcing appeals most to small businesses, larger enterprises may opt to maintain their own email services but leverage email protection services to reduce overhead. Because of this disparity in need, large service providers such as Microsoft have good cause to run the two services separately so that each can focus on its target audience. From a business perspective, it would also be reasonable for the people responsible for maintaining the services to sit in completely different groups on the organization chart. In business, revenue streams generally trump similarities in service delivery when deciding where a group belongs in the organization, since the business generates revenue from customers, not services. It is a subtle point, but an important one when considering how an organization like Microsoft could allow one service line to block the expected functions of another. In this case, the two organizations probably had no direct interaction and little "business" motivation to collaborate.

Based on the architectural troubleshooting, we knew that email sent from our Office 365 accounts through a Microsoft outgoing mail gateway was being marked with a low SPAM rating issued by sendmail.org. When some incoming mail gateways, including those used by both of our problem recipients, noted the score, they immediately rejected the email as SPAM. We were not receiving any "bounce" messages notifying us of the rejections, however. Sending a bounce only tells SPAM purveyors that they have found valid servers and addresses, so the recipient servers simply discard the message, the Internet equivalent of sending light into a black hole.
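For readers who think better in code, here is a rough sketch of the filtering behavior described above. It is only an illustration under my own assumptions: the `Reputation` scale, the `filter_inbound` function, the default threshold, and the domain names are all hypothetical, and no real gateway product's scoring scale or API is being represented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Reputation(IntEnum):
    """Hypothetical reputation scale; real vendors define their own."""
    POOR = 0
    NEUTRAL = 1
    GOOD = 2


@dataclass
class Verdict:
    delivered: bool    # did the message reach the recipient's mailbox?
    bounce_sent: bool  # was the sender told about a rejection?


def filter_inbound(sender_domain, reputations, threshold=Reputation.GOOD):
    """Decide what a receiving gateway does with a message (assumed policy).

    Unknown senders are treated as NEUTRAL. Anything scoring below the
    recipient's threshold is dropped silently, with no bounce, so the
    sender never learns the message was rejected.
    """
    score = reputations.get(sender_domain, Reputation.NEUTRAL)
    if score >= threshold:
        return Verdict(delivered=True, bounce_sent=False)
    return Verdict(delivered=False, bounce_sent=False)


# Hypothetical data: a shared outgoing gateway downgraded to NEUTRAL.
reputations = {
    "shared-gateway.example": Reputation.NEUTRAL,
    "partner.example": Reputation.GOOD,
}
print(filter_inbound("shared-gateway.example", reputations))
print(filter_inbound("partner.example", reputations))
```

The point of the sketch is that the sender's view is identical in both cases: no bounce ever comes back, so a dropped message looks exactly like a delivered one until someone notices the silence.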

Later that day, the original poster on the Office 365 thread that we were using to coordinate efforts with other affected companies announced that Microsoft tech support had stated that it was working on the problem and that it would be resolved by the end of Saturday, May 12, over 24 hours after acknowledging that the problem existed. It was a bit too little, too late, but we recognized that resolving big problems often takes a lot longer than one would hope. We were forced to accept that our Denial-of-Service would persist into the weekend.

Jump to Part 6: False Hope.