We started the business day on May 14, 2012 finally able to send email to the primary contractor on our VA project, but not to the VA email accounts. This development did not mean that Day 5 marked the end of the outgoing mail Denial-of-Service between our Office 365 cloud service and just about any mail gateway that used Cisco devices, or any other devices that relied on senderbase.org for SPAM reputation scoring. The organization had simply been shamed (either from within or from without) into lowering its SPAM blocking threshold to allow through any email that was rated Neutral. Not only was the organization unable to receive legitimate email from business partners and clients, it was also forced into making a business decision that would allow more malicious messages to pass through the gateway. It was not a good sign.
Microsoft's "all hands on deck" response was getting us nowhere. So, my colleague Jason Shropshire continued to pester the Cisco Security Twitter feed, pleading for assistance. As we moved into Day 6 of trying to find a resolution amid the organizational chaos between two of the largest technology companies in the world, Cisco responded with two statements over Twitter at around 6 PM:
"We are working closely with MSFT to help with the recovery of the reputation to the IPs affected."
and,
"Cisco observed that MSFT had a spam incident last week that caused a negative impact to the reputation on some of their IP addresses."
I wouldn't interpret Cisco's response as definitive proof that Microsoft had an incident, but it did give us an important piece of information to help firm up our understanding of the architectural failure that had occurred. The reputation scoring algorithm had detected an anomaly on the Microsoft side that caused the score to drop. That scoring event was the root cause of all of our problems.
Nonetheless, Cisco's statement was less encouraging than it was damning, because it confirmed my suspicions about the overall architectural defect that puts any cloud-based messaging service at risk. Because there was some indication that a cloud service provider was suffering from an abnormal traffic event, the reputation score for that provider had dropped. Because this occurred in an automated way, Cisco devices were suddenly blocking Microsoft email traffic.
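To make the failure mode concrete, here is a minimal sketch of how an automated reputation threshold can turn a score drop into a block. The three-tier scale, the names, and the assumption that the affected IPs had fallen to Neutral are all illustrative; this is not Cisco's or SenderBase's actual algorithm.

```python
from enum import IntEnum


class Reputation(IntEnum):
    """Hypothetical three-tier reputation scale (illustrative only)."""
    POOR = 0
    NEUTRAL = 1
    GOOD = 2


def gateway_accepts(sender: Reputation, minimum: Reputation) -> bool:
    """Accept mail only if the sender's reputation meets the gateway's threshold."""
    return sender >= minimum


# Assumption: the provider's sending IPs dropped from Good to Neutral.
provider_after_incident = Reputation.NEUTRAL

# A gateway that insists on Good now rejects every tenant of that provider.
blocked_by_strict_gateway = not gateway_accepts(provider_after_incident, Reputation.GOOD)

# Lowering the threshold to Neutral restores delivery, at the cost of
# admitting every other Neutral-rated sender as well.
accepted_after_lowering = gateway_accepts(provider_after_incident, Reputation.NEUTRAL)
```

Nothing in this decision considers the individual customer sending the message, which is why one provider-wide scoring event could cut off every organization behind those IP addresses.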
Assuming that an outlook.com outgoing mail gateway did suffer a SPAM incident, how would Microsoft be able to correct the resulting Denial-of-Service? The fact that we were approaching Day 6 of this event suggests a problem far broader than a simple automated technology incident. Cloud service providers likely depend on many different organizations to function properly, even those with which the provider has no direct business relationship.
Organizations should recognize from this case that the industry has failed to implement any standard way for providers to collaborate on solutions. Deflect responsibility, blame others, wait, and then fix. That seems to be the standard operating procedure demonstrated by this event. Since we've come to accept defects as a standard business philosophy, we implicitly accept this lack of collaborative solution-making, and the risk it poses to our businesses, when we sign the contract for cloud services.
This puts organizations in the relatively unusual position of being unable to effect a resolution to problems that occur behind the cloud service provider's operational boundary. That's not to say that businesses have no options, only that organizations should have contingency plans that will ultimately enable them to bypass the cloud provider altogether.
In this case, we were fortunate that the Office 365 incoming mail gateway was functioning normally, allowing us to receive mail from the same organizations that were rejecting our outgoing mail. Over the weekend, Jason had realized that we would need to find a way to temporarily minimize our dependence on Office 365 for email delivery. Our company was still under contract with Network Solutions for its nsMail Pro cloud mail service, the service that we used prior to our Microsoft BPOS/Office 365 migration. Jason reasoned that we could use our .net domain (e.g., example.net) through nsMail Pro to send outgoing mail to the problem email gateways and so get around the SPAM reputation issues. In hindsight, we should have done that as soon as we realized that the problem wouldn't be solved quickly.
Jason's solution wasn't elegant, but it was effective at re-enabling our outgoing email service to the problem gateways.
- Establish new accounts on nsMail Pro for each appropriate staff member using the corporate .net domain.
- Set up each Outlook installation to incorporate the new accounts. This may not have worked as well for organizations that only use the web-based interface for email, but our company generally used Outlook to maximize the Exchange capabilities on Office 365.
- Configure the new Outlook account to set the "Reply To" field to the corresponding .com address for the staff member. This would ensure that any reply to our .net address would go to our standard .com inbox.
- Set the .net email address to be the default address for sending mail.
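The header arrangement behind these steps can be sketched in a few lines of Python. This is only an illustration of the From/Reply-To idea, not how we actually configured Outlook; the function names, addresses, and SMTP host are hypothetical placeholders.

```python
import smtplib
from email.message import EmailMessage


def to_fallback_address(primary_address: str, fallback_domain: str) -> str:
    """Map a staff member's primary (.com) address onto the fallback (.net) domain."""
    local_part, separator, _ = primary_address.rpartition("@")
    if not separator or not local_part:
        raise ValueError(f"not an email address: {primary_address!r}")
    return f"{local_part}@{fallback_domain}"


def build_workaround_message(primary_address: str, fallback_domain: str,
                             recipient: str, subject: str, body: str) -> EmailMessage:
    """Send from the fallback domain while steering replies to the primary inbox."""
    msg = EmailMessage()
    msg["From"] = to_fallback_address(primary_address, fallback_domain)
    msg["Reply-To"] = primary_address  # replies land in the usual .com inbox
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_fallback(msg: EmailMessage, host: str, port: int,
                      username: str, password: str) -> None:
    """Hand the message to the fallback provider's SMTP relay (host is hypothetical)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)


# Example with placeholder addresses; nothing is sent here.
msg = build_workaround_message(
    "jane@example.com", "example.net",
    "partner@example.org", "Status update", "Sent via the fallback service.",
)
```

Because the recipient's gateway scores the sending service rather than the Reply-To address, the message leaves through the fallback provider while replies still arrive through the Office 365 inbound gateway, which was unaffected.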
Another reasonable solution would have been to dump Office 365 altogether. But since Microsoft had not done anything "technically" wrong, we would not have had a case for breaking our contract. Also, our heavy use of SharePoint would have made such a transition difficult. So, we were forced to implement a temporary fix until Microsoft could resolve the problem.
Our situation illustrates a good "gotcha" that organizations should be sure to assess carefully. While it may be easy to stand up a cloud service to handle just about any business function, cloud service providers will frequently mix commodity services with proprietary services to make it difficult for businesses to extract themselves from the provider. In our case, deploying internal SharePoint sites that we were using to manage back office business functions made migrating to another cloud messaging provider difficult. The more services that an organization subscribes to through one provider, the more difficult it may be to extract or transfer business information and processes.
120 hours into our outgoing mail Denial-of-Service on Office 365, we were forced to implement our own workaround due to Microsoft's complete failure to fix the problem. Our company had assumed that, since we were subscribing to a cloud service with reasonably high availability, we didn't really need to draft a contingency plan for an outage lasting more than a few hours. Organizations should take note of how wrong we were.
Jump to Part 9: Stand By and Wait.