It took six days after I detected an outgoing mail Denial-of-Service for Microsoft to publish a public admission that a problem did truly exist. In the contemporary fast-paced IT world, for any problem to take six days to recognize is like waiting to be taken across the river Styx. But I doubt that Microsoft was working on its obituary.
Cause
Currently Office 365 outbound email servers have a SenderBase reputation of neutral and a score of 0. As a result any policy set to throttle or reject mail from a server rated neutral or with a score of less than or equal to 0 may impact delivery of the mail from Office 365 customers.  
Microsoft currently believes this is due to an instance where a large number of bulk mail messages were sent to a user via a server that contributes reputation information. This mail did not get classified as spam by us, the sender is reputable, but the volumes, combined with Cisco’s rating system, have temporarily reduced our email servers' reputation in their SenderBase service. According to Cisco, it will take time and additional mail flowing through their system to retrain it and restore our email servers’ reputation.
Workaround
Organizations that are unable to receive mail should consult their Cisco documentation to determine how to override reputation filtering for the Microsoft IP addresses listed in Forefront Online Protection for Exchange Online URLs and IP Addresses.
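To make the mechanics in that statement concrete, here is a minimal sketch of how a receiving gateway's reputation policy might treat a sender whose score sits at 0, and how an allowlist override of the kind the workaround describes would bypass it. This is not Cisco's configuration syntax or scoring logic. The class name, the threshold values, and the 203.0.113.0/24 network (a reserved documentation range, not a Microsoft address) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class ReputationPolicy:
    """Hypothetical receiving-side policy; thresholds are illustrative only."""
    reject_at_or_below: float = -3.0    # assumed cutoff for outright rejection
    throttle_at_or_below: float = 0.0   # a neutral score of 0 lands here
    allowlist: list = field(default_factory=list)  # CIDR strings exempt from filtering

    def decide(self, sender_ip: str, score: float) -> str:
        addr = ip_address(sender_ip)
        # An allowlisted sender skips reputation filtering entirely.
        if any(addr in ip_network(net) for net in self.allowlist):
            return "accept"
        if score <= self.reject_at_or_below:
            return "reject"
        if score <= self.throttle_at_or_below:
            return "throttle"
        return "accept"


# Placeholder address standing in for an outbound mail server.
sender = "203.0.113.10"

default_policy = ReputationPolicy()
print(default_policy.decide(sender, 0.0))

overridden = ReputationPolicy(allowlist=["203.0.113.0/24"])
print(overridden.decide(sender, 0.0))
```

Under these assumed thresholds, the first call throttles the neutral-rated sender and the second accepts it because of the override. The point of the sketch is that the fix lives entirely on the receiving side: every affected recipient organization has to add its own exception.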
This initial statement from Microsoft illustrates how the cloud is driving IT culture from a detect-and-respond behavior to a report-and-wait behavior. Even a company as large as Microsoft, one that was "all hands on deck" in responding to this incident three days prior, couldn't do anything to fix the problem.
Everyone was doing their own thing right, but no one was working together on making sure that the integration points were sound. It raises the question, "Who's responsible for making sure that everything works well TOGETHER?" When an organization deploys and manages its own services, those integration points are generally all under some level of that organization's control. In a cloud services context where multiple providers combine individual services into a consolidated offering, the control boundaries are a lot more difficult to discern.
I submit that the most rational conclusion is that the service consolidator is the one most responsible for managing the integrity of the integration points. In this case, perhaps Microsoft was not "at fault," but it had accepted responsibility for the proper functioning of our email services when we entered into an Office 365 contract, and usage of that service was the root cause of the problem. In this instance, I believe that Microsoft initially failed to act appropriately in response to our Denial-of-Service, ignoring the pleas from a couple of small businesses until we learned to effectively game the system to get attention. Even at this point, six days after I detected the problem, Microsoft is claiming dependence on Cisco and passing responsibility back to Office 365 customers.
Based on our research, we were able to determine that several Outlook.com gateways retained positive ratings through this entire event. That fact raises the question of why Microsoft didn't transfer the outgoing mail delivery of affected Office 365 customers to those alternative gateways. I make no claim to understand the actual system architecture that Microsoft has implemented or whether our suggested transfer would have been possible, but Microsoft's failure to adequately respond to our related inquiries leaves room only for supposition.
Organizations seeking to subscribe to cloud services should understand from this case that relinquishing operational responsibility doesn't relieve them of understanding how the services will function to support their business needs. In the productivity domain that Office 365 embodies, that understanding includes areas such as how the services depend on external functions and how the service provider will respond to indirect outages. Also understand the subtle difference between relinquishing operational responsibility and relinquishing functional responsibility. My colleague Jason Shropshire and I gained a whole new appreciation for how complex Business Continuity Planning becomes when subscribing to cloud services for critical business functions. Despite choosing a leading service provider that should be more capable of addressing critical failures than we would have been, we should never have assumed that the promise of cloud computing would translate into a decreased disaster potential. 
In my career as either an in-the-trenches consultant or a CIO, I have never before faced as catastrophic a critical business system failure as my company faced with Office 365. If I had been a CIO responsible for moving my company to Office 365 only to face the worst outage imaginable, I would have either tendered my resignation or legitimately expected to be fired. My caution to those responsible for cloud services within their organizations is that they recognize that subscribing to cloud services changes only the perception of risk. In reality, organizations simply transform operational risk into individual and personal risk. If your organization has no reasonable expectation of dropping a cloud service provider when things go wrong, then there is no significant risk transference to the service provider.
As May 15, 2012 dawned, we had moved into Day 7 of our outgoing mail Denial-of-Service on Office 365. Our patchwork solution that augmented the working functions of Office 365 with outgoing mail delivery from nsMail represented a Frankenstein's monster of business services, but at least we had life support running. All we could do was stand by and wait for the problem to simply go away.
Jump to the final posting in this series: Just Forget About It.