What Should We Expect from Cloud Providers During an Outage?
By Seth Miller | June 25th, 2014
Reality check: Outages happen. They happen with on-premises systems, and they happen in the cloud. It doesn’t matter whose service it is. We could argue about who’s more reliable, but in the end, assuming that you’re up enough of the time that it’s a tenable solution, we generally judge our providers on how they handle the outages.
Here’s How Not to Handle an Outage
There was a fairly widespread and serious Microsoft Lync Online outage on June 23, 2014 that lasted for over eight (8) hours, from the late morning into the evening (Eastern Time). During the outage, Lync was completely unavailable to a large group of customers. After the dust settled, we took a look at how it was handled. A number of things rubbed us the wrong way about this particular case.
It took too long to get diagnosed, acknowledged, and resolved.
At Miller Systems, we primarily use Lync for IM and screen sharing, and the occasional video call. We have alternative methods for doing those things, so this wasn’t the end of the world. But if Lync Online were our primary phone system, as it is for plenty of people, we’d have a pretty severe reaction to a situation like this. If there were a natural disaster, emergency incident, etc., we’d probably cut Microsoft a little slack – but this was plain, old-fashioned, garden-variety “something broke” downtime.
If you expect to use a self-service dashboard for primary support, you’d better come clean – and do it quickly.
We were affected; it was hard to miss. We were all suddenly, forcibly logged out of Lync at around 12:00 pm (ET). Microsoft didn’t post their “Investigating” status update (marked as “12:06 pm”) until they also posted an acknowledgment of “Service Degradation” at 12:41 pm.
That’s roughly 40 minutes of “what’s going on?” accompanied by some questionable timeline sleight of hand. The fact that we didn’t hear a peep from Microsoft until some 40 minutes into a “total downtime” for this service is squarely in the “unacceptable” category.
Own the real problem. Don’t spin it – that just upsets customers and makes IT look bad.
Microsoft’s message on the Service Health Dashboard was curiously framed as a relatively mundane sign-in issue…
“Customer Impact: Affected customers are seeing a spinning circle when attempting to sign in to the Lync Online Service. Customers who were connected to the service were signed out, and were then unable to sign back in.”
“Percent of Users Affected: Customers experiencing this issue may see up to 100% of their users affected.”
And then later… once some of the sign-in issues were purportedly resolved…
“Upon a successful login, customers may experience reduced functionality.”
Come on, now. If you can’t sign in, you can’t use Lync. Hiding behind “Sign-In” as the only problem is a major cop-out. Fact: this was a total outage. If you’re going to use a dashboard to support users, you have to be truthful.
At this point, given the “reduced functionality” they’ve described to us, how are no other services listed as being in some sort of non-optimal state?
Set manageable expectations.
“Restoring Service” started being the message at 1:45 pm. But the issue wasn’t resolved until well into the evening. According to the dashboard, the final resolution was at 8:24 pm ET, but it may have been even longer. That’s too long to expect customers to wait, especially during the business day – but worse, if we see “restoring service” at 1:45, we certainly don’t expect a resolution to take 6-7 hours.
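For anyone who wants to check the math on that timeline, here is a minimal Python sketch. It uses only the times quoted in this post; the roughly 12:00 pm start is our own observation, not a Microsoft figure, and the dictionary keys and function names are ours, made up for illustration.

```python
from datetime import datetime

# Times as quoted in this post (Eastern Time, June 23, 2014), in 24-hour form.
# "users_signed_out" is approximate: it is when we were logged out, not an
# official Microsoft timestamp.
TIMELINE = {
    "users_signed_out": "12:00",
    "degradation_acknowledged": "12:41",
    "restoring_service_posted": "13:45",
    "resolution_posted": "20:24",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count as e.g. '6h 39m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


# How long customers waited for any acknowledgment.
silence = minutes_between(
    TIMELINE["users_signed_out"], TIMELINE["degradation_acknowledged"]
)
# How long "Restoring Service" stayed up before the posted resolution.
restoring = minutes_between(
    TIMELINE["restoring_service_posted"], TIMELINE["resolution_posted"]
)
# Total outage by the dashboard's own numbers.
total = minutes_between(
    TIMELINE["users_signed_out"], TIMELINE["resolution_posted"]
)

print("Silence before acknowledgment:", as_hours_minutes(silence))
print("Time spent 'restoring service':", as_hours_minutes(restoring))
print("Total outage:", as_hours_minutes(total))
```

By the dashboard's own timestamps, "restoring service" covered most of the outage.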
Follow through and provide closure, everywhere you’ve opened a dialog.
Microsoft was kind enough to tweet something at around 4 pm ET (and hey, what took so long?), but they never tweeted again to let people know that the issue was resolved. Also, is it that hard to provide a shortlink to the Portal? What if “SHD” (Service Health Dashboard) isn’t part of your everyday vernacular?
We did eventually get our hands on a post-incident report, but we had to locate it on our own. As system admins, we would have appreciated any proactive communication directing us to the report – or any other type of incident summary – so we could be confident in the final resolution.
We’re not Microsoft-bashing, but there were clearly a number of opportunities for them to do a better job here. That certainly goes for the tech side of things, but perhaps more importantly, the communication could have been far better than this. Be sure to ask your provider what their processes are for handling and communicating similar situations when they arise.
Update 6/25/2014: Two days later, the Office 365 Twitter account’s last mention of Lync is the one we pointed out above, in which they acknowledge the outage.