Microsoft has a couple of big things going on right now. They just had a somewhat shaky launch of their updated Windows Mobile 6.5 and are about to start the retail launch of their highly anticipated Windows 7 platform. The last thing they needed was for the infrastructure behind their popular Sidekick device to crater in so spectacular and unrecoverable a manner.

One aspect of the Sidekick that has made it unique is its cloud-based architecture. By default, all of a person’s contacts, emails, photos, and messages are stored up in the cloud. The storage on the device is used more like a local cache, with data persistence becoming a centralized service. The big advantage of this approach was security – if a Sidekick device ended up being lost or stolen, all of a person’s data would still be safe and easily sync-able with a replacement unit.
At least that was the theory.
The reality, described in this announcement from T-Mobile, has ended up being quite different:
Regrettably, based on Microsoft/Danger’s latest recovery assessment of their systems, we must now inform you that personal information stored on your device – such as contacts, calendar entries, to-do lists or photos – that is no longer on your Sidekick almost certainly has been lost as a result of a server failure at Microsoft/Danger. That said, our teams continue to work around-the-clock in hopes of discovering some way to recover this information. However, the likelihood of a successful outcome is extremely low.
I have never, in my experience, seen an outage like this happen – even in tiny start-ups that are running pretty lean and mean data centers. In situations where a catastrophic outage does happen, there will typically be a rollback to an earlier version of the system, with the loss of only recent updates. And that really is a worst-case outage.
But somehow, in this outage, everyone’s data is just gone. All of it. No backup seems to be available.
For what it’s worth, Microsoft does have some really sharp engineering talent, including a few people I know personally. They’ve been running massive data centers for quite a while and clearly understand operational best practices. That makes what happened here, at least to an outside observer, a complete enigma. Losing everything is simply unheard of in professional circles. Whatever the cause, this is just a screw-up of unprecedented proportions.
So what’s next for Microsoft?
They need to take public ownership of the situation. This means more than just getting to the bottom of what happened here and fixing it. Microsoft also needs to be completely transparent about what occurred. No matter how ugly or unflattering it may be, they need to discuss what went on openly and honestly. Most importantly, they need to communicate what steps they are taking to be sure that this type of event won’t happen with ANY Microsoft service again.
Ultimately, the biggest loss that took place as a result of this outage was the loss of trust – trust both in Microsoft and in their cloud-based architectures. It may not be fair, but that’s the reality of the situation they find themselves in. This isn’t the time for Microsoft to just crank up the PR machine and try to spin this. They also shouldn’t try to blame the folks from Danger as a way to distance themselves from this mess. Neither of those approaches will repair the damage. When events like this occur, there is no quick fix. Rebuilding marketplace confidence will be a process that takes both time and effort.
There are a lot of people watching how Redmond responds to all of this. How they handle this situation will be critical to the success or failure of their entire cloud-based strategy – Azure, Office Live, and MyPhone.
The future of the company depends on them getting it right.

I think that they did a great job at first but now they really need to get it together.