The Secret to Preventing Downtime: Audit Your Service Providers

July 18th, 2019



Can we just pretend summer never happened?

In the past 60 days, downtime would have been trending almost every week on Twitter, if Twitter had been up. It’s alarming: almost every major social media network, and even sites like Google, experienced downtime.

Significant downtime.

For some platforms, like LinkedIn, it was only 30 minutes, but for others, like Snapchat, it was 4 ½ hours.

But I use the internet every day

Well, the two largest outages, which together took down more than 10% of the internet, were Google’s and Cloudflare’s. Google had a software bug that was accidentally rolled out to a larger number of servers than intended, so instead of running a simple internal test, they made the entire internet their guinea pig.

 

 The Internet According to Google

It wouldn’t have been so bad if the test had been successful instead of turning off the majority of their servers.

Not only did it hit Google’s own services (YouTube lost 10% of its traffic and 10 million Gmail users were affected), it also took down almost every service using Google Cloud.

That included companies from Snapchat and Vimeo to Discord and even Pokemon GO (c’mon, I was in a Raid Battle!).

Going forward, Google said their emergency response tooling and procedures “will be reviewed, updated, and tested.” They also said they’ll extend their disaster recovery tests to include these types of failures.

And then came Cloudflare

Cloudflare had their own “software bug” of sorts as well. Just this one small piece of code: .*(?:.*=.*).

Without going into too much detail (though Cloudflare’s write-up is a great read for the coders out there), that pattern forced the regular expression engine into a loop of backtracking that spiraled out of control, maxed out all of their CPUs (which made it look like a DDoS attack at first), and caused some of the largest sites to go down.
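To get a feel for why a pattern like .*(?:.*=.*) gets expensive, here is a minimal sketch of a toy backtracking matcher that counts how many steps it takes. It is an illustration only: `count_steps` is a hypothetical helper that supports just literal characters, `.` and `*`, it is not Cloudflare’s engine, and the input strings are made up. Since the `(?: )` group only groups, the sketch uses the equivalent `.*.*=.*` and compares it with the simpler `.*=.*`.

```python
from typing import Tuple


def count_steps(pattern: str, text: str) -> Tuple[bool, int]:
    """Match pattern at the start of text with a toy backtracking matcher.

    Supports only literal characters, '.' and '*'.
    Returns (matched, number of match attempts made).
    """
    steps = 0

    def match_here(p: int, t: int) -> bool:
        nonlocal steps
        steps += 1
        if p == len(pattern):
            return True  # whole pattern consumed (no end anchor)
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            # Greedy star: grab as many characters as possible...
            end = t
            while end < len(text) and pattern[p] in (".", text[end]):
                end += 1
            # ...then give them back one at a time until the rest matches.
            while end >= t:
                if match_here(p + 2, end):
                    return True
                end -= 1
            return False
        if t < len(text) and pattern[p] in (".", text[t]):
            return match_here(p + 1, t + 1)
        return False

    return match_here(0, 0), steps


for n in (10, 20, 40):
    text = "x=" + "x" * n
    _, redundant = count_steps(".*.*=.*", text)
    _, simple = count_steps(".*=.*", text)
    print(f"{len(text):>3} chars: {redundant:>4} steps vs {simple:>3} steps")
```

In this toy model, doubling the input roughly doubles the work for the simple pattern, while the work for the pattern with two stacked `.*` grows much faster. Multiply that by a huge volume of requests and it is easier to see how CPUs end up pinned.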

Once everything maxed out, every customer they have, from Quizlet to Discord, went down with it. And, in a brilliant piece of irony, even Downdetector found itself looking elsewhere to detect the extent of their downtime.

Cloudflare CEO Matthew Prince accepted 100% responsibility for the outage, and the company has outlined the steps it is taking to prevent this from happening again.

What about the Zuck

Then, the day after the Cloudflare debacle, Facebook and Instagram had a file transfer problem that led to a significant outage. It took down their platforms, and they had to take to Twitter to keep their users in the loop.

 

 You can’t hide from this one, Mark

Unfortunately, Facebook’s transparency was lackluster, but they did say that the outage was triggered during their routine maintenance.

It was fully resolved after about 11 hours, which basically means an entire day without Facebook or Instagram.

It would have been nice to get a more precise answer from Facebook about the details of the outage, but it looks like we’ll have to wait until their next outage for that.

If their new outages are becoming a trend though, it’ll only be 2 months from now.

 

Is Tweety Bird okay?

One week later, Twitter bites it.

On the bright side, going to their website during their downtime gave you a message that they knew something was wrong.

 

Twitter’s page during their downtime

They identified the issue at 2:58pm EDT and gave little information about what was going on during their outage.

A spokesperson referred inquirers to their status page. According to the status page, it was due to “an internal configuration change,” which feels a bit like a page directly out of Facebook’s ‘triggered during routine maintenance’ book.

Fortunately, it was resolved for most users by about 3:45pm EDT and for all users in almost exactly 24 hours.

 

Is anyone really affected by an outage?

Even an outage at a third party can affect a company’s uptime tremendously, and outages don’t just affect employees; they affect the customers of the company having the outage. Anyone using HubSpot, Discord, Coinbase, DigitalOcean, Asana, Yelp, Lyft, or any of about 2,600 more companies would have experienced this downtime. The Cloudflare outage by itself affected a staggering 9.9% of all websites on the internet.

 

How Cloudflare feels now

It’s only a little downtime

Well, in general, outages are fairly common, with about 50% of all companies experiencing 24 hours of downtime a year. And the losses in sales, advertising, accounting, and every other affected department quickly add up and cost your company.
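To put 24 hours of downtime a year in perspective, here is a rough back-of-the-envelope sketch. The function names and the $5,000-per-hour revenue figure are assumptions made up for illustration; plug in your own numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability_percent(downtime_hours: float) -> float:
    """Share of the year a service was up, as a percentage."""
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)


def downtime_cost(downtime_hours: float, revenue_per_hour: float) -> float:
    """Revenue lost if every down hour costs revenue_per_hour (a simplification)."""
    return downtime_hours * revenue_per_hour


# Hypothetical company earning $5,000 per hour of uptime.
print(f"{availability_percent(24):.2f}% uptime")            # 99.73% uptime
print(f"${downtime_cost(24, 5_000):,.0f} in lost revenue")  # $120,000 in lost revenue
```

So a full day of downtime still works out to roughly 99.7% uptime, which sounds great on paper and can still be very expensive.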

There are options, and probably the easiest is building in redundancy. Put simply, if one of your services fails, you have another avenue to compensate and immediately fill in that gap.
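As a rough sketch of what redundancy can look like in code, the snippet below tries a list of providers in order and falls back to the next one when a call fails. Everything here is hypothetical: `call_with_failover`, `primary_cdn`, and `backup_cdn` are stand-ins for illustration, not any real provider’s API.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Return the result of the first provider that succeeds."""
    errors: List[Exception] = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllProvidersFailed(errors)


def primary_cdn() -> str:
    # Simulates the main provider having an outage.
    raise ConnectionError("primary provider is down")


def backup_cdn() -> str:
    return "served by backup"


result = call_with_failover([primary_cdn, backup_cdn])
print(result)  # served by backup
```

The order of the list is the whole design: the primary is tried first, and the backup only does work when the primary fails. If every provider fails, the error is raised rather than hidden, so you still find out.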

Or, if your provider is already redundant, then you don’t need to worry as much. It might be difficult to know offhand if your partners are redundant, but every partner should discuss it at length in their SLA. If Cloudflare had had a recovery plan that could be quickly implemented (AKA disaster recovery), then a failed update, or even a true DDoS attack, wouldn’t have meant downtime for their customers as well.
