Blue Screens Of Death Invade Planet Earth

August 6, 2024 by guest author, Viktor Petersson

Guest Post: Viktor Petersson, Screenly

It’s Saturday morning. Emergency services in Alaska are knocked out, meaning you can’t even get through to 911. Atlanta, Delta’s hub, is full of people sleeping wherever they can, and the queue to the help desk is more than a mile long. Thousands of flights are grounded, and there are neither hotels nor rental cars available in the area. Doctors around the world are turning away patients because they can’t access their patient records. What is the common denominator? A piece of software from a company called CrowdStrike.

For those not familiar with CrowdStrike, or specifically the Falcon sensor platform from CrowdStrike, it is an endpoint security product. This is fancy security industry jargon for a piece of software that monitors a computer (or device) and enforces the requirements set by a policy (e.g., is the firewall enabled?). These products can do more than this, but you get the gist. In the world of enterprise software, deploying endpoint security products is very common. In part, the reason for this is that these companies need to adhere to various compliance frameworks, such as SOC 2. CrowdStrike is considered by many to be the path of least resistance to get the auditors off your back.

What is relevant is that this piece of software is usually installed on all devices, including digital signage players running Windows. This is why CrowdStrike impacted a large number of digital signage players around the world.

While we still don’t really know the full extent of the outage, by most estimates, it is one of the largest IT outages recorded so far. The impact spanned both on-premises devices and servers running on cloud providers like Azure. What’s worse, this happened on a Friday. The golden rule across most tech organizations is “never deploy on Fridays.” This was clearly not heeded here.

We talked in more detail about the outage on our podcast, The Changelog, but in short, here’s what happened:

Now, this is where it gets really messy. This piece of software interfaces with Windows at the lowest level (i.e., with the kernel) and is invoked very early in the boot process. Had this just been an ordinary application running in Windows, little harm would have been done. The software would likely have crashed, and that would have been the end of the story.

This was not the case here.

Since the software crashed so early in the boot process, the entire system ground to a halt. The only way to recover the device was to enter recovery mode and run a few commands. There was no way to do this remotely. Neither Microsoft nor CrowdStrike could help you here. You needed someone to be there and do it in person. Had it happened on a handful of devices, it might not have been a big deal, but we’re talking about millions of devices around the globe.

But wait, it gets worse. Since most CrowdStrike users were enterprises, many, if not most, mandate Full Disk Encryption (FDE) on their devices as part of their endpoint policy. In the case of Windows, this is normally done using BitLocker. To enter recovery mode on a device with BitLocker enabled, you need to enter the BitLocker recovery key to unlock the device. This is a 48-digit numeric string that needs to be entered by hand.

So not only do you need to enter a handful of commands, but you also need to enter the BitLocker recovery key before you even get to that point.

It’s hard to stress how bad this is. The vast majority of these devices were deployed at sites where no engineers were available to visit in a timely fashion. Most companies relied on trying to train non-technical staff to do this out of necessity.

We have seen some very clever tricks though. Some people out there realized that you could use a barcode reader to help with this. For those not familiar with how barcode readers work from a technical perspective, they are just keyboards. That means you could print a piece of paper with a few barcodes, one per step (enter the BitLocker recovery key, run the commands, etc.), and have someone on site scan them in order.

So what have we learned?

There are a lot of lessons learned here. It’s easy to point the finger at CrowdStrike, but there is plenty of blame to go around. Some have blamed Microsoft for having poor error handling. This is no doubt true. Microsoft, in turn, is ironically pointing the finger at the EU for forcing them to open up the lower level of their operating system (i.e., the kernel) to third-party vendors like CrowdStrike. Others blame the individual IT teams at the affected organizations for not having done their part in testing and gradually rolling out this update. Had they done this, this disaster could have been easily mitigated.

Blame aside, let’s look at what we can do as an industry to prevent the next occurrence of this, limiting the scope to the world of digital signage.

Pick the right tool for the job

Windows is a very poor option for digital signage to start with. It was never designed as an operating system for this type of use case. Yes, you can use it, but it’s a case of “if all you have is a hammer, everything looks like a nail.” Endpoint security is more or less a hard requirement for Windows devices in the enterprise world because there are so many tweaks that need to be made. This is not the case for, say, most smartphones or smart thermostats.

What you really need is an operating system that is more designed for IoT workloads than a desktop environment. If you need to “Remote Desktop” into your signage players, you’re doing it wrong.

Much of the digital signage world runs on Linux (of various flavors). That’s not to say that Linux doesn’t have its own shortcomings (most Linux devices are deployed without any device management, for instance), but you are starting from a stronger baseline.

In fact, Android and Smart TV operating systems are probably better choices than Windows (assuming they are being updated and secured responsibly).

Not to say that something similar could not have happened to any other operating system, but the probability is lower.

Have proper deployment mechanisms in place

If you are a vendor that produces digital signage software that goes onto players (like we are at Screenly), you need to have proper procedures in place for how you roll out your software. At Screenly, we use something called staged rollouts. It’s what all large tech companies like Google and Meta use. What this means is that we never roll out updates to 100% of our devices in one go. Instead, we start with 5% of the devices. If we don’t see any issues within the first 24 hours, we bump it up to 10%, and then gradually increase it more aggressively.
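To make the idea concrete, here is a minimal sketch of how a staged rollout could pick which devices get an update at each stage. It is not Screenly’s actual implementation. The function names, the device IDs, the release string, and every stage after 5% and 10% are assumptions made up for illustration. The trick is to hash each device into a stable bucket from 0 to 99, so that raising the percentage only ever adds devices and never reshuffles them.

```python
import hashlib

# Hypothetical rollout stages (percent of the fleet). The 5% and 10% steps come
# from the text above; the later steps are illustrative assumptions.
STAGES = [5, 10, 25, 50, 100]


def bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket in the range 0-99 for a given release."""
    digest = hashlib.sha256(f"{release}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_update(device_id: str, release: str, rollout_percent: int) -> bool:
    """Return True if this device falls inside the current rollout percentage."""
    return bucket(device_id, release) < rollout_percent


if __name__ == "__main__":
    # Hypothetical fleet of 1,000 players.
    fleet = [f"player-{n:04d}" for n in range(1000)]
    for percent in STAGES:
        selected = [d for d in fleet if should_update(d, "2024.08.1", percent)]
        print(f"{percent:>3}% stage -> {len(selected)} of {len(fleet)} devices")
```

Including the release name in the hash means a different slice of the fleet goes first each time, so the same unlucky 5% of devices are not always the guinea pigs.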

If you are a buyer of digital signage software as a service where the signage players are managed by a vendor, make sure that they do staged deployments. If you are managing and deploying yourself, make sure that you have processes in place to gradually deploy updates (such as “canary deployments”) to catch things early. Ideally, this should be combined with “rollbacks” that allow you to reverse the update quickly.
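As a rough illustration of how a canary-style gate with rollback might work, the sketch below widens a deployment one stage at a time and stops as soon as a stage looks unhealthy. Everything here is hypothetical: the `StageReport` fields, the 99% health threshold, and the `check_stage` callback stand in for whatever telemetry and rollback tooling you actually have.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class StageReport:
    """Health summary for one rollout stage (hypothetical fields)."""
    updated: int  # devices that received the update in this stage
    healthy: int  # devices that checked in healthy afterwards


def stage_is_healthy(report: StageReport, min_healthy_ratio: float = 0.99) -> bool:
    """Return True only if enough updated devices reported back healthy."""
    if report.updated == 0:
        return False  # no data yet: do not widen the rollout
    return report.healthy / report.updated >= min_healthy_ratio


def run_rollout(
    stages: Iterable[int],
    check_stage: Callable[[int], StageReport],
) -> Tuple[str, int]:
    """Walk through the stages; stop and signal a rollback at the first bad one."""
    reached = 0
    for percent in stages:
        report = check_stage(percent)
        if not stage_is_healthy(report):
            # This is where a real system would trigger its rollback mechanism.
            return ("rolled_back", percent)
        reached = percent
    return ("completed", reached)


if __name__ == "__main__":
    def fake_check(percent: int) -> StageReport:
        # Simulated telemetry: from the 10% stage on, half the players are unhealthy.
        updated = percent * 10
        healthy = updated if percent < 10 else updated // 2
        return StageReport(updated=updated, healthy=healthy)

    print(run_rollout([5, 10, 25, 50, 100], fake_check))
```

In this simulated run the rollout stops at the 10% stage, which is the whole point: the other 90% of the fleet never sees the bad update.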

Had CrowdStrike done a staged rollout, this would have been caught way earlier and the deployment would have been stopped.

Oh, and don’t deploy on Fridays!

The case for diversified environments

Digital signage was less impacted than many other industries because it’s a lot less homogeneous than, say, the desktop market. There are loads of vendors, which all have their own preferences. From cheap Android players from China, to custom Smart TV operating systems (like Tizen), to customized Linux distributions, one thing one cannot say is that the signage market is homogeneous. And this is actually a good thing. Had the digital signage market been as homogeneous as the PC desktop market, the impact on signage would have been far greater.

Many companies talk about consolidating around a single piece of software to simplify management (and often to cut cost), but this outage is a good argument for diversified systems. When the next CrowdStrike-style issue hits, the more diverse the environment is, the smaller the blast radius.
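The blast radius argument is simple arithmetic, and a back-of-the-envelope sketch makes it visible. The fleet sizes and platform names below are invented purely for illustration, and the model assumes a bad update takes out every device on exactly one platform.

```python
from typing import Dict


def blast_radius(fleet: Dict[str, int], affected_platform: str) -> float:
    """Fraction of the fleet knocked out if one platform fails completely."""
    total = sum(fleet.values())
    if total == 0:
        return 0.0
    return fleet.get(affected_platform, 0) / total


if __name__ == "__main__":
    # Hypothetical fleets of 1,000 players each.
    homogeneous = {"windows": 1000}
    diverse = {"windows": 250, "linux": 400, "android": 200, "tizen": 150}

    for name, fleet in (("homogeneous", homogeneous), ("diverse", diverse)):
        share = blast_radius(fleet, "windows")
        print(f"{name}: {share:.0%} of screens down")
```

With these made-up numbers, the same incident darkens every screen in the single-platform fleet but only a quarter of the screens in the mixed one.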

ABOUT THE WRITER

Viktor Petersson is the founder and CEO/CTO of Screenly, self-described as the ultimate digital signage platform: accessible, versatile, and powering content across devices globally. With over 10,000 screens worldwide, it’s developer-friendly for seamless integration and boundless creativity. He is also the founder of sbomify and has a podcast: Nerding Out With Viktor.

  1. Wes Dixon says:

    There is another step as well: finish the job! Do not cheap out on security or access. Yes it will cost you more, but wouldn’t you rather do the reboot and all of that other “recovery” from your desk?

  2. Viktor Petersson says:

    100% Wes. Security is usually an afterthought at best.
