Monday, July 22, 2024

#CrowdStrike Causes a Global Tech Outage - what happened, why, and (how) can it be prevented?

While the memes are amazingly good, and there's a lot of jest being spewed across the interwebs, this is a serious event with massive implications. So, in all seriousness, let's review the facts of the #CrowdStrike situation from 19-Jul-2024: 

As reported across global news outlets and the internets, a security company called CrowdStrike caused some chaos. There are cascading impacts across many industries. 

We are already seeing impacts: 
- courier service delays (UPS, FedEx, DHL, etc.)
- flight delays/cancellations at airports
- small businesses closing for the day
- websites being inaccessible
- hospitals cancelling surgeries/treatments
- municipalities being closed
- government services being delayed
among many other cascading effects that could last days, or weeks. 

While a major inconvenience, the bug was quickly resolved within CrowdStrike's system, so (as of publish date) the latest binaries are stable. Recovery will be slow and tedious, especially for larger networks, but the world will recover from this. 

What happened? As is being reported, a bug introduced during a routine update to their Falcon EDR software (endpoint security software run by millions and millions of customers) caused what is effectively a kernel panic within the Windows operating system - we are seeing this manifest as a "bugcheck error" (aka the Blue Screen Of Death, or #BSOD) on Windows machines. It does not affect #Apple or #Linux devices. Note: It is NOT a #Microsoft problem. 

How can we prevent this? Short answer: WE, as users, can't. However, this isn't the first time a large global tech vendor has caused major outages across the globe, and it won't be the last. 

How can CrowdStrike, or any other company, prevent this? Simply put: adhere to SDLC methodologies, do adequate QA testing, and never do a full production rollout without fully testing in the field. A common practice is to deploy to 10% of the network and see how systems and users respond (yes sysadmins, you can do targeted deployments even if you don't have network segmentation in place). If all goes well, push to 25% and test again, then 50% and test again, then do the full push. That way, when a problem does occur, it doesn't take out everything and can be quickly fixed before a full production push. It's really IT Ops 101 - not that difficult.

This is also a good example of why you should back up your critical data frequently, whether to an external device or a cloud storage service (Google Drive, Dropbox, OneDrive, etc.). You should do this personally as often as you feel is necessary. Most companies have policies governing backup types, schedules, and testing methodologies. 
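To put that 10% → 25% → 50% → 100% ring approach into something concrete, here's a minimal Python sketch. The `deploy_to` and `health_check` functions are placeholders I made up for illustration - your RMM/MDM or patch-management tool provides the real equivalents:

```python
import random

# Hypothetical inventory of managed endpoints; in practice this comes
# from your RMM/MDM or CMDB, not a hard-coded list.
FLEET = [f"host-{n:04d}" for n in range(1, 1001)]

# Deployment rings mirroring the 10% -> 25% -> 50% -> 100% approach above.
RINGS = [0.10, 0.25, 0.50, 1.00]


def deploy_to(hosts, package):
    """Placeholder: push `package` to `hosts` with your actual deployment tool."""
    print(f"Deploying {package} to {len(hosts)} hosts...")


def health_check(hosts):
    """Placeholder: return True only if the hosts still boot, check in,
    and report no bugchecks after the update has soaked for a while."""
    return True


def staged_rollout(package, fleet=FLEET, rings=RINGS):
    # Shuffle so each ring is a representative sample, not one office or subnet.
    ordered = random.sample(fleet, k=len(fleet))
    deployed = 0
    for pct in rings:
        target = int(len(ordered) * pct)
        deploy_to(ordered[deployed:target], package)
        if not health_check(ordered[:target]):
            print(f"Problems at the {int(pct * 100)}% ring - halting the rollout.")
            return False
        deployed = target
    print("Rollout complete.")
    return True


if __name__ == "__main__":
    staged_rollout("falcon-sensor-update-x.y.z")
```

The point isn't this particular script - any decent patch-management tool can do ring-based deployments - the point is that nothing hits 100% of production until a smaller, representative slice has proven it safe.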

For my enterprise admins reading this, I hope you have a solid (and tested) backup methodology in place. Yes, you should test-restore your backups at least once per year, if not more often. If you can't restore the data, then what is the point of backing it up? 
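If you want an idea of what a bare-bones test-restore looks like, here's a hedged Python sketch (illustration only, not any particular backup product's tooling): restore the archive to a scratch directory and compare checksums against the live data. It assumes a plain .tar.gz backup and made-up paths:

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def test_restore(backup_archive: str, source_dir: str) -> bool:
    """Restore the archive into a scratch directory and verify every file's
    checksum against the live source. Returns True if the backup is usable."""
    src = Path(source_dir)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(backup_archive, "r:gz") as tar:
            # Assumes the archive was created relative to src's parent,
            # e.g. `tar -czf backup.tar.gz critical` run from /data.
            tar.extractall(scratch)
        restored_root = Path(scratch)
        files = [p for p in src.rglob("*") if p.is_file()]
        problems = 0
        for original in files:
            restored = restored_root / original.relative_to(src.parent)
            if not restored.exists() or sha256(restored) != sha256(original):
                print(f"MISSING OR MISMATCHED: {original}")
                problems += 1
        print(f"Checked {len(files)} files, found {problems} problem(s).")
        return problems == 0


if __name__ == "__main__":
    # Hypothetical paths - point these at a real backup and its source data.
    test_restore("/backups/critical-2024-07-19.tar.gz", "/data/critical")
```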

So now the big question is: how does this issue get fixed? Well, it's a hands-on-machine fix (which means long days/nights and weekends for IT staffers for a bit). Since the devices are unable to boot, there's no back-of-house configuration that we admins can set to fix this. We literally have to put our hands on the device. The methodology is simple and only takes about 5 minutes to do - but multiply that over hundreds, thousands, or even hundreds-of-thousands of devices and you can quickly see this is not a quick fix at scale. It is an even bigger nightmare for remote workers, who would need to be walked through the fix over the phone, making it a 30-minute fix (at best). In those cases, from my perspective, it makes more sense to send them a replacement machine that is not bricked, then reset the bricked device once it's back in hand. Hopefully you have the inventory ready and waiting; otherwise you need to grab a company credit card and hit up every electronics store in your city. What a fucking PITA. 
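For context, the widely reported manual fix boils down to booting the affected machine into Safe Mode or the Windows Recovery Environment and deleting the faulty channel file from the CrowdStrike driver directory. The Python sketch below only illustrates that one step - in reality you do it by hand (or with a tiny batch/PowerShell script), and you should follow CrowdStrike's official guidance linked below:

```python
# Illustration only - follow CrowdStrike's official remediation guidance.
# Assumes the machine is booted into Safe Mode / WinRE with storage access
# and, unrealistically, a working Python interpreter.
import glob
import os

DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
# The defective channel file reported in public guidance matches this pattern.
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_file() -> int:
    removed = 0
    for path in glob.glob(os.path.join(DRIVER_DIR, FAULTY_PATTERN)):
        print(f"Deleting {path}")
        os.remove(path)
        removed += 1
    return removed


if __name__ == "__main__":
    count = remove_faulty_channel_file()
    print(f"Removed {count} file(s). Reboot the machine normally.")
```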

CrowdStrike's official guidance can be found on their webpage here: https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/ (external link). 

While all of this is happening, most of my peers and I agree that CrowdStrike is still a quality vendor offering quality security products and services. This was just a BIG fuckup by whoever pushed out the update. Clearly, someone did not follow protocol. 

As of this writing, CrowdStrike is the second-largest security vendor in the world, which is why the impact was as massive as it was...and the cascade effect isn't done yet. There will be more fallout from this, not to mention the legal cases that could be brought against them in the aftermath due to the downtime. 

One of the biggest fallouts of this mess is phishing attacks - threat actors are spinning up malicious domains claiming to fix the issue (they won't; they just want your money) and sending emails claiming to fix the issue with "a click" (using a piggy-back technique to install a payload on your machine to do god knows what - oh, and steal your money too). Please do not fall for the phish. It won't end well for you, or your employer. 

There is no "easy button" here, peeps. Just a massive Pain In The Ass. 

#StayCyberSecure 
#BeCyberAware