Comment on the CrowdStrike preliminary report as sourced from Richard Ford, CTO at Integrity360.
July 2024 by Integrity360
CrowdStrike have now published their preliminary post-incident report (PIR) into the issue that brought 8.5 million Windows hosts, and a lot of the world, to a halt. The report is available in full on the CrowdStrike website, but here are some initial thoughts after reviewing it and considering it against the backdrop of what we’ve observed across our affected customer base.
With such a wide-scale, brand-affecting incident, the recovery for CrowdStrike was always going to be rooted in transparency. No software company will ever be 100% bug free; that’s just not reality, and issues, outages and vulnerabilities will occur. But we can judge a software organisation on two things: how robust their development & testing processes are at limiting the frequency of issues and, when an incident does occur, how they respond to it.
The scrutiny placed on CrowdStrike is derived from their position in the IT stack. As an endpoint security platform, and specifically an Endpoint Detection & Response (EDR) solution, it operates in kernel mode via kernel drivers that permit access to lower-level internals of the Windows operating system. Operating in kernel mode gives an EDR great power to gain visibility into system processes and activity, and provides the ability to act on and prevent malicious actions. But, as with Spider-Man, with great power comes great responsibility. Kernel drivers must be developed to be completely robust and stable. Unlike in user mode, where a runtime issue can fail gracefully and affect only that application, failure of a kernel driver leads to the type of exception that ends with a Blue Screen of Death (BSOD).
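To make that distinction concrete, here’s a minimal, hypothetical C sketch (illustrative only, nothing to do with CrowdStrike’s code) of why the same class of fault plays out so differently in user mode versus kernel mode. In user mode an access violation can be caught with Windows structured exception handling and contained to the offending process; inside a kernel driver there is no equivalent safety net, and the same fault bugchecks the whole machine, which is the BSOD.

```c
/* Minimal user-mode illustration (MSVC / Windows): a bad memory access
 * is caught by structured exception handling, so only this process is
 * affected. The same dereference inside a kernel-mode driver cannot be
 * contained like this and would result in a system bugcheck (BSOD). */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    __try {
        volatile int *p = NULL;
        *p = 42;                              /* access violation */
    }
    __except (EXCEPTION_EXECUTE_HANDLER) {
        puts("User mode: fault caught, application fails gracefully.");
    }
    return 0;
}
```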
At the very end of the preliminary report CrowdStrike have promised a future root cause analysis (RCA) once their investigation is complete. Even ahead of that full root-and-branch RCA, there’s a fair amount of detail in the preliminary report. The transparency we need seems to be coming. If we look at how CrowdStrike have reacted in the face of adversity, they’ve done a reasonably good job. They’ve held their hands up, they rolled back the faulty Channel File reasonably quickly, on the whole they’ve communicated regularly and often with customers and partners, they’ve provided fixes and recovery steps, and now we’re seeing some of the transparency required to rebuild that trust.
What does the report say?
We can fully test that last point once the full RCA is available, as the preliminary report still leaves questions unanswered and some niggling doubts. So, what does the preliminary report say? Within it, CrowdStrike detail their security content configuration update architecture, along with what happened and how these components had the effect they did.
CrowdStrike’s security content configuration architecture, as laid out in the PIR, is broken down into two component parts: Sensor Content and Rapid Response Content. The former is shipped only with CrowdStrike Falcon agent updates, which are fully controllable by end users through the Sensor Update Policy settings, and provides a wide range of security capabilities, either introduced or updated as part of Sensor Content updates. This includes new Template Types that allow threat detection engineers to define threat content. Rapid Response Content, on the other hand, consists of the security definitions and IOCs that utilise the capabilities and Template Types available in Sensor Content updates in order to instruct the Falcon agent on how to detect current and emerging threats. These are pushed globally to customers by CrowdStrike when available, regardless of any Sensor Update Policies.
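One way to picture that split, purely as an assumed illustration (the names and fields below are mine, not CrowdStrike’s), is that a Template Type ships inside the sensor and defines which parameters a detection capability accepts, while a Rapid Response Template Instance, delivered via Channel Files, only fills in those parameters for a capability already present on the endpoint.

```c
/* Hypothetical sketch of the relationship described in the PIR.
 * Names and fields are illustrative assumptions, not CrowdStrike's. */

/* Sensor Content: shipped with the Falcon agent itself and subject to
 * customer-controlled Sensor Update Policies. A Template Type defines
 * the detection capability and which parameters it accepts. */
typedef struct {
    const char *name;            /* e.g. a hypothetical "IPC" type      */
    int         expected_fields; /* parameters an instance must supply  */
} TemplateType;

/* Rapid Response Content: delivered via Channel Files and pushed
 * globally, independent of Sensor Update Policies. A Template Instance
 * only supplies parameters for a Template Type already on the host. */
typedef struct {
    const TemplateType *type;    /* capability it configures            */
    const char        **fields;  /* detection parameters / indicators   */
    int                 field_count;
} TemplateInstance;
```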
In terms of what happened on the 19th July, CrowdStrike have outlined in the preliminary report the series of events that led to the global outage. Firstly, as part of a Sensor Content update released on 28th February 2024 (Falcon agent v7.11), a new IPC Template Type was introduced to detect novel attack techniques that abuse Named Pipes. Releases of Sensor Content are rigorously tested through unit testing, integration testing, performance testing and stress testing, and then further tested internally and with early adopters prior to being made generally available. This was the case with this update and the new IPC Template Type, with stress testing completed on 5th March 2024 and successful deployments to production completed on the 8th & 24th April 2024.
The problem arises when we look at the testing of the IPC Template Instances that make up the Rapid Response Content. It appears, from the information available in the preliminary report, that these are only tested by a Content Validator tool that performs validation checks on content prior to release. Unfortunately, in this instance, a bug in that tool allowed the invalid content to pass muster and, combined with the confidence drawn from the stress testing and the success of the previous releases, this ended with the corrupt file being pushed to all online Falcon agents.
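The PIR doesn’t detail exactly what the Content Validator bug was, so the following is no more than a hypothetical C sketch of how a validator can pass content that later proves fatal: it verifies one property of an instance (here, a simple field count) while an oversight means another property (here, whether every field is actually populated) is never examined.

```c
/* Hypothetical Content Validator sketch: NOT CrowdStrike's code, just an
 * illustration of how a validation bug can let bad content through. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char **fields;     /* detection parameters in the instance */
    size_t       field_count;
    size_t       expected;   /* fields required by the Template Type */
} Instance;

/* Buggy check: verifies the number of fields, but never verifies that
 * each field is actually populated, so a NULL entry "passes muster". */
static bool validate(const Instance *inst)
{
    return inst->field_count == inst->expected;
    /* Missing: per-field checks such as fields[i] != NULL, length, syntax */
}

int main(void)
{
    const char *fields[] = { "pipe-name-pattern", NULL, "severity:high" };
    Instance bad = { fields, 3, 3 };

    if (validate(&bad))
        printf("Validator: OK -> content released to all online sensors\n");
    return 0;
}
```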
So there was clearly a deficiency in the testing process when it came to Rapid Response Content, probably because it was never considered a likely point of failure, or the impact of a failure there was never fully considered, particularly set against the rigorous testing carried out on Sensor Content updates. The other issue was the deployment strategy. Deploying globally meant the issue was that much more impactful, and the rollback and recovery that much more difficult once the error had been identified.
Lesson learnt. CrowdStrike are implementing steps to make sure this doesn’t happen again:
Software Resiliency and Testing
Improve Rapid Response Content testing by using testing types such as:
Local developer testing
Content update and rollback testing
Stress testing, fuzzing and fault injection
Stability testing
Content interface testing
Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content being deployed in the future.
Enhance existing error handling in the Content Interpreter.
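That last point, together with the fuzzing and fault injection item above, is worth unpacking. Below is an assumed C sketch (not CrowdStrike’s implementation) of what defensive error handling in a content interpreter can look like, paired with a crude fuzz-style loop that feeds it deliberately malformed input and expects an error rather than a crash. The design point is that the interpreter on the endpoint should not assume the upstream validator is perfect.

```c
/* Hypothetical sketch: a content interpreter that fails safely on
 * malformed input, plus a crude fuzz loop to exercise it. This is an
 * assumed illustration, not CrowdStrike's implementation. */
#include <stdio.h>
#include <stdlib.h>

#define REQUIRED_FIELDS 8   /* assumed number of fields for illustration */

/* Returns 0 on success, -1 on malformed content. The interpreter checks
 * what it was given before indexing into it, rather than trusting that
 * the validator upstream never lets bad data through. */
static int interpret_content(const char **fields, size_t field_count)
{
    if (fields == NULL || field_count < REQUIRED_FIELDS)
        return -1;                       /* reject, don't crash the host */

    for (size_t i = 0; i < REQUIRED_FIELDS; i++)
        if (fields[i] == NULL)
            return -1;                   /* unpopulated field: reject    */

    /* ... apply the detection logic described by the fields ... */
    return 0;
}

int main(void)
{
    /* Crude fuzz / fault-injection loop: hand the interpreter short or
     * hole-ridden field arrays and confirm it only ever returns an error. */
    srand(1234);
    for (int run = 0; run < 10000; run++) {
        const char *fields[32] = { 0 };
        size_t count = (size_t)(rand() % 32);
        for (size_t i = 0; i < count; i++)
            fields[i] = (rand() % 4 == 0) ? NULL : "field";
        (void)interpret_content(fields, count);
    }
    puts("Interpreter survived all malformed inputs.");
    return 0;
}
```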
Rapid Response Content Deployment
Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment (a minimal sketch of such a rollout follows this list).
Improve monitoring for both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.
Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.
Provide content update details via release notes, which customers can subscribe to.
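The first item in that list is arguably the one that matters most, and a staggered rollout is simple to reason about. Here’s a minimal, assumed C sketch of a canary-first deployment loop; the cohort sizes, health check and threshold are invented for illustration and are not CrowdStrike’s actual process.

```c
/* Hypothetical sketch of a staggered, canary-first rollout. Cohort sizes
 * and the health check are assumed for illustration only. */
#include <stdbool.h>
#include <stdio.h>

/* Placeholder: in reality this would aggregate sensor and system telemetry
 * (crash reports, heartbeat loss) from the hosts updated so far. */
static bool cohort_healthy(double percent_of_fleet)
{
    (void)percent_of_fleet;
    return true; /* assume feedback looks good for this illustration */
}

int main(void)
{
    /* Gradually widen the blast radius, starting with a small canary. */
    const double cohorts[] = { 0.1, 1.0, 10.0, 50.0, 100.0 };
    const size_t n = sizeof cohorts / sizeof cohorts[0];

    for (size_t i = 0; i < n; i++) {
        printf("Deploying Rapid Response Content to %.1f%% of sensors\n",
               cohorts[i]);
        if (!cohort_healthy(cohorts[i])) {
            printf("Degradation detected: halting rollout and rolling back\n");
            return 1;
        }
    }
    printf("Rollout complete\n");
    return 0;
}
```

Had something like this been in place on the 19th July, a fault of the kind described in the PIR would in principle have been caught at the canary stage rather than reaching the entire online sensor base at once.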
There are still some questions that need to be answered, and I’m sure they will be once the full RCA is released. One of the core questions is not how the Content Validator missed the invalid file, but how that file became invalid in the first place.
As we get closer to the end of this incident, I think it’s clear that we will look back on it, and the way it was handled by CrowdStrike, as an example of what good can look like in the face of adversity. They’ve been transparent, they quickly implemented the immediate fix and identified the long-term solutions to prevent it from happening again, and they actively engaged with customers and partners to recover. There are valuable lessons to learn and implement across the industry.