On Friday, July 19th, CrowdStrike, a cybersecurity technology company providing endpoint security and cybersecurity response services, experienced a significant outage that impacted an estimated 8.5 million devices. CrowdStrike and the blue screen of death briefly became synonymous as the error popped up on millions of devices globally. EZO wasn’t affected by the CrowdStrike outage, but we are currently researching how many of our customers experienced problems and to what extent.
In this article, we’d like to break down some key concepts and help readers understand what happened. Then, we’ll shed light on how other software applications with access to IT devices can prevent such an outage.
Understanding Ring 0 and Ring 3
In computer systems, particularly those running Microsoft Windows, there are different levels of privilege for executing code:
- Ring 0 (Kernel Mode): This is the most privileged level where the core of the operating system runs. Code running in Ring 0 has unrestricted access to all hardware and memory, so its correctness is crucial for the performance and stability of the whole system.
- Ring 3 (User Mode): This is a less privileged level where regular applications and software run. Code running in Ring 3 has restricted access to hardware and system resources. This prevents accidental or malicious interference with the system.
Only trusted code is allowed to run in kernel mode because any error there can disrupt the entire system.
Applications (programs like web browsers or games) need to use hardware (like keyboards and printers) to function. The operating system acts as a mediator, managing these requests and coordinating with the hardware to deliver the needed services.
Device drivers are specialized pieces of software that know how to operate specific hardware. They act as translators between the operating system and the hardware.
Device drivers are crucial for compatibility, safety, and communication, which is why they are carefully developed and tested to ensure they meet safety and performance standards. Once approved, they can operate in kernel mode, directly and efficiently managing the hardware to keep the computer running smoothly and securely.
Testing standards for device drivers
WHQL
WHQL (Windows Hardware Quality Labs) is a testing process conducted by Microsoft. It ensures that hardware and drivers meet certain standards of compatibility and reliability with Windows. When a driver or software passes WHQL testing, it receives a certification that indicates it should work well with the Windows operating system.
Anti-malware software such as CrowdStrike’s also needs an ELAM certification to operate in kernel mode.
ELAM
ELAM (Early Launch Anti-Malware) is a specific certification requirement for anti-malware drivers. It ensures that these drivers can be trusted to load early in the boot process before other third-party drivers. This provides a first line of defense against rootkits and other malicious software.
Why CrowdStrike’s agent needs kernel mode access
CrowdStrike’s security agent needs to operate in Kernel Mode (Ring 0) to perform its functions effectively. By being in Kernel Mode, the agent can:
- Monitor and intercept low-level system activities
- Protect the system from malicious actions that might bypass user mode protections
- Ensure high-performance and real-time security measures
When code running in Kernel Mode (Ring 0) crashes, it results in a system crash or blue screen of death (BSOD). That’s likely what happened during the CrowdStrike outage.
This happens because:
- Kernel mode has direct access to all system resources. A failure here can corrupt the core functions of the operating system.
- The system cannot recover gracefully from such a crash, leading to an immediate shutdown or restart to prevent further damage.
The CrowdStrike outage
The key factor behind the CrowdStrike outage was likely the use of P-Code within its security agent. P-Code is an intermediate representation of code that can be executed by a virtual machine within the kernel.
This approach allows CrowdStrike to introduce segments of code that are not directly compiled into the driver, thereby bypassing the WHQL (Windows Hardware Quality Labs) certification process. However, this practice also introduces the risk of deploying unverified and potentially unstable code into the kernel, which ultimately led to the system crashes and widespread disruptions experienced during the outage.
In addition to WHQL, anti-malware drivers like CrowdStrike’s also need to pass the ELAM certification. In the case of CrowdStrike, while ELAM certification ensures that the driver can launch early, it does not encompass all the changes made through the use of P-Code.
This loophole meant that CrowdStrike could update parts of its agent without full WHQL certification, focusing instead on maintaining ELAM compliance. As a result, the agent, despite having ELAM certification, introduced unverified code into the kernel, which caused the system instability and crashes observed during the outage.
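To make the risk concrete, here is a minimal sketch in Python of a toy interpreter that validates a program before running it. Everything in it is an assumption made for illustration: the two opcodes, the `Instruction` type, and the `validate` and `run` functions are invented here and do not describe CrowdStrike’s actual P-Code format or agent. The point is only that interpreted content is data, so the interpreter has to check it itself; certification of the compiled driver does not cover it.

```python
from dataclasses import dataclass

# Hypothetical two-instruction bytecode, invented for illustration only.
OP_LOAD = 0  # push inputs[operand] onto the stack
OP_HALT = 1  # stop and return the top of the stack


@dataclass(frozen=True)
class Instruction:
    opcode: int
    operand: int = 0


def validate(program, input_count):
    """Return a list of problems found; an empty list means the program passed."""
    problems = []
    for position, instruction in enumerate(program):
        if instruction.opcode not in (OP_LOAD, OP_HALT):
            problems.append(
                f"instruction {position}: unknown opcode {instruction.opcode}"
            )
        elif instruction.opcode == OP_LOAD and not 0 <= instruction.operand < input_count:
            problems.append(
                f"instruction {position}: operand {instruction.operand} "
                f"is outside the {input_count} available inputs"
            )
    if not program or program[-1].opcode != OP_HALT:
        problems.append("program does not end with HALT")
    return problems


def run(program, inputs):
    """Interpret a program, refusing anything that fails validation."""
    problems = validate(program, len(inputs))
    if problems:
        raise ValueError("; ".join(problems))
    stack = []
    for instruction in program:
        if instruction.opcode == OP_LOAD:
            stack.append(inputs[instruction.operand])
        elif instruction.opcode == OP_HALT:
            break
    return stack[-1] if stack else None
```

Calling `run` with a program whose `OP_LOAD` operand points past the end of `inputs` raises a `ValueError` before any instruction executes. In Python the unchecked access would merely raise an exception; in kernel-mode native code the equivalent out-of-range read can take down the whole machine, which is why the check has to happen first.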
CrowdStrike’s boot-start driver and system stability
CrowdStrike marks its driver as a boot-start driver. This means the driver is active from the very beginning of system startup, providing protection from the earliest stages of the boot process.
However, this also introduces a significant risk: if the CrowdStrike driver encounters an issue and crashes, it can prevent the Windows system from starting correctly. The system may only be able to restart in safe mode, where the CrowdStrike driver is not loaded, to troubleshoot and resolve the issue.
Recovery then requires physical access to the machine and manual removal of the offending files.
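On Windows, how early a driver loads is recorded in the `Start` value under its service key in the registry, where 0 means boot-start. The Python sketch below decodes that value. The function names `describe_start_type` and `read_start_type` are made up for this illustration, and the service name is left as a parameter because this article makes no claim about what CrowdStrike’s driver is called. The registry lookup itself only works on Windows.

```python
import sys

# Documented values of the "Start" entry for a Windows service or driver.
START_TYPES = {
    0: "boot-start (loaded by the boot loader)",
    1: "system-start (loaded during kernel initialization)",
    2: "auto-start (started by the Service Control Manager)",
    3: "demand-start (started on request)",
    4: "disabled",
}


def describe_start_type(value):
    """Translate a numeric Start value into a readable description."""
    return START_TYPES.get(value, f"unknown start type {value}")


def read_start_type(service_name):
    """Read the Start value for a service from the registry (Windows only)."""
    if sys.platform != "win32":
        raise OSError("the Windows registry is only available on Windows")
    import winreg  # imported here so the module still loads on other platforms

    key_path = rf"SYSTEM\CurrentControlSet\Services\{service_name}"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "Start")
    return value
```

An administrator auditing a fleet could call `describe_start_type(read_start_type(name))` for each third-party driver to see which ones are able to block startup if they fail.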
CrowdStrike and the blue screen of death explained
The outage occurred because of a problem with CrowdStrike’s agent operating in kernel mode and running an unsigned code segment. When this agent encountered an issue, it caused system crashes (BSOD) across many devices. Since the agent’s driver is marked as boot-start, the crashes were severe and led to widespread disruptions that could only be resolved by manually intervening on each affected device.
Any software that has access to IT devices can cause a similar outage, though the impact may vary.
EZO’s product suite includes AssetSonar, an IT Asset Management (ITAM) and Software Asset Management (SAM) solution used in high-importance industries such as tech, education, healthcare, and financial services. Let’s briefly look at how AssetSonar operates and how it is designed from the ground up to protect against such an outage.
AssetSonar’s approach to stability and security
The agent for AssetSonar, EZO’s ITAM and SAM solution, operates exclusively in user mode and does not run any code in kernel mode.
This design choice ensures that the agent cannot cause the system to crash, as it operates with restricted access to system resources.
User mode applications, like the AssetSonar Agent, are isolated from the core functions of the operating system. So, any issues or crashes they encounter are contained within the application itself and do not impact the overall stability of the system. This makes AssetSonar’s approach inherently safer and more stable, avoiding the risks associated with kernel mode operations.
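The containment described above can be observed with a small Python sketch. It starts a child process that terminates abnormally and shows that the parent simply receives a non-zero exit status and carries on. The child here is a stand-in for any crashing user-mode program, not the AssetSonar Agent itself, and `run_isolated` is a name chosen for this example.

```python
import subprocess
import sys

# Child program that terminates abnormally, standing in for a crashing agent.
CRASHING_CHILD = "import os; os.abort()"


def run_isolated(code):
    """Run Python code in a separate user-mode process and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # keep the child's output out of the parent's console
    )
    return completed.returncode


def child_crashed(code):
    """Report whether the child ended abnormally; the parent keeps running either way."""
    return run_isolated(code) != 0
```

`child_crashed(CRASHING_CHILD)` returns `True` while the calling process and the operating system keep running. The exact exit status differs between operating systems, so the sketch only checks that it is non-zero.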
We prioritize the reliability and security of the AssetSonar platform and implement advanced safeguards to protect our customers. The recent issues experienced by CrowdStrike, whether due to a software defect or a cyberattack, highlight the importance of a comprehensive approach, similar to the one used for AssetSonar.
- Rigorous quality assurance: Our updates undergo extensive testing and phased rollouts to ensure stability and prevent disruptions. We implement critical functionalities in user mode rather than at the kernel level, minimizing the risk of system-wide crashes.
- Proactive communication and support: We maintain transparent communication with our customers, providing timely notifications and detailed release notes. Our dedicated support team is always ready to assist in resolving any issues promptly.
- Advanced security measures (in our roadmap): We will employ real-time monitoring, behavioral analysis, and strict access controls to detect and mitigate potential threats. We will ensure secure update channels and maintain a robust incident response plan to address any incidents swiftly.
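As an illustration of what a phased rollout can look like, the Python sketch below assigns each device to a stable bucket and widens the rollout only while the observed crash rate stays under a threshold. It is a generic example, not a description of EZO’s release tooling: the function names, the stage percentages, and the 1% threshold are all assumptions chosen for the example.

```python
import hashlib


def rollout_bucket(device_id):
    """Map a device ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(device_id, percent):
    """A device receives the update once the rollout percentage passes its bucket."""
    return rollout_bucket(device_id) < percent


def next_stage(stages, current_index, crash_reports, devices_updated, max_crash_rate=0.01):
    """Return the index of the next stage, or None to halt the rollout."""
    if devices_updated == 0:
        return current_index  # nothing to judge yet, so stay at this stage
    if crash_reports / devices_updated > max_crash_rate:
        return None  # too many crashes: stop before reaching more devices
    return min(current_index + 1, len(stages) - 1)
```

With stages such as `[1, 10, 50, 100]`, a device that is included at 10% stays included at 50% because its bucket never changes, and a bad update is stopped at the first stage where the crash rate exceeds the threshold instead of reaching every device at once.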
By combining these measures, AssetSonar remains secure and reliable, providing peace of mind and protecting your critical assets.