When software updates go wrong: lessons from the CrowdStrike outage
In July 2024, a faulty software update issued by the cybersecurity firm CrowdStrike caused Windows computer systems to crash around the world. The resulting IT outage created enormous disruption – and some extremely bad press – for the security company.
So, what can we learn from the incident? One thing’s for sure: what happened in July should give software engineering leaders in every industry sector pause for thought.
Make no bones about it, CrowdStrike is a sophisticated organisation. It’s packed with talented engineers working on incredibly complex technology. Looking from the outside, this doesn’t feel like a systemic failure. Rather, it’s more a cautionary tale about how one small mistake during software testing can have massive impacts. And it can happen to anyone.
Let’s remind ourselves what organisations should be doing when it comes to testing – and why. A lot of these points will be familiar and most will seem pretty obvious. But in the press of daily business, it’s easy to forget a few steps or attempt to short-circuit the process. As we’ve seen with CrowdStrike, that can lead to some rather serious consequences.
Local developer testing
One of the best ways to prevent engineering slip-ups is to have developers test locally. Today’s powerful PCs are more than capable of running a full stack, which means your engineers can safely code and test in isolation. With a little forethought, you can also enable them to use real data in these air-gapped testing environments. When testing happens locally, developers can iterate faster and make changes based on real-time feedback. There’s no waiting around for results from distant servers. Plus, sensitive data stays where it should – under your roof.
Content update and rollback testing
If a new release creates an unexpected problem, it pays to be able to revert to an earlier stable version with minimum impact. Since today’s organisations typically build software using CI/CD pipelines, automatically rolling out the last good version shouldn’t be too challenging. That said, to minimise service disruptions, you’ll want to test that this can happen in a matter of minutes. Similarly, having a sound understanding of what ‘good’ looks like is important. That means monitoring and benchmarking your baseline environment so that, when you do test, you can spot any discrepancies and performance regressions.
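As a rough illustration of the idea, here is a minimal Python sketch of a release step that compares a new version against a recorded baseline and falls back to the last good version if it drifts too far. Everything in it is an assumption made for the example: the `Deployer` class, the `measure_latency_ms` probe, the single latency metric and the 20% tolerance are all hypothetical, and a real pipeline would use whatever health signals and tooling you already have.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def within_baseline(observed_ms: float, baseline_ms: float, tolerance: float = 0.2) -> bool:
    """Return True if observed latency is at most `tolerance` above the baseline."""
    return observed_ms <= baseline_ms * (1 + tolerance)


@dataclass
class Deployer:
    # Hypothetical probe: returns the measured latency (in ms) for a version.
    measure_latency_ms: Callable[[str], float]
    # Latency of the known-good baseline environment, benchmarked in advance.
    baseline_ms: float
    # Versions that passed the health check, oldest first.
    good_versions: List[str] = field(default_factory=list)

    def current(self) -> Optional[str]:
        """The version that should be live: the most recent one that passed."""
        return self.good_versions[-1] if self.good_versions else None

    def release(self, version: str) -> Optional[str]:
        """Accept `version` if it is healthy; otherwise the last good version stays live."""
        if within_baseline(self.measure_latency_ms(version), self.baseline_ms):
            self.good_versions.append(version)
        return self.current()
```

The sketch only models the decision. What a rollback test should also measure is how long the fallback takes end to end, since that is what determines how long customers are affected.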
Stress testing
You can’t skimp on this. Only testing what you know – a typical customer flow, for example – leaves you wide open to the unexpected happening. To make stress testing comprehensive, think about every possible way users might engage with the product, and check that it stays robust well beyond normal operating parameters. This is one of those times when ‘good enough’ usually isn’t.
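One inexpensive way to go beyond the typical flow is to throw randomly generated input at a component and check that it only ever fails in a controlled way. The Python sketch below shows the shape of such a test. The `parse_record` function and its three-field, comma-separated format are invented purely for illustration; the point is the `fuzz` loop, which treats a `ValueError` as an acceptable rejection and reports anything else as a defect.

```python
import random
import string
from typing import Callable, List, Tuple


def parse_record(line: str, expected_fields: int = 3) -> List[str]:
    """Hypothetical parser: split a comma-separated record, rejecting malformed input."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != expected_fields or any(f == "" for f in fields):
        raise ValueError(f"malformed record: {line!r}")
    return fields


def fuzz(parser: Callable[[str], object], runs: int = 1000, seed: int = 0) -> List[Tuple[str, Exception]]:
    """Feed random strings to `parser` and collect any uncontrolled failures."""
    rng = random.Random(seed)  # fixed seed so a failing run can be reproduced
    alphabet = string.ascii_letters + string.digits + ", \t\n\x00"
    unexpected = []
    for _ in range(runs):
        length = rng.randint(0, 40)
        line = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            parser(line)
        except ValueError:
            pass  # controlled rejection of bad input is the desired behaviour
        except Exception as exc:  # anything else is a crash we want to hear about
            unexpected.append((line, exc))
    return unexpected
```

Calling `fuzz(parse_record)` should return an empty list, whereas a parser that indexes into its fields without checking how many there are will show up with a list of crashing inputs to investigate.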
Stability/reliability testing
The more complex your environment, the more critical thinking will be needed to ensure your new software can handle whatever is thrown at it without destabilising the operating system or leaking memory. To optimise this process and synthesise real-world scenarios, comb through historical log files to gather events that can be used to populate simulations. These can then be distilled down into an intensive one-hour test procedure in a local environment. Bear in mind though that larger and more complex environments may require a different and more iterative testing approach.
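To make the log-replay idea concrete, the sketch below parses events out of log lines, replays them many times against a handler and uses Python’s standard `tracemalloc` module to see how much memory the run left allocated. The log layout (timestamp, level, message), the `parse_events` and `replay` helpers and the number of rounds are all assumptions made for the example; real logs and real handlers will need their own parsing and their own thresholds.

```python
import tracemalloc
from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]


def parse_events(lines: Iterable[str]) -> List[Event]:
    """Turn 'timestamp level message' lines into events, skipping malformed lines."""
    events = []
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) == 3:
            _timestamp, level, message = parts
            events.append({"level": level, "message": message})
    return events


def replay(events: List[Event], handler: Callable[[Event], None], rounds: int = 200) -> int:
    """Replay `events` repeatedly and return the growth in traced memory, in bytes."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(rounds):
        for event in events:
            handler(event)
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

A handler whose memory use keeps climbing as the same events are replayed is a candidate for a leak. Comparing the figure against a known-good handler, or against the previous release, is usually more informative than judging the absolute number.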
Progressive delivery
Should things go wrong, you’ll want to have a delivery strategy that will limit the blast radius. Staged rollouts are a standard approach that makes the release process more resilient and controlled. This may involve undertaking limited or selective rollouts to specific customer segments or geographies and then monitoring the user experience before releasing more widely. Beta testing is another tried and tested progressive delivery procedure.
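A staged rollout can be sketched in a few lines of Python. The version below assigns each customer to a stable bucket by hashing their ID, widens the rollout through a series of percentage stages and pulls back to zero if the observed error rate crosses a threshold. The stage percentages, the 1% error threshold and the function names are illustrative assumptions, not recommendations.

```python
import hashlib
from typing import Sequence


def bucket(customer_id: str) -> int:
    """Map a customer ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(customer_id: str, percent: int) -> bool:
    """True if this customer should receive the new release at the given percentage."""
    return bucket(customer_id) < percent


def next_stage(
    current_percent: int,
    error_rate: float,
    stages: Sequence[int] = (1, 10, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Return the next rollout percentage, or 0 to halt if errors are too high."""
    if error_rate > max_error_rate:
        return 0  # stop the rollout and pull the release back
    for stage in stages:
        if stage > current_percent:
            return stage
    return current_percent  # already fully rolled out
```

Because the bucket comes from a hash of the ID rather than a random draw, a customer who is included at 10% is still included at 50%, so nobody flips back and forth between versions as the rollout widens.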
Third-party validation
Getting an expert and objective third party to assess your estate and software development process will help accelerate delivery in a safer, lower-risk way. And that’s where we can help. Over the years, we’ve helped fast-growing start-ups establish best-practice guardrails to rapidly roll out new features and updates in a highly structured way.
Similarly, we’ve worked with long-established firms that want to retrofit their environment and backport modern rollback and testing approaches, an especially challenging proposition if their estates contain decades-old software and database systems. The key message here is that where there’s a will, there’s always a way.
If you’d like to elevate the security and robustness of your software development process and reduce the likelihood of bugs reaching production and crashing systems or applications, why not get in touch?