
The Shift from Chaos to Controlled Reliability Testing

Chaos engineering, the practice of proactively injecting failure to test system resilience, has evolved. For enterprises today, the focus has shifted from chaos to reliability testing at scale.

“Chaos testing, chaos engineering is a little bit of misnomer,” Kolton Andrus, founder and CEO of Gremlin, told SD Times about the term with which he launched the company. “It was cool and hot for a little while, but a lot of companies aren’t really interested in chaos. They’re interested in reliability.”

For large enterprises, disaster recovery testing, such as a data center evacuation or the failure of a cloud region, is a massive undertaking. Customers have spent hundreds of engineering man-months putting these exercises together, which means the tests run infrequently and organizations stay exposed to risks that only appear under load.

The new focus is on building scaffolding that makes this testing repeatable and easy to run across a whole company with a few clicks. Andrus noted that a crucial element is safety: Gremlin integrates with system health signals so that if anything goes wrong, the changes are cleaned up, rolled back, or reverted immediately, preventing actual customer risk.
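
As one illustration of that safety scaffolding, the sketch below shows the general pattern of a guarded experiment: inject a fault, watch a health signal, and revert no matter what. It is a minimal sketch, not Gremlin's implementation; the threshold, the duration, and the three callables are all hypothetical stand-ins for real fault-injection and monitoring hooks.

```python
import time

ERROR_RATE_LIMIT = 0.05      # hypothetical abort threshold on a health signal
EXPERIMENT_SECONDS = 300     # how long the fault stays injected

def run_guarded_experiment(inject, revert, error_rate):
    """Inject a fault, watch a health signal, and always revert.

    `inject`, `revert`, and `error_rate` are caller-supplied callables;
    they stand in for whatever fault-injection and monitoring hooks
    an organization actually uses.
    """
    inject()
    deadline = time.time() + EXPERIMENT_SECONDS
    try:
        while time.time() < deadline:
            if error_rate() > ERROR_RATE_LIMIT:
                # Health signal breached: stop early; cleanup still runs below.
                return False   # the experiment surfaced a real risk
            time.sleep(5)
        return True            # system stayed healthy under the fault
    finally:
        revert()               # runs on success, failure, or crash
```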

How to Test Against a Cloud Data Center

A key question for any company is how to simulate a major failure, such as an AWS data center outage. “Ultimately, we are doing some disruption in production because that’s what you’re testing,” Andrus explained. Gremlin’s tooling can essentially create a network partition around a data center or availability zone. “So if I’ve got three zones, I can make one zone a true split brain. It can only see itself, it can only talk to itself.” By testing at the network layer, he said, organizations gain the ability to undo things quickly if something goes wrong. “We’re not making an API call to AWS and saying ‘Shut down Dynamo, and remove these buckets,’ or ‘Shut down all my EC2 instances in this zone for an hour,’ because that’s hard to revert and you might get throttled by the AWS API when you bring it back up.” Gremlin itself, Andrus said, was built to be zone redundant from the beginning, so if one zone’s data centers fail, the application can keep running in another zone.
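
To make the network-layer approach concrete, here is a minimal sketch of how a single Linux host could be cut off from its sibling zones with iptables DROP rules, and undone just as quickly. The CIDR ranges are hypothetical, root privileges are assumed, and production tooling like Gremlin's applies this fleet-wide with health-signal safeguards rather than ad hoc scripts.

```python
import subprocess

# CIDR blocks of the *other* two availability zones (hypothetical values)
OTHER_ZONE_CIDRS = ["10.0.1.0/24", "10.0.2.0/24"]

def _iptables(action, cidr):
    # -A appends a rule; -D deletes the identical rule (requires root)
    subprocess.run(
        ["iptables", action, "OUTPUT", "-d", cidr, "-j", "DROP"],
        check=True,
    )

def partition_zone():
    """Drop outbound traffic to the other zones, so this host can only
    see and talk to its own zone -- a one-host 'split brain'."""
    for cidr in OTHER_ZONE_CIDRS:
        _iptables("-A", cidr)

def heal_partition():
    """Delete the DROP rules: the fast revert that shutting down EC2
    instances for an hour does not offer."""
    for cidr in OTHER_ZONE_CIDRS:
        _iptables("-D", cidr)
```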

The direct revenue impact, calculated by comparing the estimated number of expected orders against the drop in actual orders, is only the floor of an outage’s cost; the total impact is much greater. It includes a substantial engineering cost: teams spend days detecting, triaging, and fixing the problem, then determining the root cause, followed by meetings and follow-up work.
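
As a back-of-the-envelope illustration of that floor calculation, here is a minimal sketch; the order counts and average order value are entirely hypothetical.

```python
def outage_cost_floor(expected_orders, actual_orders, avg_order_value):
    """Direct revenue impact: orders lost during the outage window
    times the average order value. This is the floor, not the total."""
    lost_orders = max(expected_orders - actual_orders, 0)
    return lost_orders * avg_order_value

# Hypothetical hour-long outage: 10,000 orders forecast, 6,500 completed
print(outage_cost_floor(10_000, 6_500, 80.0))  # 3,500 lost orders -> 280000.0
```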

When tests fail, the remediation is guided by reliability intelligence, which draws from millions of previous experiments run through Gremlin to deduce likely causes and provide concrete, concise recommendations on how to fix the issues.

The biggest risks are often not the network itself but the resulting failures in microservices. Subtle gaps, such as running in multiple regions while relying on a database in only one, or not distributing state across zones, can cause issues like lost customer carts or transactions. Company-wide testing focuses on the “glue and all the wiring” that connects services: DNS, traffic routing, and propagating important data across zones.
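
One way to reason about those subtle gaps is a simple audit of where each service’s state actually lives. The inventory below is entirely hypothetical, but it flags the kind of single-zone dependency a zone-partition test would expose.

```python
# Hypothetical inventory: each service lists the zones where its
# stateful dependencies (databases, caches, cart stores) are placed.
SERVICES = {
    "checkout": {"orders-db": ["us-east-1a"],
                 "cart-cache": ["us-east-1a", "us-east-1b"]},
    "catalog":  {"catalog-db": ["us-east-1a", "us-east-1b", "us-east-1c"]},
}

def single_zone_risks(services):
    """Flag stateful dependencies pinned to one zone -- the kind of
    gap that loses carts or transactions when that zone partitions."""
    return [
        (service, dependency)
        for service, deps in services.items()
        for dependency, zones in deps.items()
        if len(zones) < 2
    ]

print(single_zone_risks(SERVICES))  # -> [('checkout', 'orders-db')]
```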

Ultimately, Andrus said, it’s about “finding those risks and fixing them so when the real thing happens, you don’t get surprised by this alternate behavior.”

