Chaos Engineering — Path to Reliability

Rahul Ambhore
3 min read · Apr 4, 2021

As an application's user base and geographic reach grow, and as more and more features are shipped through DevOps practices, it becomes critical to validate our assumptions about how reliable the system really is.

Linkedin

Chaos Engineering is a methodology for improving reliability in a controlled manner. Like DevOps, it is one more paradigm for improving applications and delivering a superior user experience.

To get the most out of chaos engineering, it is important to follow a cycle: plan > implement > monitor > feedback. You cannot simply stop a production VM or delete a SQL database without a recovery plan and without calculating the blast radius.
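The cycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Experiment`, `run`, and every field name are hypothetical, the blast radius is simplified to the fraction of the fleet that the fault touches, and the inject, rollback, and health-check hooks are placeholders for whatever your platform and monitoring stack actually provide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical description of one chaos experiment."""
    name: str
    targets: List[str]                        # instances the fault is allowed to touch
    fleet_size: int                           # total instances in the environment
    max_blast_radius: float                   # largest fraction of the fleet we accept
    inject: Callable[[List[str]], None]       # placeholder: starts the fault
    rollback: Callable[[List[str]], None]     # placeholder: the recovery plan
    steady_state_ok: Callable[[], bool]       # placeholder: monitoring check


def run(exp: Experiment) -> str:
    if exp.fleet_size <= 0:
        raise ValueError("fleet_size must be positive")

    # Plan: refuse to start if the blast radius is larger than agreed.
    radius = len(exp.targets) / exp.fleet_size
    if radius > exp.max_blast_radius:
        return "rejected: blast radius too large"

    # Monitor: never inject a fault into a system that is already unhealthy.
    if not exp.steady_state_ok():
        return "aborted: system unhealthy before injection"

    # Implement: inject the fault, observe, and always run the recovery plan.
    try:
        exp.inject(exp.targets)
        healthy = exp.steady_state_ok()
    finally:
        exp.rollback(exp.targets)

    # Feedback: the result feeds the next planning round.
    return "passed" if healthy else "failed: steady state violated"
```

The rollback sits in a `finally` block so that the recovery plan runs even if the injection or the health check raises an error.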

Chaos engineering is also a team effort; it cannot be implemented without onboarding the core team. You need to onboard stakeholders > dev > ops > SRE for a successful validation of reliability.

Before starting, or even thinking about, chaos engineering, the key thing is to gain an in-depth understanding of your architecture and monitoring stack.

We start in lower environments and gradually move toward higher ones. In some industries, such as health care or financial institutions, it would not be feasible to implement chaos engineering in production right away. You cannot simply stop an ambulance that is supposed to reach an accident site and bring the patient to a hospital because you are running a chaos experiment. So you do need to do your math and understand your maturity.

A common challenge is stakeholders buying into the concept of chaos engineering but not investing in it. Get everyone involved in the project, rather than treating it as just a practice for making your system/stack more reliable. The practice will have some impact on the team, positive or negative.

Credit to PluralSight

We often hear this about on-call support for an application: "we were just handed pager duty and told to go." This is running away from the problem, not solving it. Your on-call engineers need proper training. During any reliability issue every second counts, and time spent just getting the right people on a call adds more pain for your customers.

Future work always takes priority and reliability gets treated as “technical debt”, yet a high-quality distributed system is a competitive advantage.

Fighting your company culture: This can be a critical issue in bigger enterprises. If we propose this idea and someone just says “We already have so much chaos”, that is an interesting point to note and resolve. The mindset needs to be aligned.
Fear and blame cause hesitation, so each instance of chaos engineering needs to be looked at as an opportunity to improve reliability instead of a blame game.

Get Started:
Scan: Assess what could go wrong, evaluate risk
Baseline: Understand steady state. Check for observability gaps
Analyse and Plan: baseline + risk => checklist
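As a rough illustration of how "baseline + risk => checklist" might look in practice, here is a minimal Python sketch. The `Risk` fields, the 1 to 5 scoring scale, and `build_checklist` are assumptions made for this example and not part of any tool; a real assessment would use whatever risk model your team already has.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    """Hypothetical output of the Scan and Baseline steps for one failure mode."""
    component: str
    failure: str
    likelihood: int   # assumed scale: 1 (rare) to 5 (frequent)
    impact: int       # assumed scale: 1 (minor) to 5 (severe)
    monitored: bool   # True if the baseline already observes this failure


def build_checklist(risks: List[Risk]) -> List[str]:
    # Highest risk first, using likelihood x impact as a simple score.
    ordered = sorted(risks, key=lambda r: r.likelihood * r.impact, reverse=True)
    items = []
    for r in ordered:
        if r.monitored:
            items.append(f"Plan experiment: {r.component} ({r.failure})")
        else:
            # An experiment nobody can observe teaches nothing, so fix the gap first.
            items.append(f"Close observability gap: {r.component} ({r.failure})")
    return items


risks = [
    Risk("cache", "node eviction", likelihood=4, impact=2, monitored=False),
    Risk("sql-db", "primary unavailable", likelihood=2, impact=5, monitored=True),
]
for item in build_checklist(risks):
    print(item)
```

Failure modes without monitoring become "close the gap" items rather than experiments, which mirrors the "check for observability gaps" step in the list above.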

Tools to get started:

Azure Chaos Studio: for Azure-related workloads and more

Chaos Monkey: introduced by Netflix

Gremlin: this one is much more mature and supports cloud and most on-prem use cases.
