“In all chaos, there is a cosmos, in all disorder a secret order.”
A Chaos Engineer at Gremlin, Ana helps companies avoid outages by running proactive chaos engineering experiments. At Reactive Summit in Montreal this October, Ana offers a deep-dive into the world of chaos engineering with her talk “The practice of Chaos Engineering”.
In anticipation of Ana’s talk at Reactive Summit, we spoke to Ana about her developer journey, the importance of reliability and how breaking things on purpose can help organizations build more reliable systems.
What is your background and what sparked your interest in chaos engineering and cloud computing?
I began writing code at the age of 13. There was something fun about building things out of nothing, and it enabled me to teach myself a few languages early on. I did a lot of work on web and mobile development and then transitioned to the infrastructure world. I stumbled upon cloud computing and chaos engineering when I first started working at Uber on their Site Reliability Engineering team, focusing on their chaos engineering tool, uDestroy. I then transferred to Uber’s Cloud Infrastructure team, where I worked on building a tool for bringing Uber to the cloud using GCP and AWS. I rapidly learned that working on internal infrastructure tools was interesting to me. I found value in building services, especially when it meant keeping a company like Uber, with thousands of microservices, reliable. I’m now working at Gremlin as a Chaos Engineer, helping companies avoid downtime by proactively running chaos engineering experiments.
What problems do you solve as a part of your job?
Working at a small startup means the types of problems I work on are constantly changing. As a pioneer in the chaos engineering space, I get to help others see that their infrastructure can break at any moment, and that breaking it on purpose is how they build more resilient systems. My favorite part of my job is learning about the infrastructure of different companies and their struggles with scaling their microservices.
Reactive is a new buzzword for many traditional developers. What is your prediction for its importance in application development over the next couple of years?
I strongly believe in making reliability one of the first things to think about when starting to develop a service, application or company. Reactive makes that a priority, and I believe reactive development is only going to become more prominent in companies.
What is the biggest challenge companies deploying distributed Reactive systems are facing?
I’m a bit biased on this question, as I spend most of my time helping others learn how to use chaos engineering to build more resilient systems. Resiliency is extremely hard, especially in microservice architectures or when running anything at scale. The last thing any company wants is to have any sort of downtime or failure in their systems. A multi-day outage is unacceptable.
What is the best solution to this challenge?
“Break things on purpose” is what we like to say at Gremlin. Chaos engineering is the practice of running thoughtfully planned experiments designed to reveal the weaknesses in our systems. There are too many points of failure in applications these days, but you can inject chaos at any layer.
What is your most ambitious professional dream that you hope to achieve one day?
I’ve had a few, but I think the most ambitious one is to one day become president of Costa Rica.
Who should attend your talk and what will they learn?
My talk is open to engineers of all levels, though beginner and intermediate engineers will get the most out of it. They will learn what chaos engineering is, why it’s important and how to get started.
Whom would you like to connect with at the conference?
Everyone and anyone! It’s my first time interacting with the Reactive Systems world, and I feel like it’s a good conference for networking and learning.