The Scientific Method for Testing System Resilience


Transcript

Yakomin: My name is Christina Yakomin. I’m a senior site reliability engineering specialist at Vanguard, one of the largest investment management companies in the world. The vast majority of our interactions with our clients happen through the web, so the availability of our sites is absolutely critical to our success and to the success of our clients, the investors. This talk is called the scientific method for resilience. My goal is to teach all of you the technique that we use at Vanguard for identifying which chaos experiments to run to gain confidence in the overall resilience of our systems, and the cyclical, step-by-step process that we follow to ensure continuous learning. This process is based on the scientific method that we probably learned all the way back in elementary school. To start things off, let’s review what that process is. It’s six steps that happen in a cycle, starting with observation. Based on what we observe, we ask questions. Then, in answering those questions, we’re able to derive hypotheses. We either prove or disprove those hypotheses through experimentation, then determine the results through analysis. Finally, we draw conclusions. Of course, it doesn’t stop there. The cycle repeats, because the conclusions that we’ve drawn may lead to new observations, which lead to new questions, and so on.
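
As a rough illustration only, and not something taken from the talk, the hypothesis, experimentation, analysis, and conclusion steps of that cycle could be sketched in Python as below. The names `Hypothesis`, `Conclusion`, and `run_cycle`, the error-rate metric, and the threshold values are all hypothetical assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    # Hypothetical example: "error rate stays at or below the threshold"
    statement: str
    threshold: float


@dataclass
class Conclusion:
    hypothesis: Hypothesis
    observed: float
    supported: bool


def run_cycle(hypothesis: Hypothesis,
              experiment: Callable[[], List[float]]) -> Conclusion:
    """Run one pass: experiment, analyze, conclude."""
    samples = experiment()                  # experimentation: collect measurements
    if not samples:
        raise ValueError("experiment produced no measurements")
    observed = sum(samples) / len(samples)  # analysis: mean of the measurements
    supported = observed <= hypothesis.threshold
    return Conclusion(hypothesis, observed, supported)  # conclusion


def fake_experiment() -> List[float]:
    # Stand-in for a real chaos experiment; returns made-up error rates.
    return [0.0, 0.02, 0.01]


if __name__ == "__main__":
    h = Hypothesis("Error rate stays at or below 5% during the experiment", 0.05)
    result = run_cycle(h, fake_experiment)
    print(result.supported, round(result.observed, 3))
```

A conclusion that does not support the hypothesis would feed the next round of observations and questions, which is what makes the process a cycle rather than a one-off test.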

How Does This Apply to Resilience?

How exactly does this apply to resilience, and more specifically, the resilience of our IT systems? At Vanguard, we’re using this adapted three-step cycle to test our systems to ensure that they are resilient. The first step in this process is the one that you’re likely least familiar with, and that’s the failure modes and effects analysis. We borrow this from the physical engineering disciplines like mechanical and hardware engineering. The idea of this meeting is to get the members of a technical team together to discuss all of the ways that a…
