Before Testing Resiliency

In Big Data/Web Scale website operations, resiliency testing is getting a lot of attention, along with updated methods. Yes, traditional software unit and integration testing are still key to developing reliable software. But a growing number of companies are taking a more active approach to testing resiliency: they actually induce failures to test a live system’s reaction. Amazon, Google, and Etsy use GameDay exercises. Netflix employs a Simian Army.

I can see the reaction of most broadcasting managers if you were to propose GameDay testing an air chain. “Wait, are you asking if you can intentionally take us off the air to see if you built the thing right?” There would probably be some expletives sprinkled in there. We will get to GameDay testing as we rely less on real-time broadcast distribution and more on other distribution mechanisms.

In the meantime, most of us probably need to take a step or two back. We need to remind our managers that Operations and Engineering need to be drilling on our resiliency procedures. If we do not practice our response to a dead Master Control switcher, a production audio console lock-up, or a MAM server crash, how will we be able to respond instinctively to a real failure in a way that minimizes the impact? Worse, we will probably ignore our planned procedures, make up new ones as we go, and end up making the situation much worse before it gets better. The result is longer downtime and an outage with more noticeable effects on our viewers.

Remember, before you test your resiliency procedures, you need to practice them.