Twitter went down today and here’s an insightful comment on HN:
I think this is another good example of how we as an industry are still unable to assess risk adequately.
I’m fairly certain that the higher-ups in Twitter weren’t told “We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down.” Whoever was in charge of disaster recovery obviously didn’t really understand the risk.
Just like the recent outages of Heroku and EC2, and just like the financial crisis of 2008, which was laughably called a “16-sigma event”, it seems clear that actual risk assessment is poor. The way Heroku failed, where invalid data in a stream caused failure, and the way EC2 failed, where a single misconfigured device caused widespread failure, shows that the entire area of risk management is still in its infancy. My employer went down globally for an entire day because of an electrical grid problem: the diesel generators didn’t fail over properly because of a misconfiguration.
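To see why “16-sigma event” deserves the mockery, here’s a quick back-of-the-envelope check (a sketch; the normal-distribution assumption is precisely the modeling choice being ridiculed):

```python
from math import erfc, sqrt

# Probability that a standard normal variable exceeds 16 standard deviations.
# P(Z > x) = erfc(x / sqrt(2)) / 2, using the complementary error function.
p = erfc(16 / sqrt(2)) / 2
print(p)  # ~6e-58

# If markets produced one independent daily observation, the expected wait
# for a single 16-sigma move would be about 1/p days -- unimaginably longer
# than the age of the universe (~5e12 days).
print(1 / p > 5e12)  # True
```

That the 2008 crisis happened at all means the model behind the “16-sigma” label, not the universe, was wrong.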
You would think that after decades there would be better analysis and higher-quality “best practices”, but the field still appears rather immature. Is this because the assessment of risk at a company is left to people who don’t understand risk, and is there an opportunity for “consultants” who do understand it, kind of like security consultants?
The problem, of course, is that risk management will forever be in its infancy relative to risk-generating processes. That’s because the things that cause risk are where the money is made.
Arnold Kling has a nice way of summarizing a more enlightened approach to risk management: “make things easier to fix rather than harder to break”. Nassim Taleb would call the downside of making things harder to break ‘fragility’.
Consider nuclear technology: fantastic if the risks are managed properly. But the downside of a problem is so immense that there may be NO SUCH THING as an adequate safety system.
Remember what happened at Chernobyl, where the big accident happened during a rather benign systems test.
To get accurate results from the test, the operators turned off several of the safety systems, which turned out to be a disastrous decision.
They deliberately turned the systems off. Model that!