I’ve been meaning to write this essay for a few weeks but I was afraid that people might think that I had gone crazy just by reading the title. After all, you can’t be very sane if you are talking about the fragility of the cloud. For many, cloud platforms are the most robust architecture ever built after the internet. And yet, they are still fragile.
In recent weeks, we have witnessed how a deployment error on the S3 service spread throughout regions of the AWS infrastructure, drastically affecting a significant part of the internet. A few days later, Microsoft Azure suffered a similar (although not as catastrophic) outage that affected a large number of its customers. Those examples showed how random, unpredictable events can have disastrous consequences for cloud infrastructures. That’s the definition of fragility.
As Big as the Internet
The fragility of the cloud can be compared to the fragility of banking systems. Both architectures are incredibly robust and so pervasively adopted that we can’t conceive of our lives without them. Yet in 2008 the US financial system erased over $2 trillion in assets and came to the verge of collapse after irresponsible bets on sub-prime mortgages permeated the system. Just a decade before, Long Term Capital Management (LTCM) was crushed by Russia’s unexpected decision to default on its debt, and its failure threatened to cause major losses across the US financial system, as most big banks had trading relationships with LTCM. Just as I write these lines, Italy is struggling with the bailout of its oldest bank to try to avoid the collapse of its financial infrastructure.
Coming back to the cloud, the analogy with the fragility of banking systems can help us explain a few concepts. Just like banks, platforms such as AWS are so widely adopted that they can be considered a second internet. As a result, a large percentage of the world’s online presence requires these underlying cloud platforms to exhibit near-perfect robustness. The inverse is also true (although scarier): an unexpected failure in cloud infrastructures can have disproportionate consequences for online businesses across the world. The thing is that, when it comes to centralized, globally adopted, complex and incredibly interconnected software infrastructures such as the cloud, robust is never robust enough.
It is obvious that cloud platforms such as AWS and Azure are robust. They might rank among the most robust software infrastructures ever created. However, they are also incredibly large and complex, which exponentially increases the impact of a potential failure. They are simply fragile. The type of fragility we are talking about in this article is not the kind you can address with more robust infrastructure and chaos monkey testing techniques. We are talking about resiliency to random, unexpected events: a human deployment error spreads through the AWS cloud; China decides to expel Amazon from the country and cuts access to AWS data centers; a conflict emerges between North and South Korea, destroying cloud data centers in East Asia; malicious Russian hackers (all hackers seem to be Russian these days ;) ) discover a vulnerability that can bring down core Azure services… Hopefully you get the idea.
What is the right balance between cost and architecting systems that are “more resilient” to those scenarios (by definition, we can’t plan for many of those events)? Most businesses can live knowing that, from time to time, they will be affected by cloud disasters. For others, that’s simply not an option.
At a recent Wall Street Journal conference, Andreessen Horowitz partner Peter Levine casually mentioned that he had been working on a thesis of a world in which cloud computing goes away. Levine’s argument was related to the emergence of decentralized, edge computing platforms that simply can’t be exposed to the fragility of the cloud. While I don’t subscribe to that theory (yet), I believe it includes some ideas worth exploring in order to build systems more resilient to the fragility of the cloud. More about that in a future post.