Systems Recipes - Issue #10
Hello there and Happy New Year! We’re back for a new edition of Systems Recipes 🙏
This issue is focused on all the things that go wrong when the network is not reliable. With systems sprawling across multiple machines, it’s easy to run into unintended and sometimes inexplicable behaviour when communication is intermittent or the network is completely partitioned.
As always, we’ve included our top three pieces of reading on network failures. We hope you enjoy them!
The Network is Reliable
Network infrastructures can get extremely complex, especially when your network fabric is distributed and interconnected between regions and continents.
Data on how often these types of network failures occur is patchy at best, even to this day. The range of failures is also really wide, spanning typical hardware failure, misconfiguration, cable interference, shark attacks and much more.
This article by Bailis and Kingsbury walks through case studies from companies such as Google, Microsoft and Amazon on the types of network failures they’ve seen and, in some cases, how those failures have pushed them to develop software that is more partition tolerant.
The Network is Reliable
Network unreliability matters because it prevents us from having reliable communication, and that in turn makes building distributed systems really hard. (Fallacy #1 in Peter Deutsch’s ‘Eight fallacies of distributed computing’ is ‘The network is reliable’.)
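To make that concrete, here is a minimal sketch of our own (a toy in-process example, not something from the article; the FlakyServer name and failure rates are made up) showing why even a simple request/response exchange is hard over an unreliable network: when a call times out, the caller cannot tell whether the request was applied or lost, so a naive retry can apply the same operation twice.

```python
import random


class FlakyServer:
    """A toy in-process 'server' whose requests and replies can be dropped."""

    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        # The request may be lost before it reaches the server...
        if random.random() < 0.2:
            raise TimeoutError("request lost in transit")
        self.balance += amount
        # ...or the deposit may be applied but the *reply* lost on the way back.
        if random.random() < 0.2:
            raise TimeoutError("reply lost in transit")
        return self.balance


def deposit_with_retries(server, amount, attempts=5):
    # A naive client: on timeout it simply retries, even though the previous
    # attempt may already have been applied on the server.
    for _ in range(attempts):
        try:
            return server.deposit(amount)
        except TimeoutError:
            continue
    raise TimeoutError("gave up after retries")


server = FlakyServer()
try:
    deposit_with_retries(server, 100)
except TimeoutError:
    pass  # even after giving up, the deposit may or may not have been applied

# Run this a few times: the balance is sometimes 200 or more, because a
# 'lost reply' timeout still applied the deposit before the client retried.
print(f"intended balance: 100, actual balance: {server.balance}")
```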
Google's Networking Incident in June 2019
Google has frequently touted the power of its petabit-scale networking, building products like Google BigQuery and Spanner that sidestep traditional database and networking tradeoffs between consistency, availability and partition tolerance.
In June 2019, Google had an outage across Google Cloud, G Suite and YouTube due to a misconfiguration and a software bug in their datacenter networking control plane. This caused the networking systems to withdraw BGP sessions and drastically reduced network capacity.
Google Cloud Networking Incident #19009
On Sunday 2 June, 2019, Google Cloud projects running services in multiple US regions experienced elevated packet loss as a result of network congestion for a duration of between 3 hours 19 minutes, and 4 hours 25 minutes.
Whilst the incident report is light on internal system-level failures, it does highlight a key failure scenario: internal tooling (especially configuration and monitoring tools) competes for the same network capacity, so when there is a catastrophic failure, the ability to investigate and recover is also hampered by congestion.
Why are Distributed Systems so hard? A network partition survival guide!
This talk by Denise Yu does a fantastic job of explaining why distributed systems are so hard (and, as an added bonus, has lots of hand-drawn cats!).
The talk takes you all the way from the beginning, explaining why systems are increasingly distributed and covering the fallacies of distributed computing, before diving into systems like RabbitMQ and Kafka and how they manage the uncertainty created during a partition event (a small Kafka configuration sketch follows below).
Why are distributed systems so hard?
Distributed systems are known for being notoriously difficult to wrangle. But why? This talk will cover a brief history of distributed databases, clear up some common myths about the CAP theorem, dig into why network partitions are inevitable, and close out by highlighting how a few popular open source projects manage the uncertainty created during a partition event.
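As one concrete example of the kind of knob the talk refers to, here is a hedged sketch of our own (assuming the confluent-kafka Python client and a broker reachable at localhost:9092; it is not an example from the talk): a Kafka producer can be configured to treat a write as successful only once all in-sync replicas have it, choosing consistency over availability when a partition strikes.

```python
# A hedged sketch, assuming the confluent-kafka Python client and a broker
# reachable at localhost:9092 (both assumptions, not from the talk).
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                # wait for all in-sync replicas to confirm
    "enable.idempotence": True,   # retries cannot introduce duplicates
    "message.timeout.ms": 30000,  # fail the send rather than wait forever
})


def on_delivery(err, msg):
    # During a partition the broker may be unable to satisfy acks=all;
    # the send then fails here instead of being silently lost.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}]")


producer.produce("events", value=b"hello", on_delivery=on_delivery)
producer.flush()
```

On the broker side, the min.insync.replicas topic setting controls how many replicas must be caught up for such a write to be accepted; together, these options prefer rejecting writes over losing them when replicas are partitioned away.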
Many systems that claim to be partition tolerant have suffered correctness issues when certain kinds of network failures are introduced. Jepsen by Kyle Kingsbury has a fantastic list of detailed analyses of various open source databases and key/value stores that suffer such issues. These analyses always make for interesting reading!
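To give a flavour of the kind of issue these analyses uncover, here is a toy simulation of our own (a few lines of Python, not Jepsen itself, which is written in Clojure): two replicas of a set both acknowledge adds while partitioned from each other, and a naive last-write-wins merge after the partition heals silently drops acknowledged writes.

```python
import random


class Replica:
    """A toy replica that stores a whole set plus a version counter."""

    def __init__(self):
        self.elements = set()
        self.version = 0      # bumped on every local write

    def add(self, element):
        self.elements.add(element)
        self.version += 1
        return "ack"          # the client is told the add is durable


a, b = Replica(), Replica()
acked = set()

# While partitioned, clients are routed to whichever replica they can reach,
# and both sides keep acknowledging writes.
for element in range(20):
    replica = random.choice([a, b])
    if replica.add(element) == "ack":
        acked.add(element)

# The partition heals; the object is reconciled with last-write-wins,
# keeping whichever replica has the higher version and discarding the other.
winner = a if a.version >= b.version else b
surviving = winner.elements

lost = sorted(acked - surviving)
print(f"{len(lost)} acknowledged adds were lost after healing: {lost}")
```

This is similar in spirit to the lost-write results Jepsen has reported for stores that reconcile conflicting replicas with last-write-wins.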