Systems Recipes - Issue #10

Systems Recipes

Hello there and Happy New Year! We’re back for a new edition of Systems Recipes 🙏

This issue is focused on all the things that go wrong when the network is not reliable. With systems sprawling across multiple machines, it’s easy to run into unintended and sometimes unexplainable behaviour when communication is intermittent or completely partitioned.

As always, we’ve included our three top pieces of reading around network failures. We hope you enjoy it!


The Network is Reliable

Network infrastructures can get extremely complex, especially when your network fabric is distributed and interconnected between regions and continents.

Data on how often these types of network failures occur are patchy at best, even to this day. The range of failures are also really wide, from typical hardware failure, misconfiguration, cable interference, shark attacks and much more.

This article by Bailis and Kingsbury goes into some case studies from companies such as Google, Microsoft and Amazon on the types of network failures they’ve seen and in some cases, how they’ve impacted the software they’ve developed as a result to be more partition tolerant.

The Network is Reliable

Network reliability matters because it prevents us from having reliable communication, and that in turn makes building distributed systems really hard. (Fallacy #1 in Peter Deutsch’s ‘Eight fallacies of distributed computing‘ is ‘The network is reliable’).

Google's Networking Incident in June 2019

Google has frequently touted the power of their petabit scale networking, building products like Google BigQuery and Spanner which eschew traditional database and networking tradeoffs when it comes to Consistency/Availability/Partition Tolerance.

In June 2019, Google had an outage across Google Cloud, GSuite and YouTube due to a network misconfiguration and software bug on their datacenter software/networking control plane. This caused the networking systems to withdraw BGP sessions and drastically reduce networking capacity.

Google Cloud Networking Incident #19009

On Sunday 2 June, 2019, Google Cloud projects running services in multiple US regions experienced elevated packet loss as a result of network congestion for a duration of between 3 hours 19 minutes, and 4 hours 25 minutes.

Whilst the incident report is light on internal system level failures, it does highlight a key failure scenario. Internal tooling (especially configuration and monitoring tools) are competing for the same networking capacity, so when there is a catastrophic failure, the ability to investigate and recover is also hampered by congestion.

Why are Distributed Systems so hard? A network partition survival guide!

This talk by Denise Yu does a fantastic job explaining why distributed systems are so hard (and as an added bonus, has lots of hand drawn cats!).

The talk takes you all the way from the beginning, explaining why systems are more and more distributed, the fallacies of distributed computing and diving into systems like RabbitMQ and Kafka, explaining how they manage the uncertainty created during a partition event.

Why are distributed systems so hard?

Distributed systems are known for being notoriously difficult to wrangle. But why? This talk will cover a brief history of distributed databases, clear up some common myths about the CAP theorem, dig into why network partitions are inevitable, and close out by highlighting how a few popular open source projects manage the uncertainty created during a partition event.

Many systems that claim to be partition tolerant have suffered correctness when certain kinds of network failures are introduced. Jepsen by Kyle Kingsbury has a fantastic list of detailed analysis of various open source database and key/value stores which suffer correctness issues. These analyses always make for interesting reading!

Systems Recipes - Issue #9

Systems Recipes

Hey folks 👋! We hope you are keeping safe and sound. We’ve got a jam packed issue this time. We’re looking at a side of database technologies which we feel isn’t talked about as often, those that store and process analytical data.

These stores need to be able to ingest a ton of data and catalog it to be able to make data-driven inferences. We’ve chosen our pick of articles that best represent online analytical processing systems (OLAP) and hybrid transactional/analytical systems (HTAP).

As always, we’d love to hear your feedback. You reach out on Twitter or send an email 📧


Modern column-oriented database systems

Many analytical processing engines rely on column-based file storage rather than storing by row. The basic premise being that by storing data for the same column in the same set of files together, you can avoid pulling in redundant columns at query time and potentially get significantly better compression. There are many other techniques around query and join optimisation involved too.

This paper on The Design and Implementation of Modern Column-Oriented Database Systems by Adadi et al is quite a lengthy read but goes through all the in-depth architecture and implementation details. The paper also does some benchmarks comparing various column-oriented systems to their row-based counterparts.

The design and implementation of modern column-oriented database systems

The design and implementation of modern column-oriented database systems Abadi et al., Foundations and trends in databases, 2012

BigQuery under the hood

Google runs arguably one of the largest analytical stores via their BigQuery product that is offered as a Google Cloud product.

This post takes us through the various components that underpin BigQuery, from the execution engine that takes the SQL and prepares it to be run in a parallel fashion to distribution of computation and storage across thousands of nodes and disks. Even the network comes into play, Google is able to leverage their vast networks to make shuffling huge data sets easy and fast.

BigQuery under the hood

What does it take to build something as fast as BigQuery? We dive deep to answer this question. We also talk about what BigQuery hides under the hood, leveraging technologies like Borg, Colossus, Jupiter and Dremel.

We also found this more in depth post about Capacitor (the columnar storage format used by BigQuery) to be interesting. It shows how the storage engine adapts data storage based on the actual data coming in rather than having a one-size-fits-all approach.

Inside Capacitor, BigQuery’s next-generation columnar storage format

BigQuery / Dremel engineer Mosha Pasumansky deep dives into BigQuery storage. He discusses how the new Capacitor storage engine improves efficiency.

Scaling out PostgreSQL for CloudFlare Analytics using CitusDB

If you want something like BigQuery whilst retaining control of your data, it’s worth looking at CitusDB. Citus is an extension to Postgres which distributes out processing of SQL queries to multiple Postgres nodes.

This post from Cloudflare in 2015 on scaling out their analytics infrastructure using CitusDB takes us through the process of developing the pipeline and eventually storing it in CitusDB. Cloudflare rapidly outgrew their single Postgres instance so looked into Citus to horizontally scale out their infrastructure. Citus adds in a Distributed Query Planner and Execution Engine to Postgres.

Scaling out PostgreSQL for CloudFlare Analytics using CitusDB

When I joined CloudFlare about 18 months ago, we had just started to build out our new Data Platform. At that point, the log processing and analytics pipeline built in the early days of the company had reached its limits.

We also really enjoyed this post from Microsoft about Using Citus to ingest and query Windows Diagnostics data, to the tune of 6M analytical queries per day.

Architecting petabyte-scale analytics by scaling out Postgres on Azure with the Citus extension

How does the Windows team at Microsoft assess the quality of new software builds to decide if the next software update is ready to ship? And how do they scale out Postgres on Azure—using Citus—to power a petabyte-scale analytics dashboard, analyzing 20K types of metrics from over 800M Windows devices?

The Citus Blog is also a fantastic resource to learn more about operating and scaling Postgres.

Systems Recipes - Issue #8

Systems Recipes

We’re at Issue #8 of Systems Recipes. We got some incredible feedback on the previous issue talking about databases and data storage. Many things we shared were theoretical and high level, today we want to give a few examples of the working on database systems in practice. We’ve chosen our pick of three interesting articles.

We’ve not gone as far as to consider Excel as the best place for data for this edition but we’d love to know if there are other critical flows running on Microsoft Excel.


The network is reliable

Kyle Kingsbury has been steering the ship of analysing distributed systems (especially stateful stores) and providing case studies and tools to identify failure scenarios which could lead to data corruption or data loss. If you haven’t checked out Kyle’s Jepsen series of posts, we strongly recommend it!

Whilst this post isn’t strictly about any particular database in general, it highlights many practical documented examples of how processes, servers and networks can fail and what the ultimate consequences look like.

If you enjoy this one, you’ll also enjoy Coda Hale’s post ’You Can’t Sacrifice Partition Tolerance’.

The network is reliable

This post is meant as a reference point–to illustrate that, according to a wide range of accounts, partitions occur in many real-world environments […] Network outages can suddenly arise in systems that are stable for months at a time, during routine upgrades, or as a result of emergency maintenance.

A 10x reduction in Apache Cassandra tail latency

Instagram uses a lot of Cassandra as a key / value store to operate features such as the feed and the messaging features. They found that a lot of time was being spent in Garbage Collection which led to spikes in latency.

Instagram replaced the Cassandra storage engine with RocksDB and saw a 10x reduction in tail latency. This involved decoupling the ingrained existing storage layer from Cassandra’s functionality to make it more pluggable for RocksDB to slot in.

Open-sourcing a 10x reduction in Apache Cassandra tail latency | Instagram Engineering

Instagram maintains a 5–9s reliability SLA, which means at any given time, the request failure rate should be less than 0.001%. For performance, we actively monitor the throughput and latency of different Cassandra clusters, especially the P99 read latency.

Scaling PostgreSQL database to 1.2bn records/month

Gajus starts off with looking into some hosted cloud offerings before running self-hosted Postgres. There are some really practical take-aways on picking the right schema design and tuning Postgres flags to get the best database performance. The post also has a really interesting tale about where Postgres may not be the best fit as a concurrent job queue.

If you use NodeJS, the subsequent post on Processing large volumes of data safely and fast using Node.js and PostgreSQL is really interesting too!

Lessons learned scaling PostgreSQL database to 1.2bn records/month | Gajus Kuizinas

Choosing where to host the database, materialising data and using database as a job queue

Systems Recipes - Issue #7

Systems Recipes

Welcome to Issue #7 of Systems Recipes. We’re going down a persistent route today and talk about the art of storing data reliably and durably. Specifically, let’s talk about databases!

Instead of giving three links, we’re going to switch it up a little and talk about each of the things we find fascinating in three categories of databases. This includes various links to papers and articles. We hope you enjoy it 🙏


Relational Databases

Relational Databases have been around since the 70s and entire companies and industries have been built on the relational model of storing data.

Many concepts proposed originally by Codd, a researcher at IBM who published “A Relational Model of Data for Large Shared Data Banks” are relevant today in our conceptual thinking.

Databases are complex systems under the hood, involving many components that work together to make sure your data is durably stored. “Architecture of a Database System” takes you through all the components involved and how they fit in to the puzzle.

Sometimes, hardware or low level kernel implementations can have drastic (and sometimes very negative) consequences on durability. Postgres had a fascinating bug where an expectation of the fsync system call turned out to be completely the opposite. A really interesting relevant paper was presented in USENIX 2020 titled “Can Applications Recover from fsync Failures?”.

NoSQL and Key/Value Stores

NoSQL databases and Key Value Stores are a huge domain and have been synonymous with making data storage more scalable and highly available. A paper from Amazon titled “Dynamo: Amazon’s Highly Available Key-value Store” was the catalyst in this massively growing area of research. It influenced key systems like Cassandra and DynamoDB.

Many folks saw NoSQL as an answer to all their scaling woes. However, you can’t just take a relational schema and expect it work and scale in a NoSQL system. This excellent talk by Rick Houlihan talks about DynamoDB Advanced Design Patterns. This talk is well worth a watch, even if you don’t use DynamoDB.

Google published a paper (and a cloud service of the same name) on “Spanner: Google’s Globally-Distributed Database”. It leverages key infrastructure components like Google’s network and storage layers as well as time API called TrueTime to support externally-consistent distributed transactions.

Analytical Processing

When designing a performant data store, you want to minimize things like Disk Seeks and thus you want to store your data accordingly. If your workload often involves aggregating across columns (for example, in analytics workloads when you want to find an average across a column), it makes more sense to structure your data on disk as such.

These systems are often touted as data warehouses or systems geared for analytics processing (in contrast to transactional real-time processing).

C-Store: A Column-oriented DBMS was very influential in this field, it proposed the idea of storing data in a columnar format. This has significant advantages in being able to compress the data much more efficiently as well as making queries cheaper by only reading the columns needed for each query (as opposed to reading all the columns in a row oriented data store).

This paper by Google on “Dremel: Interactive Analysis of Web-Scale Datasets” talks about Google’s massively parallel database to process multiple terabytes of data is seconds by farming out queries to thousands of workers.

Systems Recipes - Issue #6

Systems Recipes

Welcome to Issue #6. We hope everyone is keeping safe and sound and washing your hands regularly.

We are dedicating this issue to networking. Networking has evolved a lot in the last decade and is undoubtedly one of the key parts of running systems. We will primary dig into load balancing this week.

If you have read or watched anything interesting lately that you think will be a good fit for future issues or have any feedback, simply hit reply. You can also reach out on Twitter or send an email! 📧


Introduction to modern network load balancing and proxying

This a comprehensive introduction by Matt Klein, the creator of the Envoy Proxy on how load balancing has evolved over the years and how modern systems approach it. It also offers an excellent comparison of L4 vs L7 load balancing and the tradeoffs involved in each of them.

Introduction to modern network load balancing and proxying | by Matt Klein | Envoy Proxy

It was brought to my attention recently that there is a dearth of introductory educational material available about modern network load balancing and proxying. I thought to myself: How can this be…

Internet-Scale Loadbalancing

This talk from SRECon 2019 Americas tries to demystify load balancing at Google scale tracing the path of a single packet as it makes it journey through various layers in the stack. It also talks about various load balancing techniques and the tradeoffs involved in each of them.

Keeping the Balance: Internet-Scale Loadbalancing Demystified | USENIX

Can you explain the entire path that an IP packet takes from your users to your binary? What about a web request? Do you understand the tradeoffs that different kinds of load balancing techniques make? If not, this talk is for you.

Dropbox's Migration from Nginx to Envoy

Dropbox conducted a large scale migration of their traffic infrastructure from Nginx to Envoy Proxy. This system handled millions of requests per second and had to be migrated seamless without downtime.

Dropbox’s motivation for the switch was to improve performance, maintainability and observability. The post goes into detail about the migration process, the troubles Dropbox faced during the process and the benefits they gained once the process was complete.

How we migrated Dropbox from Nginx to Envoy - Dropbox

In this blogpost we’ll talk about the old Nginx-based traffic infrastructure, its pain points, and the benefits we gained by migrating to Envoy. We’ll compare Nginx to Envoy across many software engineering and operational dimensions. We’ll also briefly touch on the migration process, its current state, and some of the problems encountered on the way.

Loading more posts…