Systems Recipes - Issue #8


We’re at Issue #8 of Systems Recipes. We got some incredible feedback on the previous issue, which covered databases and data storage. Much of what we shared there was theoretical and high level, so today we want to share a few examples of working on database systems in practice. We’ve picked three interesting articles.

We haven’t gone as far as treating Excel as the best place for data in this edition, but we’d love to know if there are other critical flows running on Microsoft Excel.

The network is reliable

Kyle Kingsbury has been leading the way in analysing distributed systems (especially stateful stores), providing case studies and tools that identify failure scenarios which could lead to data corruption or data loss. If you haven’t checked out Kyle’s Jepsen series of posts, we strongly recommend it!

Whilst this post isn’t about any particular database, it collects many documented, practical examples of how processes, servers and networks can fail and what the ultimate consequences look like.

If you enjoy this one, you’ll also enjoy Coda Hale’s post ‘You Can’t Sacrifice Partition Tolerance’.

The network is reliable

This post is meant as a reference point–to illustrate that, according to a wide range of accounts, partitions occur in many real-world environments […] Network outages can suddenly arise in systems that are stable for months at a time, during routine upgrades, or as a result of emergency maintenance.

A 10x reduction in Apache Cassandra tail latency

Instagram relies heavily on Cassandra as a key-value store to power features such as the feed and messaging. They found that a lot of time was being spent in garbage collection, which led to spikes in latency.

Instagram replaced the Cassandra storage engine with RocksDB and saw a 10x reduction in tail latency. This involved decoupling Cassandra’s deeply ingrained storage layer from the rest of its functionality so that the storage engine became pluggable and RocksDB could slot in.

Open-sourcing a 10x reduction in Apache Cassandra tail latency | Instagram Engineering

Instagram maintains a 5–9s reliability SLA, which means at any given time, the request failure rate should be less than 0.001%. For performance, we actively monitor the throughput and latency of different Cassandra clusters, especially the P99 read latency.
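To make the numbers in that excerpt a little more concrete, here is a minimal Python sketch of the arithmetic behind a "five nines" target and a P99 latency figure. The function names and the latency samples are our own hypothetical illustrations and do not come from Instagram’s post. The percentile uses the simple nearest-rank method, which is only one of several common definitions.

```python
import math


def failure_budget(nines: int) -> float:
    """Maximum allowed failure fraction for an N-nines reliability target."""
    return 10.0 ** -nines


def percentile_nearest_rank(samples, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical read latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [5.0] * 98 + [60.0] * 2

budget = failure_budget(5)                       # five nines
p50 = percentile_nearest_rank(latencies_ms, 50)  # median latency
p99 = percentile_nearest_rank(latencies_ms, 99)  # tail latency

print(f"failure budget: {budget * 100:.3f}% of requests")
print(f"P50 = {p50} ms, P99 = {p99} ms")
```

With these made-up samples the median looks healthy while the P99 is dominated by the two slow requests, which is why tail latency is often watched separately from averages.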

Scaling PostgreSQL database to 1.2bn records/month

Gajus starts off by looking into some hosted cloud offerings before moving to self-hosted Postgres. There are some really practical takeaways on picking the right schema design and tuning Postgres flags to get the best database performance. The post also has a really interesting tale about why Postgres may not be the best fit for a concurrent job queue.

If you use Node.js, the follow-up post, ‘Processing large volumes of data safely and fast using Node.js and PostgreSQL’, is really interesting too!

Lessons learned scaling PostgreSQL database to 1.2bn records/month | Gajus Kuizinas

Choosing where to host the database, materialising data and using database as a job queue