Systems Recipes - Issue #2

Welcome to the second issue of Systems Recipes! This is a newsletter about Software and Systems Architecture. We aim to bring you stories that inspire us to be better at our craft.

Every two weeks, we’ll send our top 3 picks that we’re reading at the moment. 🤓 This edition has a bonus research paper at the end which we really enjoyed reading so you get 4 things for the extra long weekend.

We’ve been overwhelmed with the support at launch. 💖 Thanks to each and every one of you. 🙏

If you have read or watched anything interesting lately that you think will be a good fit for future issues or have any feedback, simply hit reply. You can also reach out on Twitter or send an email! 📧

Cloud Jewels: Estimating kWh in the Cloud - Etsy

Datacenters require a large amount of power to keep all those servers up and running optimally. Have you ever thought about how much power is needed to stream Tiger King on Netflix to everyone globally or all the machine learning required to curate your Instagram feed?

Most cloud providers have made big investments in optimizing datacenter efficiency. Both Amazon Web Services and Google Cloud have substantial efforts in achieving 100% renewable energy use.

As engineers developing for the cloud, it’s easy to lose sight of the rising energy use that comes with serving a growing customer base. Etsy talk about their efforts to calculate watt-hour coefficients that estimate energy use per vCPU and per terabyte of HDD/SSD storage.

If this is an area you have experience in or want to contribute to, they are also looking for more input and data to estimate the energy cost of volatile memory and inter-datacenter networking.
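To make the coefficient idea concrete, here is a minimal sketch of how such an estimate could be computed. The coefficient values, the `Usage` type and the `estimate_kwh` function are our own placeholders for illustration, not figures or code from Etsy's post; swap in the published coefficients before drawing any conclusions.

```python
from dataclasses import dataclass

# Placeholder coefficients in watt-hours per unit-hour. These are NOT the
# figures from Etsy's post; substitute the published values before use.
WH_PER_VCPU_HOUR = 2.0
WH_PER_TB_HOUR_HDD = 1.0
WH_PER_TB_HOUR_SSD = 1.5


@dataclass
class Usage:
    """Resource usage over some billing period (hypothetical shape)."""
    vcpu_hours: float
    hdd_tb_hours: float
    ssd_tb_hours: float


def estimate_kwh(usage: Usage) -> float:
    """Multiply each usage figure by its coefficient and convert Wh to kWh."""
    watt_hours = (
        usage.vcpu_hours * WH_PER_VCPU_HOUR
        + usage.hdd_tb_hours * WH_PER_TB_HOUR_HDD
        + usage.ssd_tb_hours * WH_PER_TB_HOUR_SSD
    )
    return watt_hours / 1000.0


if __name__ == "__main__":
    monthly = Usage(vcpu_hours=10_000, hdd_tb_hours=5_000, ssd_tb_hours=2_000)
    print(f"{estimate_kwh(monthly):.1f} kWh")  # prints "28.0 kWh"
```

The usage figures would typically come from a cloud billing export, which is why a per-unit coefficient is handy: it turns data you already have into a rough energy number.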

Cloud Jewels: Estimating kWh in the Cloud - Code as Craft

Writing a Time Series Database from Scratch - Fabian Reinartz

Monitoring a dynamic environment (such as applications running on Kubernetes) can result in very high cardinality of series. You get a high amount of series churn where some series become inactive and new ones come online as applications get deployed.

This post by Fabian from 2017 shows how Prometheus’ storage layer was designed for speedy querying, using mmap over lots of tiny chunk files backed by an index, and for really fast ingestion, building on the write-ahead log concept used by databases such as LevelDB and Cassandra.
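If the write-ahead log idea is new to you, the sketch below shows the general pattern: persist each sample to an append-only file before applying it in memory, then replay that file on startup. This is a toy of our own making (the `TinySeriesStore` class and its JSON-lines log are hypothetical), not Prometheus' actual WAL or chunk format.

```python
import json
import os
import tempfile
from collections import defaultdict


class TinySeriesStore:
    """Toy store: every sample is appended to a log file before it is
    applied to the in-memory series, so a restart can replay the log."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.series = defaultdict(list)  # series name -> [(timestamp, value)]
        self._replay()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        """Rebuild in-memory state from whatever the log already holds."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                record = json.loads(line)
                self.series[record["series"]].append(
                    (record["ts"], record["value"])
                )

    def append(self, series: str, ts: int, value: float) -> None:
        record = {"series": series, "ts": ts, "value": value}
        self._wal.write(json.dumps(record) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())  # on disk before the write is applied
        self.series[series].append((ts, value))

    def close(self) -> None:
        self._wal.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        store = TinySeriesStore(path)
        store.append("http_requests_total", 1, 10.0)
        store.close()
        # A fresh instance recovers the sample by replaying the log.
        recovered = TinySeriesStore(path)
        print(recovered.series["http_requests_total"])
        recovered.close()
```

Calling `fsync` on every sample is the simplest way to show the durability guarantee, but it is slow; real systems batch writes, which is part of what makes the post an interesting read.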

If you want more on Prometheus specifically, the documentation has a great section on Local Storage. If you are interested in the file format itself, check out the TSDB documentation.

Writing a Time Series Database from Scratch | Fabian Reinartz

Building a more accurate time service at Facebook scale - Facebook

Keeping time accurate is a challenging necessity. If you use a distributed datastore which relies on time, a bad time source can lead to an incorrect ordering of transactions which is not fun. 😓
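Here is a small illustration of how that can go wrong, assuming a datastore that resolves conflicts with last-write-wins on each node's local timestamp. The `Write` type, the skew values and the `last_write_wins` function are made up for the example and do not describe any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    node: str
    true_time_ms: int   # when the write really happened
    clock_skew_ms: int  # how far this node's clock is off
    value: str

    @property
    def timestamp_ms(self) -> int:
        """Timestamp the node stamps on the write, using its own clock."""
        return self.true_time_ms + self.clock_skew_ms


def last_write_wins(writes: list[Write]) -> Write:
    """Pick the write with the highest recorded timestamp."""
    return max(writes, key=lambda w: w.timestamp_ms)


if __name__ == "__main__":
    # Node "a" writes first, but its clock runs 50 ms fast.
    first = Write(node="a", true_time_ms=1_000, clock_skew_ms=50, value="old")
    # Node "b" writes 20 ms later with an accurate clock.
    second = Write(node="b", true_time_ms=1_020, clock_skew_ms=0, value="new")
    print(last_write_wins([first, second]).value)  # prints "old"
```

The later write loses because 50 ms of skew is larger than the 20 ms gap between the two writes, which is why shrinking clock error matters so much for this kind of system.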

Facebook detail everything from how they converted the analog signal from their atomic clocks into a digital pulse to the jitter differences they saw between ntpd and chronyd. They managed to get the jitter down from tens of milliseconds to sub-millisecond.

Facebook has made their time service public (see the ‘Public NTP design decisions’ section) with different geographic network paths so we can all benefit. If you are on AWS, you can use the AWS Time Sync service. Google also has an open NTP service backed by atomic clocks.

NTP: Building a more accurate time service at Facebook scale

Meaningful Availability - Tamas Hauer et al.

We have an added bonus this week! Say you are building an online store, everything looks great and you have 99% availability. It turns out that the missing 1% is every customer trying to check out and pay for their goods.

An availability metric needs to be useful for both end-users and engineers. As the paper aptly puts it, ‘A good availability metric should be meaningful, proportional, and actionable’. Having probes which replicate functionality at a high level won’t give you an objective measure of availability.

This paper goes beyond a simple success/failure ratio and introduces a metric called ‘windowed user-uptime’, which folds what each user actually experiences into your availability figure.
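As a rough sketch of how we read the idea: user-uptime is the total time users experienced the service as up, divided by the total time they were active, and the windowed variant reports the worst such ratio over every window of a given size. The function names, the bucketed input and the choice to treat an idle window as fully available are our own assumptions, not the paper's implementation.

```python
def user_uptime(up: list[float], down: list[float]) -> float:
    """Fraction of active user time that was uptime.

    `up[i]` and `down[i]` are user-minutes of uptime and downtime in time
    bucket i, already summed across all users.
    """
    total = sum(up) + sum(down)
    # Assumption: a period with no user activity counts as fully available.
    return sum(up) / total if total else 1.0


def windowed_user_uptime(up: list[float], down: list[float], window: int) -> float:
    """Worst user-uptime over every contiguous run of `window` buckets."""
    if len(up) != len(down):
        raise ValueError("up and down must have the same length")
    if not 1 <= window <= len(up):
        raise ValueError("window must be between 1 and the number of buckets")
    return min(
        user_uptime(up[i:i + window], down[i:i + window])
        for i in range(len(up) - window + 1)
    )


if __name__ == "__main__":
    # Six buckets of 100 user-minutes each; the third bucket is a full outage.
    up = [100, 100, 0, 100, 100, 100]
    down = [0, 0, 100, 0, 0, 0]
    for window in (1, 2, 3, 6):
        print(window, round(windowed_user_uptime(up, down, window), 3))
```

Looking at several window sizes side by side is the point: the overall figure here is about 83%, but the one-bucket window shows a period where nobody could use the service at all.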

Meaningful Availability | USENIX