Systems Recipes - Issue #9
Hey folks 👋! We hope you are keeping safe and sound. We’ve got a jam packed issue this time. We’re looking at a side of database technologies which we feel isn’t talked about as often, those that store and process analytical data.
These stores need to be able to ingest a ton of data and catalog it to be able to make data-driven inferences. We’ve chosen our pick of articles that best represent online analytical processing systems (OLAP) and hybrid transactional/analytical systems (HTAP).
Modern column-oriented database systems
Many analytical processing engines rely on column-based file storage rather than storing by row. The basic premise being that by storing data for the same column in the same set of files together, you can avoid pulling in redundant columns at query time and potentially get significantly better compression. There are many other techniques around query and join optimisation involved too.
This paper on The Design and Implementation of Modern Column-Oriented Database Systems by Adadi et al is quite a lengthy read but goes through all the in-depth architecture and implementation details. The paper also does some benchmarks comparing various column-oriented systems to their row-based counterparts.
The design and implementation of modern column-oriented database systems Abadi et al., Foundations and trends in databases, 2012
BigQuery under the hood
Google runs arguably one of the largest analytical stores via their BigQuery product that is offered as a Google Cloud product.
This post takes us through the various components that underpin BigQuery, from the execution engine that takes the SQL and prepares it to be run in a parallel fashion to distribution of computation and storage across thousands of nodes and disks. Even the network comes into play, Google is able to leverage their vast networks to make shuffling huge data sets easy and fast.
What does it take to build something as fast as BigQuery? We dive deep to answer this question. We also talk about what BigQuery hides under the hood, leveraging technologies like Borg, Colossus, Jupiter and Dremel.
We also found this more in depth post about Capacitor (the columnar storage format used by BigQuery) to be interesting. It shows how the storage engine adapts data storage based on the actual data coming in rather than having a one-size-fits-all approach.
BigQuery / Dremel engineer Mosha Pasumansky deep dives into BigQuery storage. He discusses how the new Capacitor storage engine improves efficiency.
Scaling out PostgreSQL for CloudFlare Analytics using CitusDB
If you want something like BigQuery whilst retaining control of your data, it’s worth looking at CitusDB. Citus is an extension to Postgres which distributes out processing of SQL queries to multiple Postgres nodes.
This post from Cloudflare in 2015 on scaling out their analytics infrastructure using CitusDB takes us through the process of developing the pipeline and eventually storing it in CitusDB. Cloudflare rapidly outgrew their single Postgres instance so looked into Citus to horizontally scale out their infrastructure. Citus adds in a Distributed Query Planner and Execution Engine to Postgres.
When I joined CloudFlare about 18 months ago, we had just started to build out our new Data Platform. At that point, the log processing and analytics pipeline built in the early days of the company had reached its limits.
We also really enjoyed this post from Microsoft about Using Citus to ingest and query Windows Diagnostics data, to the tune of 6M analytical queries per day.
How does the Windows team at Microsoft assess the quality of new software builds to decide if the next software update is ready to ship? And how do they scale out Postgres on Azure—using Citus—to power a petabyte-scale analytics dashboard, analyzing 20K types of metrics from over 800M Windows devices?
The Citus Blog is also a fantastic resource to learn more about operating and scaling Postgres.