
Designing Data Intensive Applications - Storage and retrieval (part 1)

·15 mins
Architecture Backend Design Software
Clément Sauvage
I like to make software
Designing Data Intensive Applications Book notes - This article is part of a series.
Part 3: This Article

Chapter 3: Storage and retrieval #

There is a big difference between storage engines that are optimized for transaction processing and those that are optimized for analytics.

Data structures that power your database #

Many databases use a log to store the changes to the data. The log is an append-only data structure, which is an ordered sequence of records.

In order to efficiently find the value for a particular key in our database, we need a different data structure called an index. Roughly explained, an index is a data structure stored on the side that allows you to locate the data you are looking for more quickly. If we want to search our data in several different ways, we may need several different indexes.

As an index is an additional structure derived from the primary data, removing it won’t affect the content of the database, but it will affect the performance of queries. Maintaining an index incurs overhead on write operations (as the index must be updated on every write). Because this is one of the most important tradeoffs in database design, most databases don’t index everything by default, but let the developer or DB admin decide which indexes to create.

Hash indexes #

A hash index is very useful for key-value stores. It can almost be interpreted as a hash map (or hash table), the familiar in-memory data structure. If our data store consisted of a file to which we can only append data, the simplest possible index would be an in-memory hash map where every key is mapped to the offset in the file where the value is stored. Although it sounds simplistic, it is a viable approach, used by Bitcask. Such an in-memory hash map is well suited to situations where the value for each key is updated frequently, and it is also very fast for reads.
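
As a sketch of this idea (not Bitcask’s actual format; the one-record-per-line layout here is a made-up simplification for illustration), an append-only log with an in-memory hash index could look like this:

```python
class HashIndexStore:
    """Append-only log file plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}               # key -> offset of the latest record
        open(path, "ab").close()      # create the log file if missing

    def set(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.tell()         # append position = current file size
            f.write(record)           # append-only: never overwrite in place
        self.index[key] = offset      # the hash map points at the newest record

    def get(self, key):
        with open(self.path, "rb") as f:
            f.seek(self.index[key])   # a single seek, thanks to the index
            line = f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value
```

Note that updating a key just appends a new record and moves the hash map entry; the stale record stays in the file until compaction reclaims it.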

One way to avoid running out of disk space with our “append only to a file” approach is to break the log into segments: when a segment reaches a certain size, we close it and start writing to a new one. We then keep a hash map in memory that maps each key to the segment (and offset) where it is stored. We can also use compaction, a process that rewrites segments, throwing away keys that have been overwritten or deleted and keeping only the most recent update for each key. Since compaction makes segments smaller, we can then merge several compacted segments into a new file and discard the old ones. This segment-and-compaction idea also underlies LSM-trees. Many details go into turning simple ideas like this into a working storage engine, for instance:

  • File format: a binary format is faster and simpler for our use case.
  • Deleting records: in log-based storage, we append a special deletion record (called a tombstone) that marks the key as deleted; the key’s old values are then discarded during compaction.
  • Crash recovery: if the database is restarted, the in-memory hash maps are lost. We could rebuild each segment’s hash map by scanning the segment, but that would be slow. One technique to speed up recovery is to store a snapshot of each segment’s hash map on disk, which can be loaded into memory quickly.
  • Partially written records: checksums allow us to detect and discard records that were only partially written (e.g. because the database crashed halfway through appending to the log).
  • Concurrency control: with multiple clients writing to the database, we need to ensure that writes are appended in a strictly sequential order. We could use locks, but a simpler and common approach is to have only one writer thread.
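
The compaction and merging steps described above can be sketched like this; representing segments as in-memory lists of records, with `None` standing in for a tombstone, are both illustrative simplifications:

```python
TOMBSTONE = None  # stand-in for a deletion marker

def compact(segment):
    """Keep only the most recent record for each key (tombstones included)."""
    latest = {}
    for key, value in segment:      # later records win over earlier ones
        latest[key] = value
    return list(latest.items())

def merge(segments):
    """Merge compacted segments (oldest first); drop tombstoned keys."""
    merged = {}
    for segment in segments:        # newer segments override older ones
        merged.update(segment)
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}
```

Dropping tombstones only at merge time matters: a tombstone must survive compaction of its own segment so that older segments’ values for that key are still suppressed.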

At first glance, an append-only log may seem wasteful compared with overwriting the file in place, but the append-only design has several advantages:

  • Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on HDDs, and still preferable on flash-based SSDs.
  • Concurrency and crash recovery are much simpler if segment files are append-only and immutable. For example, you never have to worry about recovering a file that was left half old and half new because a crash interrupted an overwrite.
  • Merging old segments avoids the problem of data files getting fragmented over time (data spread out across the disk, which would slow down read operations).

However, our hash table index also has limitations:

  • The hash table must fit in memory.
  • Range queries are not efficient, as the keys are not ordered.

SSTables and LSM-Trees #

A useful variation on our log-structured segments of key-value pairs is the SSTable (Sorted String Table). The main change is that we store the key-value pairs sorted by key, and we require that each key appears only once within each merged segment. The main advantages of this format are:

  • Merging segments is simpler, as we can merge them in a single pass, without having to keep the entire segment in memory. We can read the segments side by side to merge them, just like in mergesort.
  • In order to find a specific key in the file, we no longer need to keep all the keys in memory. A sparse in-memory index containing only some of the keys is enough: if we know the offsets of the keys “2000” and “2100”, then the key “2030” must lie somewhere between the two, so we can jump to the offset of “2000” and scan from there.
  • We can compress the blocks of records that lie between the entries of our sparse index, which reduces disk space usage and I/O bandwidth use.
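
The sparse-index lookup in the second point can be sketched with the standard `bisect` module (the index format here, a sorted list of `(key, offset)` pairs, is an assumption for illustration):

```python
import bisect

def find_block(sparse_index, key):
    """Return the file offset of the block that may contain `key`.

    `sparse_index` is a list of (key, offset) pairs sorted by key.
    Returns None if `key` sorts before the first indexed key.
    """
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1   # last indexed key <= `key`
    if i < 0:
        return None                          # key precedes the whole index
    return sparse_index[i][1]                # scan forward from this offset
```

The caller then scans the segment from the returned offset until it finds the key or passes where it would have been.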

One big issue at first glance is keeping the data sorted in the first place, since writes can arrive in any order. Maintaining a sorted data structure is easier in memory than on disk, and there are many well-known structures we can use, such as red-black trees or AVL trees. We then proceed as follows:

  • When we want to insert a key-value pair, we first write it to an in-memory buffer, usually called a memtable, kept in an efficient sorted data structure such as a red-black tree.
  • When the memtable reaches a certain size, we write it out to disk as a new sorted segment.
  • To fulfill a read request, we first try to find the key in the memtable; if it is not there, we look in the on-disk segments, from the most recent to the oldest.
  • From time to time, we run merging and compaction in the background to combine segments and discard overwritten or deleted values.
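
The steps above can be sketched as a toy read/write path; a plain dict stands in for the balanced-tree memtable, and segments live in memory rather than on disk (both simplifications for illustration):

```python
class SimpleLSM:
    """Toy LSM-style write/read path: memtable + sorted segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}            # stand-in for a red-black tree
        self.segments = []            # sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # flush: write out the memtable as a new sorted segment
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # 1. check the memtable
            return self.memtable[key]
        for segment in reversed(self.segments):  # 2. newest segment first
            for k, v in segment:                 # (linear scan for brevity)
                if k == key:
                    return v
        return None                              # key not present anywhere
```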

The only issue this scheme suffers from is that a crash loses the contents of the memtable, since they have not yet been written to disk. To avoid this, we can keep a separate append-only log on disk to which every write is immediately recorded; it is not sorted, and its only purpose is to restore the memtable after a crash.

This whole scheme is essentially what is known as a Log-Structured Merge-tree (LSM-tree).


LSM trees are a type of data structure used for efficient storage and retrieval of data in database systems, particularly in systems like LevelDB, RocksDB, and Apache Cassandra. They are optimized for write-heavy workloads, and are able to perform both random and sequential writes efficiently.

LSM-trees can be slow when looking up a key that does not exist in the database: we have to check the memtable, then every segment from the most recent back to the oldest, before we can conclude that the key is absent. A lot of work goes into optimizing a storage engine. For this missing-key case, we can for instance use Bloom filters, a probabilistic data structure that can tell us either that a key is definitely not in the database, or that it may be in the database. There are many other refinements applied in LSM-tree-based storage engines, but they all keep the main idea of the LSM-tree: a cascade of SSTables that are merged in the background. It is simple and effective, and it continues to work well even when the dataset is much bigger than the available memory. Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential, an LSM-tree can support remarkably high write throughput.
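
A minimal Bloom filter sketch follows; the bit count, hash count, and use of SHA-256 are arbitrary choices for illustration, not what any particular engine uses:

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: may_contain() can return false
    positives, but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                    # big int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos        # set each bit for this key

    def may_contain(self, key):
        # If any of the key's bits is unset, the key is definitely absent.
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

Before reading any segment for a key, the engine checks the filter; a negative answer lets it skip the segment entirely.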

B-Trees #

The log-structured indexes discussed so far are gaining popularity, but B-trees are still the most common indexing structure in databases: they have been the standard index implementation for years in most relational databases, and in many non-relational ones too. Like SSTables, B-trees keep key-value pairs sorted by key, which makes them efficient for lookups and range queries. The log-structured indexes we saw earlier break the database down into variable-size segments, typically several megabytes or more in size, and always write a segment sequentially. By contrast, B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.

Each page can be identified by an address, which allows pages to refer to one another, similar to pointers in memory but on disk. We use these page references to build a tree structure, with the root at the top and the leaves at the bottom. Each page is responsible for a range of keys, and the keys within each page are kept in sorted order. This data structure is similar to a binary search tree, except that each node is a page on disk (an ordered array of keys and references). The tree is balanced, so the number of pages that you need to read in order to find a key is logarithmic.

The number of references in a page is called the branching factor of the tree: the higher the branching factor, the shallower the tree, and the fewer pages need to be read. The branching factor is typically several hundred, so even for a very large number of keys the tree is only about four or five levels deep. The values are stored in the leaf pages, which are linked together in a linked list so that you can iterate over all keys in order. When we want to update the value for an existing key, we find the leaf page containing that key and update the value in place; the references to the page stay the same. When we want to add a new key, we find the leaf page whose key range encompasses the new key, and add the key to that page if there is enough free space. Otherwise, the leaf page is split into two half-full pages, and the parent page is updated to account for the new key ranges.

The tree remains balanced under insertions and deletions, so it always has a depth of O(log n). Most databases fit into a B-tree that is three or four levels deep, which means the number of disk reads needed to find a key is very small; a four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
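
The 256 TB figure can be checked with a quick back-of-the-envelope calculation: four levels with a branching factor of 500 address 500**4 leaf pages of 4 KB each.

```python
# Back-of-the-envelope capacity of a 4-level B-tree (numbers from the text).
branching_factor = 500
page_size = 4 * 1024                       # 4 KB pages, in bytes
levels = 4

leaf_pages = branching_factor ** levels    # pages addressable at the leaf level
capacity_bytes = leaf_pages * page_size
capacity_tb = capacity_bytes / 10 ** 12    # decimal terabytes

print(capacity_tb)  # → 256.0
```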

The cases where we split a page due to an insertion can be dangerous: we have to write the two new pages and also overwrite their parent page to update the references. If the database crashes after only some of those pages have been written, we end up with a corrupted index. To avoid that, typical B-tree implementations use an additional data structure called a write-ahead log (WAL), an append-only file on disk to which every modification of the B-tree must be written before it is applied to the pages of the tree itself. This way, if the database crashes, we can recover the B-tree by replaying the log.
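
The WAL idea can be sketched as follows; a dict stands in for the tree pages, and the one-record-per-line log format is a made-up simplification:

```python
import os

class WALIndex:
    """Every modification is appended (and fsynced) to a write-ahead log
    before it is applied to the in-memory stand-in for the tree pages;
    replaying the log on startup restores the index after a crash."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.tree = {}                      # stand-in for the B-tree pages
        if os.path.exists(wal_path):        # crash recovery: replay the log
            with open(wal_path, encoding="utf-8") as f:
                for line in f:
                    key, _, value = line.rstrip("\n").partition(",")
                    self.tree[key] = value

    def put(self, key, value):
        with open(self.wal_path, "a", encoding="utf-8") as f:
            f.write(f"{key},{value}\n")     # 1. durably log the change
            f.flush()
            os.fsync(f.fileno())            # force it down to disk
        self.tree[key] = value              # 2. only then apply it
```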

Comparing B-Tree and LSM-Trees #

As a rule of thumb, LSM-Trees are typically faster for writes, whereas B-Trees are faster for reads, especially if the whole index fits in memory.

A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the tree page (with more page writes when pages are split). An LSM-tree also rewrites data multiple times, due to repeated compaction and merging of SSTables. This effect of one database write turning into multiple writes on disk is called write amplification. In write-heavy applications, write amplification can be a real problem: the more the storage engine writes per incoming request, the fewer writes per second it can handle within the available disk bandwidth. LSM-trees can typically sustain higher write throughput, partly because they usually have lower write amplification than B-trees. LSM-trees also compress better, producing smaller files on disk, whereas B-tree storage engines leave some disk space unused due to fragmentation (when a page is split, some space in each new page remains free).

However, in LSM-trees, the compaction process can sometimes interfere with the performance of ongoing reads and writes. Also, at high write throughput, the disk’s finite write bandwidth must be shared between the incoming writes and the compaction threads, and compaction grows heavier as the database grows. If compaction is not configured to keep up with the incoming writes, unmerged segments accumulate until the disk fills up. An advantage of B-trees is that each key exists in exactly one place in the index, whereas in an LSM-tree a key may exist in multiple segments, with the most recent value being the one considered correct.

Other indexing structures #

So far, we have only discussed key-value indexes, which are like a primary key index in a relational database: the key uniquely identifies a row in a relational database, a document in a document database, or a vertex in a graph database. A secondary index is an additional index that we decide to create on any field other than the primary key, and we can usually create as many secondary indexes as we want. The sensible thing to do is to create the ones that speed up the read queries we perform most often, keeping in mind that every index adds work on every write. The main difference when building a secondary index is that the indexed key is not unique, so we either store a list of matching row references under each key, or make each entry unique by appending a row identifier to it. Either way, both B-trees and LSM-trees can be used to store a secondary index.
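
The “list of matching row references per key” option can be sketched like this (the table contents and the `city` field are made up for illustration):

```python
from collections import defaultdict

# Primary data: row id -> row (the key-value index we already had).
rows = {
    1: {"name": "alice", "city": "paris"},
    2: {"name": "bob",   "city": "paris"},
    3: {"name": "carol", "city": "lyon"},
}

# Secondary index on the non-unique "city" field:
# each indexed value maps to the list of matching row ids.
city_index = defaultdict(list)
for row_id, row in rows.items():
    city_index[row["city"]].append(row_id)

# A query by city now looks up the index instead of scanning every row.
parisians = [rows[rid]["name"] for rid in city_index["paris"]]
```

The alternative mentioned above would instead use unique keys like `("paris", 1)` and `("paris", 2)`, which fits naturally into a sorted index.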

Storing values within the index #

In an index, queries search by key, but the value can be one of two things: either the row itself, or a reference to the row stored elsewhere. In the latter case, the rows are stored in what is known as a heap file, where data is kept in no particular order. The heap file is the most common approach, since it avoids duplicating the rows when there are several indexes, and it is efficient when a value is modified without changing the key. In other cases, the extra hop from the index to the heap file is too much of a performance penalty for reads, so we prefer to store the row directly within the index; this is known as a clustered index. A compromise between a clustered index (storing full rows within the index) and a non-clustered index (storing only references) is a covering index, which stores some of the row’s columns within the index.

Multi-column indexes #

Indexed keys can also concatenate multiple columns, which is needed when a query searches on a combination of columns. This is called a concatenated index: it simply combines several fields into one key by appending one column to another. For queries over several columns at once, we can also use multi-dimensional indexes; for instance, a geospatial index is a multi-dimensional index, as it needs to search for points in a 2D space (latitude and longitude).
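
In Python, tuple comparison gives exactly this concatenated-key behaviour (the example data is made up):

```python
# Concatenated index: combine (last_name, first_name) into a single sort key.
people = [
    ("smith", "anna"),
    ("jones", "zoe"),
    ("smith", "bob"),
]

# Tuples compare field by field, just like concatenating the columns.
index = sorted(people)

# All "smith" entries are contiguous and sorted by first name,
# so a range scan on the first column is efficient.
smiths = [p for p in index if p[0] == "smith"]
```

Note the classic limitation: this index cannot efficiently answer a query by first name alone, since the second column only orders entries within each last name.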

Full-text search and fuzzy indexes #

In some cases, we may not have the exact data the query is looking for, or the query itself may be imprecise (for instance, if a user makes a typo in a search query). In those cases, we want to search for similar keys rather than exact keys; this is called fuzzy search. In full-text search, a search for one word is often matched not only against that word but also against its synonyms; plural forms are ignored, the number of occurrences of the word in the document is taken into account, and so on. Typos are handled through linguistic analysis of the documents. Apache Lucene is a popular library for full-text search (it uses an SSTable-like structure), and it is used in many databases (e.g. Elasticsearch).

Keeping everything in memory #

Most of the previously discussed indexes are disk-based. We use disks mostly because the data on them is durable, surviving a crash or power outage. The lower cost per GB of disk compared to RAM is also a great advantage, though it is becoming less of an issue as the cost of RAM decreases. We can therefore consider using purely in-memory databases.

Memcached is a popular in-memory key-value store intended to be used as a cache, where losing the data on a machine restart is acceptable. Other in-memory databases aim for durability, which can be achieved with different methods: battery-powered RAM, writing a log of changes to disk, writing periodic snapshots to disk, or replicating the in-memory state to other machines.

Counterintuitively, the performance advantage of in-memory databases is not primarily that they avoid reading from disk. Even a disk-based database may serve reads entirely from memory if it has enough, since recently used disk blocks are cached by the OS. The main advantage of in-memory databases is that they avoid the overhead of encoding and decoding data into an on-disk format, which accounts for a large part of the time spent on read and write operations.

Another advantage of in-memory databases is that they can provide data models that are difficult to implement with disk-based indexes. Redis, for instance, offers access to data structures such as queues and sets.
