DDIA | Chapter 3 | Storage And Retrieval | Data Structures That Power Your Database

Files – the simplest storage structure a database can use

Let’s say our database stores data in a single file. The key could be an identifier and the value a JSON document. Every write simply appends a new record to the end of the file, so writes are very fast (O(1)). Reads, however, are slow: to look up a key, you have to scan the entire file from beginning to end and keep the last record for that key, which is O(n) in the size of the file. For this reason indexes were introduced.
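
To make this concrete, here is a minimal sketch of such an append-only log in Python. The file name and the comma-separated line format are assumptions for illustration, not anything a real database prescribes:

    # A minimal sketch of the append-only log, assuming string keys and
    # values that contain no commas or newlines.

    DB_FILE = "database.log"

    def db_set(key, value):
        # Writing is fast: just append one line at the end of the file.
        with open(DB_FILE, "a") as f:
            f.write(f"{key},{value}\n")

    def db_get(key):
        # Reading is slow: scan the whole file, keep the last match (O(n)).
        result = None
        with open(DB_FILE) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
        return result

Note that db_set never overwrites old records; updates work only because db_get keeps the last occurrence of the key.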

Indexes

An index is an additional data structure maintained by the database to speed up reads. The general idea is to keep some extra metadata on the side, which acts as a signpost and helps you locate the data you want.

Because an index is extra data that must be maintained on every write, indexes slow down writes to some extent. Databases therefore don’t index everything by default; instead, they let the application developer choose what to index, since the developer knows the application’s query patterns best.

Hash Indexes

Hash indexes are similar to the hash maps (dictionaries) found in most programming languages.

Continuing with the file-based storage example, the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found.

Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
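
A minimal sketch of this idea, building on the log format above; the `index` dict and the `a+` open mode are illustrative choices:

    # A sketch of a hash index over the append-only log. The dict maps
    # each key to the byte offset of its latest record; "a+" mode lets us
    # append writes while still seeking for reads.

    index = {}

    f = open("database.log", "a+")

    def db_set(key, value):
        f.seek(0, 2)                  # jump to the end of the file
        index[key] = f.tell()         # remember where this record starts
        f.write(f"{key},{value}\n")
        f.flush()                     # make the record visible to reads

    def db_get(key):
        offset = index.get(key)
        if offset is None:
            return None
        f.seek(offset)                # one seek straight to the record
        k, _, v = f.readline().rstrip("\n").partition(",")
        return v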

The hash map lives in memory, so one limitation is that all the keys must fit in RAM; beyond that, the approach breaks down unless there is another mechanism to handle the spillover (the values, however, can live on disk). These systems are therefore best suited to workloads where the value for each key is updated frequently but the number of distinct keys is manageable.

Another problem we might run into is running out of disk space as we keep appending to the same file. A good solution is to break the log into segments of a certain size: close a segment file when it reaches that size, and make subsequent writes to a new segment file. We can then perform compaction on these segments. Compaction means throwing away duplicate keys in the log and keeping only the most recent update for each key. Since compacted segments are small, we can merge them on a background thread while continuing to serve read and write requests as normal.
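
A sketch of compacting a single closed segment, under the same assumed line format; a real engine would also merge several segments together in the same pass:

    # Compacting one segment: keep only the most recent value per key.
    # Later records overwrite earlier ones in the dict, so last write wins.

    def compact(segment_path, output_path):
        latest = {}
        with open(segment_path) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                latest[k] = v
        with open(output_path, "w") as out:
            for k, v in latest.items():
                out.write(f"{k},{v}\n")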

Other issues that we can face during implementation – 

File format

CSV is not the best format for a log. It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without need for escaping).
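
As a sketch, a length-prefixed binary record might look like the following; the 4-byte big-endian header layout is an assumption for illustration, not a standard format:

    import struct

    # A length-prefixed binary record: a fixed 8-byte header holding the
    # key and value lengths, followed by the raw bytes (no escaping needed).

    def encode_record(key: bytes, value: bytes) -> bytes:
        return struct.pack(">II", len(key), len(value)) + key + value

    def decode_record(f):
        header = f.read(8)
        if len(header) < 8:
            return None                      # end of file
        key_len, val_len = struct.unpack(">II", header)
        return f.read(key_len), f.read(val_len)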

Deleting records

If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.

Crash recovery

If the database is restarted, the in-memory hash maps are lost. In principle, you can restore each segment’s hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key as you go along. However, that might take a long time if the segment files are large, which would make server restarts painful. Bitcask, for example, speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
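
A sketch of such a rebuild over one segment, again assuming the comma-separated line format; the file is read in binary mode so the offsets are byte-accurate:

    # Rebuild a segment's hash map after a restart: scan once, recording
    # the offset of the latest record per key (later offsets overwrite
    # earlier ones).

    def rebuild_index(segment_path):
        index = {}
        offset = 0
        with open(segment_path, "rb") as f:
            for line in f:
                key = line.split(b",", 1)[0]
                index[key] = offset
                offset += len(line)
        return index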

Partially written records

The database may crash at any time, including halfway through appending a record to the log. Including checksums with each record allows such corrupted, partially written records to be detected and ignored.
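
A sketch of the checksum idea: each record is prefixed with a CRC-32 of its payload, so a record cut short by a crash fails the check and can be skipped. The exact header layout is an illustrative assumption:

    import struct
    import zlib

    def encode_checked(payload: bytes) -> bytes:
        # Header: CRC-32 of the payload, then its length, both 4 bytes.
        return struct.pack(">II", zlib.crc32(payload), len(payload)) + payload

    def decode_checked(f):
        header = f.read(8)
        if len(header) < 8:
            return None                      # clean end of file
        crc, length = struct.unpack(">II", header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) != crc:
            return None                      # truncated or corrupted record
        return payload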

Concurrency control

As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.

SSTables (Sorted String Table)

In an SSTable, the key-value pairs within each segment file are sorted by key. Some advantages of SSTables over log segments with hash indexes – 

  1. Merging segments is simple and efficient, even if the files are bigger than the available memory: the approach is like the merge step of mergesort (see the sketch after this list).
  2. In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. Because the keys are sorted, a sparse in-memory index with only some of the keys is enough: to find a key, jump to the offset of the nearest indexed key before it and scan from there.
  3. Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk.
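
Advantage 1 can be made concrete with a short sketch. Segments are assumed here to be in-memory lists of (key, value) pairs sorted by key, ordered oldest to newest; a real merge would stream from files instead:

    import heapq

    # Merge sorted segments, mergesort-style. When the same key appears in
    # several segments, the value from the newest segment wins.

    def merge_segments(segments):
        # Tag each pair with a tie-breaker: newer segments get lower tags,
        # so for equal keys the newest record is yielded first.
        tagged = [
            [(key, age, value) for key, value in seg]
            for age, seg in enumerate(reversed(segments))
        ]
        merged = []
        for key, _, value in heapq.merge(*tagged):
            if not merged or merged[-1][0] != key:
                merged.append((key, value))  # keep only the newest per key
        return merged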

An SSTable-based storage engine works as follows – 

  • When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.

  • When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.

  • In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.

  • From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.

One drawback of this scheme: if the database crashes, the most recent writes (those in the memtable, not yet written out to disk) are lost. To handle this, we can keep a separate append-only log on disk to which every write is immediately appended. That log is not in sorted order, but it doesn’t need to be: its only purpose is to restore the memtable after a crash.
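
Putting the pieces together, here is a minimal sketch of this write path with a write-ahead log. The class name, the file naming scheme, and the plain dict standing in for a balanced tree are all illustrative assumptions; a real memtable keeps keys sorted as it goes, and the read path would use a sparse index rather than scanning each segment line by line:

    MEMTABLE_LIMIT = 4 * 1024 * 1024             # flush threshold (~4 MB)

    class MiniLSM:
        def __init__(self, wal_path):
            self.memtable = {}
            self.size = 0
            self.segments = 0
            self.wal = open(wal_path, "a")       # replayed after a crash

        def put(self, key, value):
            # 1. Append to the WAL first, so the write survives a crash.
            self.wal.write(f"{key},{value}\n")
            self.wal.flush()
            # 2. Then update the in-memory memtable.
            self.memtable[key] = value
            self.size += len(key) + len(value)
            if self.size > MEMTABLE_LIMIT:
                self.flush()

        def get(self, key):
            # Check the memtable first, then segments newest to oldest.
            if key in self.memtable:
                return self.memtable[key]
            for i in range(self.segments - 1, -1, -1):
                with open(f"segment-{i}.sst") as f:
                    for line in f:
                        k, _, v = line.rstrip("\n").partition(",")
                        if k == key:
                            return v
            return None

        def flush(self):
            # Writing keys out in sorted order produces an SSTable segment.
            with open(f"segment-{self.segments}.sst", "w") as f:
                for k in sorted(self.memtable):
                    f.write(f"{k},{self.memtable[k]}\n")
            self.segments += 1
            self.memtable.clear()
            self.size = 0
            self.wal.truncate(0)                 # memtable is now on disk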

LSM-tree (Log-Structured Merge-Tree)

An LSM-tree keeps a cascade of SSTables that are merged in the background. It continues to work well even when the dataset is much bigger than the available memory. Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential, the LSM-tree can support remarkably high write throughput.

B-Trees

The most widely used indexing structure is the B-tree. B-trees are the standard index implementation in almost all relational databases, and many nonrelational databases use them too.

Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. Each page can be identified using an address or location, which allows one page to refer to another—similar to a pointer, but on disk instead of in memory. 

One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here. The page contains several keys and references to child pages. Each child is responsible for a continuous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.

If you want to update the value for an existing key in a B-tree, you search for the leaf page containing that key, change the value in that page, and write the page back to disk (any references to that page remain valid). If you want to add a new key, you need to find the page whose range encompasses the new key and add it to that page. If there isn’t enough free space in the page to accommodate the new key, it is split into two half-full pages, and the parent page is updated to account for the new subdivision of key ranges.
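
A sketch of the lookup descent described above, with pages modeled in memory for illustration; in a real B-tree each child reference is an on-disk page address, and following it means reading that page from disk:

    import bisect

    class Page:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys            # sorted keys within this page
            self.children = children    # child pages (None in leaf pages)
            self.values = values        # values (only in leaf pages)

    def btree_lookup(page, key):
        # Descend from the root: the keys in each interior page mark the
        # boundaries between its children's key ranges.
        while page.children is not None:
            i = bisect.bisect_right(page.keys, key)
            page = page.children[i]
        # Leaf page: look for the key among the sorted leaf entries.
        i = bisect.bisect_left(page.keys, key)
        if i < len(page.keys) and page.keys[i] == key:
            return page.values[i]
        return None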

In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state.

Latches (lightweight locks) are used when writing to B-trees to avoid data inconsistencies when multiple threads access the tree concurrently.

B-tree optimizations

  1. Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) use a copy-on-write scheme.
  2. We can save space in pages by not storing the entire key, but abbreviating it.
  3. Additional pointers have been added to the tree. For example, each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.

Comparison of B-Trees and LSM Trees

  1. One write to the database resulting in multiple writes to disk over the course of the database’s lifetime is known as write amplification. LSM-trees typically have lower write amplification than B-trees because they sequentially write compact SSTable files. B-trees, on the other hand, write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself (and perhaps again as pages are split). There is also overhead from having to write an entire page at a time, even if only a few bytes in that page changed.
  2. LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees. 
  3. In a B-tree, each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments.

Secondary indexes

Secondary indexes are additional indexes created by application developers, often to help with joins. A secondary index is built from the key-value data, and its keys are not necessarily unique: several rows may appear under the same index entry.

Clustered Indexes

An index maps a key to a value, where the value is either the actual row or a reference to where that row is stored. In the latter case, the rows are generally stored in a heap file. This avoids duplication of data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.

In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within an index. This is known as a clustered index.

Covering Index

A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns, which stores some of a table’s columns within the index. This allows some queries to be answered by using the index alone (such queries are said to be covered by the index).

Concatenated index

It simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated). This is like an old-fashioned paper phone book, which provides an index from (lastname, firstname) to phone number. Due to the sort order, the index can be used to find all the people with a particular last name, or all the people with a particular lastname-firstname combination. However, the index is useless if you want to find all the people with a particular first name. Querying several columns at once matters especially for geospatial data (latitude and longitude), where a concatenated index over two dimensions falls short and specialized multi-dimensional indexes such as R-trees are commonly used instead.
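
A sketch of the phone-book example, using Python tuples as composite keys; the data is made up for illustration:

    import bisect

    entries = sorted([
        (("Abbot", "Alice"), "555-0101"),
        (("Abbot", "Zed"), "555-0199"),
        (("Baker", "Bob"), "555-0102"),
    ])
    keys = [k for k, _ in entries]

    # All people with last name "Abbot": a range scan over the sorted keys.
    i = bisect.bisect_left(keys, ("Abbot",))
    while i < len(keys) and keys[i][0] == "Abbot":
        print(entries[i])
        i += 1

    # Finding everyone with first name "Alice", however, needs a full scan:
    # the sort order gives no locality on the second column.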

In-memory database

Some in-memory stores, such as Memcached, are intended for caching use only, where it’s acceptable for data to be lost if a machine is restarted. Other in-memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.

When an in-memory database is restarted, it needs to reload its state, either from disk or over the network from a replica (unless special hardware is used).

Counterintuitively, in-memory databases are faster not mainly because they avoid reading from disk (the operating system caches recently used disk blocks in memory anyway) but because they avoid the overhead of encoding in-memory data structures in a form that can be written to disk.

In-memory databases provide data models that are difficult to implement with disk-based indexes. For example, Redis offers a database-like interface to various data structures such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.

Thanks for stopping by! Hope this gives you a brief overview of the data structures that power our databases. Eager to hear your thoughts; please leave a comment below and we can discuss.

