
Let's Code Them Up!

  • DDIA | Chapter 3 | Storage And Retrieval | Transaction Processing or Analytics

    August 5th, 2023

    In the previous post, we discussed the data structures that power databases. In this post we will look at another type of database: the data warehouse.

    Online Transaction Processing – The access pattern involves looking up a small number of records by key, using an index. Records are inserted or updated based on the user's input. Because these applications are interactive, this access pattern became known as online transaction processing.

    Online Analytic Processing – An analytic query scans over a huge number of records, reads only a few columns per record, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user. This access pattern came to be known as online analytic processing.

    Pattern              | Transaction processing systems (OLTP)             | Analytic systems (OLAP)
    Main read pattern    | Small number of records per query, fetched by key | Aggregate over large number of records
    Main write pattern   | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream
    Primarily used by    | End user/customer, via web application            | Internal analyst, for decision support
    What data represents | Latest state of data (current point in time)      | History of events that happened over time
    Dataset size         | Gigabytes to terabytes                            | Terabytes to petabytes

    Data Warehousing

    A data warehouse is a collection of read-only data from the various OLTP databases in an organization, so that analysts can run as many queries as they want without affecting the operation of the transactional databases.

    Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL).

    Stars and Snowflakes: Schemas for Analytics

    Star schema – At the center of the schema is a fact table. Each row of the fact table represents an event that occurred at a particular time. Some of the columns in the fact table are attributes; other columns are foreign key references to other tables, called dimension tables. As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event.

    The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star. A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions.
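    To make this concrete, here is a tiny Python sketch of a star schema (all table contents are made up for illustration): a fact table of sales events pointing into a product dimension table, and a typical analytic query aggregating over the facts.

    ```python
    # Dimension table: product_id -> descriptive attributes
    dim_product = {
        1: {"name": "apple", "category": "fruit"},
        2: {"name": "bread", "category": "bakery"},
    }

    # Fact table: each row is one sales event, referencing a dimension by key
    fact_sales = [
        {"product_id": 1, "quantity": 3},
        {"product_id": 2, "quantity": 1},
        {"product_id": 1, "quantity": 2},
    ]

    # A typical analytic query: total quantity sold per product category
    totals = {}
    for row in fact_sales:
        category = dim_product[row["product_id"]]["category"]
        totals[category] = totals.get(category, 0) + row["quantity"]
    ```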

    Thanks for stopping by! I hope this gives you a brief overview of OLAP databases and data warehouses. Eager to hear your thoughts; please leave a comment below and we can discuss.

  • DDIA | Chapter 3 | Storage And Retrieval | Data Structures That Power Your Database

    July 30th, 2023

    Files – the simplest storage structure a database can use

    Let's say our database writes data to a file, where each record is a key-value pair and the value is a JSON document. Every time you want to write something, you simply append the new record to the end of the file. This makes writes fast but reads slow: to read a key, you must scan the entire file and keep the last value written for that key, which is O(n) in the size of the log. For this reason, indexes were introduced.
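    Here is a minimal Python sketch of this append-only approach (an in-memory StringIO stands in for the file on disk; all names are illustrative):

    ```python
    import io
    import json

    log = io.StringIO()  # stand-in for an append-only file on disk

    def db_set(key, value):
        log.seek(0, io.SEEK_END)
        log.write(json.dumps({"key": key, "value": value}) + "\n")  # O(1) append

    def db_get(key):
        result = None
        log.seek(0)
        for line in log:                  # O(n): scan every record in the log
            record = json.loads(line)
            if record["key"] == key:
                result = record["value"]  # keep scanning; a later write wins
        return result
    ```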

    Indexes

    An index is a separate data structure that databases maintain to speed up reads. The general idea is to keep some additional metadata on the side, which acts as a signpost and helps you locate the data you want.

    Because an index is additional data that must be kept up to date on every write, it slows down writes to some extent. That is why databases don't index everything by default: they leave the choice to the application developer, who knows the application's query patterns best.

    Hash Indexes

    Similar to hash maps in programming languages.

    Continuing with the file-based storage example, the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file, the location at which the value can be found.

    Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
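    A minimal Python sketch of such a hash index (an in-memory BytesIO stands in for the on-disk data file; the record format is illustrative):

    ```python
    import io

    data_file = io.BytesIO()  # stand-in for the on-disk data file
    index = {}                # in-memory hash map: key -> byte offset

    def db_set(key, value):
        data_file.seek(0, io.SEEK_END)
        offset = data_file.tell()
        data_file.write(f"{key},{value}\n".encode())
        index[key] = offset   # point the key at the record just written

    def db_get(key):
        if key not in index:
            return None
        data_file.seek(index[key])            # one seek...
        line = data_file.readline().decode()  # ...one read
        return line.rstrip("\n").split(",", 1)[1]
    ```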

    The hash map lives in memory, so one limitation is that all the keys must fit in RAM; beyond that, the scheme breaks down unless there is another mechanism to handle the spillover. Such engines are therefore best suited to workloads where the value for each key is updated frequently but the number of distinct keys is not too large.

    Another problem we might run into is running out of disk space as we keep appending to the same file. A good solution is to break the log into segments of a certain size: close a segment file when it reaches that size, and make subsequent writes to a new segment file. We can then perform compaction on these segments. Compaction means throwing away duplicate keys in the log and keeping only the most recent update for each key. Since compaction often makes segments much smaller, several segments can also be merged together on a background thread while we continue serving read and write requests from the old segment files.
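    The compaction and merging steps above can be sketched in Python like this (segment contents are made up; a segment is just a list of records in append order):

    ```python
    def compact(segment):
        # segment: list of (key, value) records in append order
        latest = {}
        for key, value in segment:   # later records overwrite earlier ones
            latest[key] = value
        return sorted(latest.items())

    def merge(segments):
        # segments ordered oldest -> newest; a newer segment's value wins
        merged = {}
        for segment in segments:
            merged.update(dict(segment))
        return sorted(merged.items())
    ```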

    Other issues that we can face during implementation – 

    File format

    CSV is not the best format for a log. It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without need for escaping).

    Deleting records

    If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
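    A sketch of tombstone handling during a merge, assuming a hypothetical TOMBSTONE marker value (real engines use a reserved byte pattern in the record format):

    ```python
    TOMBSTONE = "__tombstone__"   # illustrative deletion marker

    def db_delete(segment, key):
        segment.append((key, TOMBSTONE))  # deletion is just another append

    def merge_segments(segments):
        latest = {}
        for segment in segments:          # oldest first; the newest record wins
            for key, value in segment:
                latest[key] = value
        # discard keys whose most recent record is a tombstone
        return {k: v for k, v in latest.items() if v != TOMBSTONE}
    ```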

    Crash recovery

    If the database is restarted, the in-memory hash maps are lost. In principle, you can restore each segment’s hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key as you go along. However, that might take a long time if the segment files are large, which would make server restarts painful. 

    Partially written records

    The database may crash at any time, including halfway through appending a record to the log. Including checksums in the file allows such corrupted parts of the log to be detected and ignored.

    Concurrency control

    As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.

    SSTables (Sorted String Table)

    In an SSTable, we require that the key-value pairs are sorted by key. Some advantages of SSTables over log segments with hash indexes –

    1. Merging segments is simple and efficient, using a mergesort-like approach, even if the files are bigger than the available memory.
    2. In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. Because the keys are sorted, a sparse index with offsets for only some of the keys is enough: you jump to the nearest indexed key and scan from there.
    3. Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk.
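    Because every segment is sorted by key, merging them really is a mergesort-style pass. Here is a Python sketch using heapq.merge, assuming unique keys within each segment and segments ordered oldest to newest (the negated segment index makes the newest segment's value sort first for a duplicated key):

    ```python
    import heapq

    def merge_sstables(segments):
        # segments: sorted (key, value) lists, ordered oldest -> newest
        streams = (
            ((key, -i, value) for key, value in seg)
            for i, seg in enumerate(segments)
        )
        merged, last_key = [], object()
        for key, _, value in heapq.merge(*streams):
            if key != last_key:          # first hit comes from the newest segment
                merged.append((key, value))
                last_key = key
        return merged
    ```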

    An SSTable storage engine works as follows –

    • When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.

    • When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.

    • In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.

    • From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.

    The only drawback of this scheme is that recovery after a crash is not easy: the most recent writes are in the memtable and would be lost. To handle this, we can keep a separate append-only log on disk that is updated with every write; it is only used to restore the memtable after a crash.
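    The read/write path above can be sketched in a few lines of Python (a plain dict stands in for the balanced-tree memtable, a list of sorted lists for the on-disk SSTables, and the flush threshold is artificially small):

    ```python
    MEMTABLE_LIMIT = 2  # artificially small; real engines use a few megabytes

    memtable = {}       # stand-in for a balanced tree; we sort on flush
    segments = []       # on-disk SSTables, oldest first, each sorted by key

    def put(key, value):
        memtable[key] = value
        if len(memtable) > MEMTABLE_LIMIT:
            segments.append(sorted(memtable.items()))  # flush as an SSTable
            memtable.clear()

    def get(key):
        if key in memtable:                 # 1. check the memtable first
            return memtable[key]
        for segment in reversed(segments):  # 2. then segments, newest first
            for k, v in segment:
                if k == key:
                    return v
        return None
    ```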

    LSM-Tree (Log-Structured Merge-Tree)

    An LSM-tree basically keeps a cascade of SSTables that are merged in the background, and it continues to work well even when the dataset is much bigger than the available memory. Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential, the LSM-tree can support remarkably high write throughput.

    B-Trees

    The most common indexing method is the B-tree. B-trees are the standard index implementation in almost all relational databases, and many nonrelational databases use them too.

    Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. Each page can be identified using an address or location, which allows one page to refer to another—similar to a pointer, but on disk instead of in memory. 

    One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here. The page contains several keys and references to child pages. Each child is responsible for a continuous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.

    If you want to update the value for an existing key in a B-tree, you search for the leaf page containing that key, change the value in that page, and write the page back to disk (any references to that page remain valid). If you want to add a new key, you need to find the page whose range encompasses the new key and add it to that page. If there isn’t enough free space in the page to accommodate the new key, it is split into two half-full pages, and the parent page is updated to account for the new subdivision of key ranges.
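    Here is a toy Python sketch of a B-tree lookup, descending from the root page through key ranges to a leaf. The two-level tree is hand-built for illustration; real pages live on disk and hold hundreds of keys each.

    ```python
    import bisect

    # A page is a dict: leaf pages hold sorted keys with their values;
    # interior pages hold boundary keys and references to child pages.
    leaf1 = {"leaf": True, "keys": [10, 20], "values": ["a", "b"]}
    leaf2 = {"leaf": True, "keys": [30, 40], "values": ["c", "d"]}
    root = {"leaf": False, "keys": [30], "children": [leaf1, leaf2]}

    def lookup(page, key):
        while not page["leaf"]:
            # keys[i] is the lower bound of the range held by children[i+1]
            idx = bisect.bisect_right(page["keys"], key)
            page = page["children"][idx]
        i = bisect.bisect_left(page["keys"], key)
        if i < len(page["keys"]) and page["keys"][i] == key:
            return page["values"][i]
        return None
    ```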

    In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state.
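    The write-ahead logging discipline can be sketched like this (a Python list stands in for the append-only log file, and a dict for the tree pages; names are illustrative):

    ```python
    def apply(pages, change):
        key, value = change
        pages[key] = value     # mutate the "page" in place

    def write(pages, wal, change):
        wal.append(change)     # 1. append the change to the WAL first...
        apply(pages, change)   # 2. ...only then modify the tree pages

    def recover(wal):
        pages = {}
        for change in wal:     # replay the log after a crash
            apply(pages, change)
        return pages
    ```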

    Latches (lightweight locks) are used when writing to B-trees to avoid data inconsistencies when multiple threads access the tree concurrently.

    B-tree optimizations

    1. Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) use a copy-on-write scheme.
    2. We can save space in pages by not storing the entire key, but abbreviating it.
    3. Additional pointers have been added to the tree. For example, each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.

    Comparison of B-Trees and LSM Trees

    1. One write to the database resulting in multiple writes to disk over the course of the database's lifetime is known as write amplification. LSM-trees typically have lower write amplification than B-trees because they sequentially write compact SSTable files, whereas B-trees write every piece of data at least twice: once to the write-ahead log and once to the tree page itself (and perhaps again as pages are split). There is also overhead from having to write an entire page at a time, even if only a few bytes in that page changed.
    2. LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees.
    3. In a B-tree, each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments.

    Secondary indexes

    Secondary indexes are created by the application developer, often to speed up lookups and joins. A secondary index is built from the key-value pairs, and the indexed values are not necessarily unique.

    Clustered Indexes

    We search using keys; the value found can be either the actual row or a reference to where that row is stored. In the latter case, the place where rows are stored is known as a heap file. This avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.

    In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within an index. This is known as a clustered index.

    Covering Index

    A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns, which stores some of a table's columns within the index. This allows some queries to be answered by using the index alone.

    Concatenated index

    It simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated). This is like an old-fashioned paper phone book, which provides an index from (lastname, firstname) to phone number. Due to the sort order, the index can be used to find all the people with a particular last name, or with a particular lastname-firstname combination. However, the index is useless if you want to find all the people with a particular first name. (More general multi-dimensional indexes, which can query several columns at once, are commonly used for geospatial data.)
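    A sketch of a concatenated index in Python, using a sorted list of (lastname, firstname) tuples and binary search (the phone-book entries are made up):

    ```python
    import bisect

    # Index keyed on the concatenated (lastname, firstname) tuple
    entries = sorted([
        (("Jones", "Carol"), "555-0103"),
        (("Smith", "Alice"), "555-0101"),
        (("Smith", "Bob"), "555-0102"),
    ])
    keys = [key for key, _ in entries]

    def lookup(last, first):
        i = bisect.bisect_left(keys, (last, first))
        if i < len(keys) and keys[i] == (last, first):
            return entries[i][1]
        return None

    def by_lastname(last):
        # range scan over the sort order: every key starting with `last`
        lo = bisect.bisect_left(keys, (last,))
        hi = bisect.bisect_right(keys, (last, chr(0x10FFFF)))
        return entries[lo:hi]
    ```

    A lookup by first name alone has no such shortcut: the sort order gives it no help, so it would have to scan the whole index.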

    In-memory database

    Some in-memory databases, such as Memcached, are intended for caching use only, where it's acceptable for data to be lost if a machine is restarted. But other in-memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.

    When an in-memory database is restarted, it needs to reload its state, either from disk or over the network from a replica (unless special hardware is used).

    In-memory databases are faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk.

    In-memory databases provide data models that are difficult to implement with disk-based indexes. For example, Redis offers a database-like interface to various data structures such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.

    Thanks for stopping by! I hope this gives you a brief overview of the data structures that power our databases. Eager to hear your thoughts; please leave a comment below and we can discuss.

  • DDIA | Chapter 2 | Data Models and Query Languages | Graph-Like Data Models

    July 29th, 2023

    In previous posts on data models, we saw that if your data mostly has tree-structured one-to-many relationships, or no relationships between records at all, the document model is a good fit.

    Similarly, for many-to-many relationships, the relational model works for simpler cases; but as the connections within your data become more complex, it becomes more natural to model the data as a graph.

    A graph consists of vertices and edges: the objects are the vertices, and the edges represent the relationships between them.

    For example, Facebook uses a single heterogeneous graph as its data model: people, posts, comments, likes, etc., are vertices, and the relationships between them are edges.

    Property Graph

    In the property graph model, each vertex consists of:

    • A unique identifier
    • A set of outgoing edges
    • A set of incoming edges
    • A collection of properties (key-value pairs)

    Each edge consists of:

    • A unique identifier
    • The vertex at which the edge starts (the tail vertex)
    • The vertex at which the edge ends (the head vertex)
    • A label to describe the kind of relationship between the two vertices
    • A collection of properties (key-value pairs)

    You can think of a graph store as consisting of two relational tables, one for vertices and one for edges. The head and tail vertex are stored for each edge; if you want the set of incoming or outgoing edges for a vertex, you can query the edges table by head_vertex or tail_vertex, respectively.
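    That two-table view can be sketched directly in Python (the vertices and edges are made up):

    ```python
    # Two "tables": one for vertices, one for edges
    vertices = {
        1: {"name": "Lucy"},
        2: {"name": "Alain"},
        3: {"name": "Idaho"},
    }
    edges = [
        {"tail": 1, "head": 2, "label": "married_to"},
        {"tail": 1, "head": 3, "label": "born_in"},
    ]

    def outgoing(vertex_id):
        # "query the edges table by tail_vertex"
        return [e for e in edges if e["tail"] == vertex_id]

    def incoming(vertex_id):
        # "query the edges table by head_vertex"
        return [e for e in edges if e["head"] == vertex_id]
    ```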

    Some important aspects of this model are:

    1. Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.

    2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus traverse the graph—i.e., follow a path through a chain of vertices—both forward and backward. 

    3. By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.

    Cypher is a declarative query language for property graphs, created for the Neo4j graph database.

    Triple-Stores and SPARQL

    In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.

    The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:

    1. A value in a primitive datatype, such as a string or a number. In that case, the predicate and object of the triple are equivalent to the key and value of a property on the subject vertex. For example, (lucy, age, 33) is like a vertex lucy with properties {“age”:33}.
    2. Another vertex in the graph. In that case, the predicate is an edge in the graph, the subject is the tail vertex, and the object is the head vertex. For example, in (lucy, marriedTo, alain) the subject and object lucy and alain are both vertices, and the predicate marriedTo is the label of the edge that connects them.

    SPARQL is a query language for triple-stores using the RDF data model. (It is an acronym for SPARQL Protocol and RDF Query Language, pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar.

    As we discussed in a previous post, both the network model and the relational model can be used to solve many-to-many relationship problems. Are graph databases the second coming of CODASYL (the network model) in disguise?

    No. They differ in several important ways:

    • In CODASYL, a database had a schema that specified which record type could be nested within which other record type. In a graph database, there is no such restriction: any vertex can have an edge to any other vertex. This gives much greater flexibility for applications to adapt to changing requirements.

    • In CODASYL, the only way to reach a particular record was to traverse one of the access paths to it. In a graph database, you can refer directly to any vertex by its unique ID, or you can use an index to find vertices with a particular value.

    • In CODASYL, the children of a record were an ordered set, so the database had to maintain that ordering (which had consequences for the storage layout) and applications that inserted new records into the database had to worry about the positions of the new records in these sets. In a graph database, vertices and edges are not ordered (you can only sort the results when making a query).

    • In CODASYL, all queries were imperative, difficult to write and easily broken by changes in the schema. In a graph database, you can write your traversal in imperative code if you want to, but most graph databases also support high-level, declarative query languages such as Cypher or SPARQL.

     

    Datalog

    Datalog is used in a few data systems: for example, it is the query language of Datomic, and Cascalog is a Datalog implementation for querying large datasets in Hadoop.

    Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object).

    Datalog uses rules to tell the database about new predicates. These predicates aren't triples stored in the database; instead, they are derived from data or from other rules. Rules can refer to other rules, just as functions can call other functions or themselves recursively. In this way, complex queries can be built up a small piece at a time.
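    Here is a toy Python evaluation of a recursive Datalog-style rule: repeatedly apply the rules until no new facts can be derived. The naive fixed-point loop is purely illustrative of the idea, not how real Datalog engines evaluate queries.

    ```python
    # Facts: within(smaller_place, bigger_place)
    within = {("london", "england"), ("england", "europe")}

    # Rules (Datalog-style):
    #   within_recursive(X, Z) :- within(X, Z).
    #   within_recursive(X, Z) :- within(X, Y), within_recursive(Y, Z).
    def within_recursive(facts):
        derived = set(facts)
        changed = True
        while changed:                    # repeat until no new facts appear
            changed = False
            for x, y in facts:
                for y2, z in list(derived):
                    if y == y2 and (x, z) not in derived:
                        derived.add((x, z))
                        changed = True
        return derived
    ```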

    Thanks for stopping by! I hope this gives you a brief overview of graph data models. Eager to hear your thoughts; please leave a comment below and we can discuss.

  • DDIA | Chapter 2 | Data Models and Query Languages | Query Languages for Data

    July 17th, 2023

    There are broadly two styles of query language for databases –

    1. Imperative – the code tells the computer to perform certain operations in a particular order. Most programming languages are imperative.
    2. Declarative – you specify the pattern of the data you want, and the query optimizer takes care of the rest. SQL is declarative. Declarative languages are easier to work with and hide the database engine's implementation details; they also lend themselves to parallel execution across multiple machines.

    MapReduce Querying – 

    1. Neither fully imperative nor fully declarative: somewhere in between.
    2. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.
    3. Map and reduce are pure functions, which means they only use the data that is passed to them as input, they cannot perform additional database queries, and they must not have any side effects. 
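    The map/group/reduce pipeline can be sketched in plain Python (the animal-sighting observations are made up; both functions are pure, using only their input):

    ```python
    from functools import reduce

    # Hypothetical observations: (animal_family, number_seen)
    observations = [("Sharks", 3), ("Sharks", 4), ("Whales", 2)]

    # map: emit one (key, value) pair per document
    mapped = [(family, num) for family, num in observations]

    # shuffle: group all values emitted for the same key
    grouped = {}
    for key, value in mapped:
        grouped.setdefault(key, []).append(value)

    # reduce: fold each key's values into a single result
    totals = {key: reduce(lambda a, b: a + b, values)
              for key, values in grouped.items()}
    ```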

    Thanks for stopping by! I hope this gives you a brief overview of query languages for data. Eager to hear your thoughts; please leave a comment below and we can discuss.

  • DDIA | Chapter 2 | Data Models and Query Languages | Relational vs Document Model

    July 9th, 2023

    Most applications are built by layering one data model on top of another. The model at each layer abstracts away the complexity of the layers below it, which allows different groups of engineers to work together effectively.

    The best-known model is the relational model, which underlies SQL. It was first introduced in 1970, its adoption grew through the 80s and 90s, and it has been dominant in the software world ever since. Document and hierarchical models came along but were not able to replace it.

    Then NoSQL (Not Only SQL) databases came into the picture. They have gained popularity because –

    1. They scale well to large datasets and high write throughput
    2. They offer dynamic schemas rather than a restrictive relational schema
    3. They support a wide variety of specialized operations
    4. They are mostly open source

    Even so, since different applications have different requirements, SQL and NoSQL databases can be used alongside each other (polyglot persistence).

    The Object-Relational Mismatch 

    Impedance mismatch is the awkward disconnect between the objects in object-oriented programming languages and the tables of a database.

    SQL has a high impedance mismatch: a translation layer is needed between the application's objects and the storage model, which adds overhead to application development.

    NoSQL document databases are said to have a lower impedance mismatch than SQL.

    Consider the example of a LinkedIn profile. It has many sections: education, location, past experience, summary, languages, skills, and so on. When storing this information in SQL, there will be multiple tables connected through foreign keys: one table for users, another for their locations, another for their education, and so on. The application developer has to write complex queries and joins to store and retrieve the data, along with logic to convert between the application model and the database model.

    In a document model, by contrast, the entire profile can be stored as a single JSON document. The developer only needs to encode and decode that JSON; no joins or complex queries are needed.
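    For example, in Python the whole profile round-trips through one self-contained JSON document (the profile data is made up):

    ```python
    import json

    # Hypothetical profile stored as one self-contained document
    profile = {
        "name": "Jane Doe",
        "education": [{"school": "State University", "degree": "BS"}],
        "positions": [{"title": "Engineer", "company": "Acme"}],
    }

    stored = json.dumps(profile)   # write: encode the whole profile at once
    restored = json.loads(stored)  # read: one lookup returns everything, no joins
    ```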

    Many-to-One and Many-to-Many relationships 

     

    Some fields in a database can hold either a plain-text value or an ID. An ID is internal to the database; wherever the value is needed, the ID is used instead. This avoids duplication: if the value changes, there is no need to update it in many places across tables. This is called normalization, and it is standard practice in relational databases.

    For example, suppose we want to store the state where each person on a resume lives. We could store the full human-readable names, like California or Michigan, or abbreviations like CA or MI. If we use abbreviations (or IDs), we need to store the full names once, somewhere in the database, and point to them from the users table. This creates a many-to-one relationship: many people live in one state.

    This kind of normalization works well in relational databases because they support joins. Document databases traditionally have weak support for joins, so the developer has to emulate them in application code, shifting that logic to the application side.

    Even if the document model fits at the beginning, when little join logic is needed, new features tend to introduce more complex relationships between records over time, increasing the load on the application developer.

    Are Document Databases Repeating History?

    The problem of joins and many-to-many relationships has existed since the earliest databases (e.g., the hierarchical model used by IMS, one of the first databases). At that time, the relational model and the network model were proposed as solutions.

    The network model was like the hierarchical model's tree structure, except that a record could have multiple parents. The links between records were like pointers in a programming language: to find a record, you followed a chain of these links from a root record, called an access path. The main drawback was that, since a record could have multiple parents, programmers had to keep track of all the different access paths, which made querying and updating the data inflexible.

    The relational model, on the other hand, stores data as relations (tables): no complex tree structure, no access paths, just simple queries that fetch all rows or only those matching some condition. The query optimizer automatically decides in which order to execute the parts of a query and which indexes to use. Once a query optimizer is written for a database, every application built on that database benefits from it.

    Which data model leads to simpler application code?

    It depends on the data the application needs. If it requires many-to-many relationships and joins, the relational model is best, as it reduces the application code. If the data has a document-like structure (a tree of one-to-many relationships, typically loaded all at once), the document model is a better fit.

    Schema flexibility 

    Document databases are schema-on-read, while relational databases are schema-on-write. Schema-on-read means the database itself doesn't enforce any schema; instead, the application code assumes a structure when it reads the data. Because the schema is only interpreted at read time, document databases provide more flexibility.

    Because no schema is enforced at write time, changing a field in a document database is simple: just start writing the new field, and let the application code handle documents that still contain the old one. In a relational database, by contrast, changing a field requires migrating the data in the entire table.

    Data locality for queries

    Suppose you need to render an entire document on a web page. A document database is a good fit, because the whole document comes back from a single query. With a relational database, retrieving the same data would require multiple queries or joins, which is a lot of wasted work.

    However, if only a small part of the document is needed at a time, a document database is wasteful, since it loads the whole document on every read. So the choice again depends on the access pattern.

    Some relational databases also offer locality-oriented features, for example Google's Spanner and Oracle's multi-table cluster tables.

    Convergence of document and relational databases

    Most relational databases, like PostgreSQL and MySQL, now support XML/JSON columns and querying inside XML/JSON documents, which provides much of what document databases offer. Similarly, some document databases, like MongoDB, now support join-like queries. The relational and document models seem to be converging, which is good, as it gives developers more flexibility.

    Thanks for stopping by! I hope this gives you a brief overview of data models. Eager to hear your thoughts; please leave a comment below and we can discuss.

     

  • DDIA | Chapter 1 | Reliable, Scalable, And Maintainable Applications – Maintainability

    July 1st, 2023

    So far we have covered the reliability and scalability aspects of data systems; this post looks at maintainability.

    It is well known that the majority of the cost of a system lies in its maintenance: fixing bugs, adding new functionality, and so on. Depending on the complexity of the system, maintenance can be easy or difficult, so we should design and develop systems in ways that minimize maintenance effort. For this, we should try to follow three principles –

    1. Operability – Make it easy for operations teams to keep the system running smoothly. It means making routine tasks easy for operations team so they can focus on high value tasks. To make routine tasks easy, we could – 
      • Add good coverage of metrics and monitoring so the operations team has visibility into the system.
      • Integrate the system with standard tools and automate routine tasks.
      • Avoid dependency on any single machine.
      • Provide good documentation.
      • Provide good default behavior, with the freedom for the operations team to change it as needed.
    2. Simplicity – Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. When a system is initially designed, it is simple, but with each new feature, if not designed and implemented properly, system’s complexity increases. We should use tools like abstraction to avoid implementing complex systems. Abstractions provide good facade to the underlying complex system, thereby making integration with system easy. eg – SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes.
    3. Evolvability – Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. A very well designed system would make sure the system is adaptable and can be changed easily for new requirements.

    Thanks for stopping by! I hope this gives you a brief overview of the maintainability aspect of data systems. Eager to hear your thoughts; please leave a comment below and we can discuss.
