DDIA | Chapter 3 | Storage And Retrieval | Column Oriented Storage

In the first post on chapter 3, we went over the data structures that power our databases, and in the second we got to know OLTP and OLAP. In this post, we will look at how OLAP databases use column-oriented storage and what its advantages are.

Transactional databases, which are generally relational databases, store and retrieve data row by row (indexing is done on rows). But in data warehouses, queries are more analytic and usually touch only a few columns. With row-based storage, we would have to load entire rows with hundreds of columns for each matching record, then parse them and filter out the data we need. All of this takes up a lot of compute resources. Storing and indexing by column therefore makes more sense in this case.

In column-based storage, values from the same column are stored together instead of values from the same row. If you want the full row, you can find the record's value in each column and combine them. While saving in columns, we need to make sure that rows are ordered the same way across all column files. If a query needs only two columns, then we just need to load those two columns, parse them, and filter them. This reduces the amount of data to be loaded from disk.
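The idea can be sketched in a few lines of Python. This is only an illustration (the column names mimic a fact table; none of this is a real storage engine): each column is its own list, an analytic query loads only the columns it needs, and a full row is reconstructed by taking the same position from every column.

```python
# Illustrative fact-table rows (made-up values, not a real dataset).
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 1},
    {"date_key": 140102, "product_sk": 74, "quantity": 3},
    {"date_key": 140103, "product_sk": 69, "quantity": 5},
]

# "Write" in column orientation: one list per column, all in the same row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytic query touching only two columns: total quantity per product.
totals = {}
for product, qty in zip(columns["product_sk"], columns["quantity"]):
    totals[product] = totals.get(product, 0) + qty

# Reconstructing a full row means picking the same position from every column.
row_1 = {name: values[1] for name, values in columns.items()}
```

Note that the aggregation loop never touches `date_key` at all; in a real engine that column file would simply not be read from disk.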

Column Compression

As we are storing data column by column, a lot of the values in each column repeat. So instead of storing them as is, we can compress them using techniques like bitmap encoding.

Say you have 100,000 rows in a product_sk column, but only 10 distinct products, so the values in this column repeat a lot. With bitmap encoding, instead of storing each value separately, we take each of the n distinct values and store a separate bitmap of 100,000 bits, where a 1 marks the rows that contain that value and a 0 the rows that don't. If n is small, these bitmaps can be stored with one bit per row. If n is bigger, most of the bitmaps will consist mostly of zeros (we say they are sparse), and they can additionally be run-length encoded to make them even more compact.

Column compression not only reduces amount of data to be loaded from disk, but also contributes in making efficient use of CPU cycles. For example, the query engine can take a chunk of compressed column data that fits comfortably in the CPU’s L1 cache and iterate through it in a tight loop (that is, with no function calls). A CPU can execute such a loop much faster than code that requires a lot of function calls and conditions for each record that is processed. Column compression allows more rows from a column to fit in the same amount of L1 cache. Operators, such as the bitwise AND and OR, can be designed to operate on such chunks of compressed column data directly. This technique is known as vectorized processing.
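As a rough illustration of operating on bitmaps directly, a query such as "product_sk = 69 AND store_sk = 3" reduces to a single bitwise AND of the two columns' bitmaps (the bitmaps below are made up, and store_sk is a hypothetical second column, not one from the example above):

```python
# Each int packs one bit per row; bit i (from the left) corresponds to row i.
product_is_69 = 0b110100  # rows 0, 1, 3 have product_sk = 69
store_is_3    = 0b101101  # rows 0, 2, 3, 5 have store_sk = 3

# Rows matching both predicates, computed without decompressing per-row values.
both = product_is_69 & store_is_3  # 0b100100: rows 0 and 3 match
```

A real engine does the same thing a machine word at a time over compressed chunks, which is exactly what makes the tight loop described above so fast.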

Sort Order in Column Storage

It is easiest to store the data in the order it is inserted, so to add new data we just append to the end of each column file. All the columns need to keep the same row order though, otherwise we won't know which value belongs to which row.

A technique some databases use is to maintain multiple sort orders. Since data is replicated to multiple machines anyway, it can be stored in a different sort order on each machine, and the sort order most effective for a given type of query can be used for fast results.

Having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store. But the big difference is that the row-oriented store keeps every row in one place (in the heap file or a clustered index), and secondary indexes just contain pointers to the matching rows. In a column store, there normally aren’t any pointers to data elsewhere, only columns containing values.

Writing to Column-Oriented Storage

Column-oriented storage, compression, and sorting all help to make read queries faster. However, they have the downside of making writes more difficult.

An update-in-place approach like a B-tree is not practical with compressed, sorted columns, so an LSM-tree-style approach is used instead. All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk. It doesn't matter whether the in-memory store is row-oriented or column-oriented. When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk.
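A minimal sketch of that write path, assuming a single sorted segment on disk and an in-memory buffer (the names `memtable`, `write`, and `flush` are illustrative, and real systems keep many segments and compact them):

```python
import bisect

memtable = []                                    # in-memory buffer, kept sorted by key
disk_columns = {"key": [1, 4], "qty": [10, 40]}  # sorted on-disk column "files"

def write(key, qty):
    # Writes go to the sorted in-memory structure first.
    bisect.insort(memtable, (key, qty))

def flush():
    # Merge the buffer with the on-disk columns and write new sorted files in bulk.
    merged = sorted(list(zip(disk_columns["key"], disk_columns["qty"])) + memtable)
    disk_columns["key"] = [k for k, _ in merged]
    disk_columns["qty"] = [q for _, q in merged]
    memtable.clear()

write(3, 30)
write(2, 20)
flush()
# disk_columns == {"key": [1, 2, 3, 4], "qty": [10, 20, 30, 40]}
```

Queries have to examine both the in-memory buffer and the on-disk columns, but that bookkeeping is hidden from the user by the query engine.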

Aggregation: Data Cubes and Materialized Views

Data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that queries use most often?

One way of creating such a cache is a materialized view. In a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query. The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries. When you read from a virtual view, the SQL engine expands it into the view’s underlying query on the fly and then processes the expanded query.

A common special case of a materialized view is known as a data cube or OLAP cube. It is a grid of aggregates grouped by different dimensions.
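To make that concrete, here is a toy two-dimensional cube: precomputed sums of a sales amount grouped by date and product (the fact table below is made up for illustration). Summing along one dimension of the cube answers a coarser query without ever touching the raw data again.

```python
# Hypothetical raw fact table: (date, product, net_price).
sales = [
    ("2024-01-01", "apple", 5),
    ("2024-01-01", "pear", 3),
    ("2024-01-02", "apple", 7),
]

# The "cube": SUM(net_price) precomputed for every (date, product) cell.
cube = {}
for date, product, price in sales:
    cube[(date, product)] = cube.get((date, product), 0) + price

# Collapsing the date dimension answers "total per product" from the cube alone.
apple_total = sum(v for (d, p), v in cube.items() if p == "apple")  # 12
```

The trade-off is the usual one for caches: the cube speeds up certain queries enormously, but it has to be kept up to date as the underlying data changes.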

Thanks for stopping by! Hope this gives you a brief overview of column-based storage. Eager to hear your thoughts and chat; please leave comments below and we can discuss.