Storage and Retrieval Data Structures and Techniques

Before we delve into modern storage data structures, I want to clarify what this blog is not about:

Database systems
Hard drive systems

This blog is about current trends in storage techniques in crafting efficient and low-latency data structures.

Simple Database

The idea of storing data is simple. We need it to be persistent, which means we need it to be committed to the hard drive. That's not the only requirement, but it does give us a good starting point. Let's build upon this and make our first database:

echo "hello,world" >> database

If you look into the database, you should see "hello,world".

This is as simple as it can get. And to be honest, if it fulfills your requirements, then stick with it. Don't overcomplicate things.

Quick question: What is the time complexity if you want to retrieve a value?

Answer: It's O(n). We need to scan the whole document. Can we do better?

Better Version - Using Index for Search

Let's make our search better. What's the data structure that you throw at any problem when you don't know how to solve it? It's our good old hashmap. Let's keep database intact but let's make a small change. Let's keep an in-memory hashmap that stores the byte offset as the value corresponding to the key.

What is the byte offset?

It is the byte at which you are appending the value in the file. Let's consider the following:

# In memory hashmap
byte_offset_map = {"key1", "OXff0034","key2", "OXff0039","key3", "OXff0049"}
# In hard disk, the database looks like:
# "key1": "value1"
# "key2": "value2"
....

Here is how database looks:

key1 : value1

key 1 starts at byte OXff0034

key2 : value2

key 2 starts at byte OXff0039

You can directly get the byte offset of a particular key and the index into the database.

This is a pretty efficient storage system. It assumes that all the keys can be stored in RAM. This assumption will work fine for small data storage solutions.

What happens when the database file gets too big? Remember, you are just appending to it. One solution is to perform compaction. It is a technique in which the database file is broken down by removing duplicate keys in the log and new smaller segments are created out of it. We can also merge several segments together. It is made sure that the compaction and merge happens on a different thread so the writes and read requests are still served by a main thread using the older segment files in memory. Once merge and compaction happens, the older files are deleted.

Some Issues

How would you delete a record? Some databases append a special character called tombstone that tells the merging process to delete the record.
How will you recover from a crash? If the system crashes, the in-memory hashmap will be lost.
The database might crash and cause some records to be partially written.

Advantages of Hash Table

Extremely fast writes since you are just appending to a file
Crash recovery is easier since you are sure that at least you don't have a mismatch of data between different segments when a value was being overwritten since each segment is append-only and immutable.

Disadvantages of Hash Table

Must fit in memory.
Does not support range queries. What if I want to get all keys between "key1" and "key9999"?

SSTables and LSM-Trees

What if you build from the last approach and make a simple change? Just keep the segment files sorted by keys. A simple merge sort algorithm can be used to merge two segments. What if the same key appears in several segments? We keep the value from the most recent segment.

SSTables stands for Sorted String Table and LSM stands for Log-Structured Merge-Tree

The major problem SSTables solves is that it supports range queries. Assume this hypothetical data structure:

key      byte offset         sorted segments -> alphabetically sorted
hello    0Xff0089    ->      hello:"some value"(start at 0Xff0089 )| help: "another value" | helt : "some another value"
hi       0Xff0179    ->      hi:"some value"                       | hii: "another value"  | hit : "some another value"
jelly    0Xff0234            and so on

Let's say I am looking for a word hulk. I will look at my byte offset dictionary and will know that it must lie between 0Xff0089 and 0Xff0234. While searching in between those addresses, I further compress and merge these segments to make it more efficient.

Constructing SSTables

The following steps are followed:

During a write, we add the key and value to an in-memory hashmap (Red-black or AVL trees).
When the hashmap becomes too big, we write it out to disk as an SSTable file. Writes continue to happen during this operation.
During a read request, find the key in the hashmap. If not, go down the SSTable hierarchy based on time.
Run a merging and compaction process at intervals.

Advantages

It can support very high write throughput because it writes data to a hashmap sequentially.
Supports range queries.

B-Trees

B-Trees are one of the more common data structures employed in the industry. Like LSM-Trees, they keep their key and value sorted. The only difference being that they store fixed-size blocks in memory (around 4 KB called page) rather than variable-size blocks stored by LSM-Trees. This emulates how the hardware stores its memory as well and because of that gives us better speed on reads.

A B-Tree starts with a root page that consists of pointers to other children pages. These pointers refer to memory on disk not in RAM. Whenever you get a read query, you start from the root of the tree, and then traverse down to its children. The process looks like:

Look up user id = 25 -> will lie between node 20 and 30

Assume the root page to look like: [10, 20 (Its children contain 20<=key<=30), 30, 40, 50]

Go to the child page to node having key 20 and look at that. Let's say we get: [22,23,27]
We know 25 will lie in the child page of 23 since 23 <= 25 <= 27.
We continue this process until we reach the end or find the node.

If you want to update

the value of a key, you first find the node, change its value, and write the page back to disk.

If you want to add a new key that does not lie in the range of the current page, for example, if you want to add 60 to [10,20,30,40,50] and there is no space left in the page then the page is broken down into two full pages to accommodate the new value.

B-Trees with n keys have a depth of O(log n)

A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB!

Reliability

Any write operation in a B-Tree overwrites a page on disk with new data. It does not change the memory address of the page itself. It just changes the pointers to memory stored in the page. That is very different from LSM-Trees which only supported append-only operations.

This can also be considered a downside of B-Trees because what if the database crashes in the middle of a write? You might get dangling references to memory. To make it more reliable, B-Trees implement write-ahead log (WAL) ¹.

Issues with B-Trees

If a range query is requested from the database, it is possible that B-Trees have to read all the pages from the disk which can be quite inefficient because there is no rule that enforces B-Trees to store pages sequentially in memory. In contrast, LSM-Trees rewrite large segments of the storage in one go during merging so it's easier for them to keep sequential keys closer to each other on disk.

TL;DR of B-Trees vs LSM-Trees

B-Trees	LSM-Trees
Slower writes (Might need to create two pages if one page is full and needs to write to WAL ¹)	Faster writes
Faster reads	Slower reads because they have to check various data structures (sorted segments)

When LSM-Trees are Better than B-Trees

Write Amplification: Number of writes to the disk due to a single write to the database

LSM-Trees have lower write amplification because they sequentially write data to SSTable. B-Trees, on the other hand, have to do in-memory operations so they might need to read several pages, change pointers of parents, children, etc.
LSM-Trees can be compressed better and use less memory, B-Trees, on the other hand, might cause fragmentation because of unused space on the page due to it having fixed size.

When B-Trees are Better than LSM-Trees

Compaction and merging processes reduce the performance of the ongoing reads and writes in LSM-Trees.
If write throughput is more than the compaction speed, you will get unmerged segments on the disk until you run out of space. It will reduce the read throughput as well since there will be more SSTables to read from.
In B-Trees, a key exists in only one place, whereas in LSM-Trees, there can be duplicate keys in different segments.
Implementing concurrency is easier in B-Trees. You just need to implement simple locks.

That was it. I hope it was informative. There is no good or bad choice for data storage. You just need to consider the tradeoffs and make an informed decision.

-G

Footnotes

This is an append-only file to which every B-Tree modification must be written before it can be applied to the pages of the tree. It is used to restore the tree if some crash or corruption happens. ↩ ↩²