Sorted String Table in Rust

2020-09-15

Recently, I've been learning Rust. Then, I started implementing sorted string table in Rust in order to learn Rust deeply and also understand SSTable architecture, which is used by underlying storage engines in various modern database softwares.

I'm going to describe what I implemented as my first library in Rust.
GitHub repository: petitviolet/rsstable

Architecture Overview

This section describes what I've been implementing in Rust.
Sorted String Table(SSTable) instance has memtable and disktable in order to store data both in-memory and in-disk. Overview of components reside in SSTable are:

SSTable
- memtable
  - write-ahead-log: file
  - key-value store: std::collections::BTreeMap
  - deleted records store: std::collections::BTreeSet
- disktable
  - generation
    - index
      - skip index file
      - index file file
    - data: file
      - array of actual key-value

When writing a record which is a pair of key and value into SSTable, it writes a log in disk called write-ahead-log(WAL) to avoid losing data even if the process that have lots of records in memory is crashed. Then, insert key and value into memtable which stores the inserted data in memory. So, we'll be able to get and set key-value pair within memtable quickly.
However, we can't store all of data in-memory as much as ideally we want a database to store. Next, using disktable instead of memtable to store large volume data when its amount is larger than memory can store temporary.
Disktable is for managing data in disk as the name indicates. However, we know accessing disk is much slower than memory so that it implements indexing to make seeking data faster as much as possible.

Every after flush process that persist in-memory data to files, SSTable increments its "generation" number. Older version data have smaller generation.

How Read Works

SSTable stores data within itself so that we can read data as well, of course. This section describes how to read the records by a key from SSTable. When trying to fetch a record by a key, at first SSTable tries to get the value associated with the given key from memtable. If it returns a found record, we can obtain it as the result. Otherwise, then it needs to ask the disktable to find it. Disktable, at first, reads an index file in order to find the location, which is an offest in the data file of the record. And then it reads the data file that stores data composed of key-value pairs to get the record. If there is no record matches the given key, SSTable returns None, otherwise Some.

How Write Workds

Next, how to write records into SSTable. When writing a record into SSTable, first of all, it writes a log in WAL as described above, and then store it in the key-value store that memtable manages. If the capacity of memtable reaches an threshold, it requires to flush in-memory data into disk to make space in memory and have it persisted. After the memtable passes its records to the disktable to ask flush them, the memtable clears the WAL and its own key-value store, and the disktable starts flushing process. As the name of "SSTable" indicates, it stores strings in sorted fashion for some effectivity.

Implementation

The entrypoint of SSTable is SSTable struct.

pub struct SSTable {
    memtable: Box<dyn memtable::Memtable<Key = String, Value = String>>,
    disktable: Box<dyn disktable::Disktable>,
}

I know it looks wired since Memtable has 2 associated types, but Disktable doesn't. This is based on my first design draft that they could have 2 associated types as well for key and value, however, both should be String as the name is Sorted String Table designed to store String. Anyway, as described in the previous sections, Memtable stores Key-Value pairs within its own BTreeMap and deleted keys in BTreeSet to identify whith records were deleted recently, and Disktable stores String consists of key-value pair in filesystem using actual data file and index file.

The most difficult part is "flush" process that dumps in-memory data into file system, that is transferring data from Memtable contents to Disktable. In order to do it, at first, Memtable needs to pass its contents to Disktable with guarantee of keeping Memtable itself being able to accept clients' requests.
However, I couldn't find out a better way to use callback pattern which allows Memtable processing something after completion of writing data by Disktable. Instead of that, allowed Disktable to hold flushing data passed from Memtable while writing the data into files.
Actual codes are:

Because of Rust's memory management system, I assume that it's likely impossible to implement callback style to do that.

Memtable

Memtable has 3 components:

WAL(Write-Ahead-Log)
BTreeMap for underlying data
BTreeSet for deleted keys called tombstones

WAL layout

Memtable writes write-ahead-log into log files in the following format:

...
I\t<key>\t<value>\n
D\t<key>\n
...

As it stores both inserted key-values and deleted keys, the first char is either "I" or "D" to represent each line is for which of "I"nserterd or "D"eleted. During the process of recovery memtable contents when it has been crashed somehow, BufReader reads each line of WAL and construct memtable contents.

Disktable

About Disktable, as described in previous sections, it has 2 file types, index and data.

Data Layout

This section explains how data file stores key-value pairs as binary format looks like below.

..........[key length][value length][ key data  ][value data ]\0[key length][value length].......
<-offset-><-4 bytes--><--4 bytes---><--key_len--><-value_len->

Data are separated by \0(NULL character). When fetching a key-value pair from data file, it requires the offset from the beginnig of datafile for the key.

Index Layout

In order to make it efficient to get the offset of a key, index is so important.
Actually, index consists of skip index file and index file.
As index file stores all of the offsets of keys in data file, it's impractical to find a key from possibly too large amount of data.
In order to improve performance, I introduced index for index, named skip index. It stores sampled keys' offsets to point its index entry to get the actual offset of data file. The layout of skip index file looks like:

[key 0]\t[offset in index file]
[key N]\t[offset in index file]
[key2N]\t[offset in index file]
...

Skip index stores a key and offset in index file every N(>0) keys. We can get a near offset of the index entry for the key, and then seek the exact index entry from the offset obtained from skip index.

Future Idea

To fetch a data from SSTable, with the current implementation, it needs to read at least 3 files, skip index, index, and data file. Since index and data files are separated by generations, as the generation increases, more read will happen. Hence, compaction will be necessary to reduce file accesses.

Summary

Implementing SSTable is helpful to learn Rust, since it needs memory management, standard collections, interaction with filesystem, casting types, iterators, type abstractions, and so on. Although it turned out what this post describes is mostly about not its implementation, but the design of what I implemented... Anyway, it was a fun for me :)