blog.petitviolet.net

Building Relational Database on top of RocksDB in Rust

2021-05-25

DBRustRocksDB

Recently I developed a tiny database system called rrrdb in Rust on top of RocksDB, which is a well-known Key-Value store.
“rrrdb” came from Relational-database in Rust with RocksDB, though I’d not say rrrdb is a relational database since it doesn’t support having relations between objects at all. rrrdb has SQL interface to interact with its underlying storage layer built, RocksDB. rrrdb just provides SQL interface to interact with its underlying storage layer, which is RocksDB. This post describes what rrrdb looks like and its architecture.

GitHub repository: petitviolet/rrrdb

Overview

Copy-pasted from the README. Please ignore ”.unwrap() kindly.

let rrrdb = RrrDB::new(path)
rrrdb.execute("test_db", "CREATE TABLE users (id integer, name varchar)").unwrap();
rrrdb.execute("test_db", "INSERT INTO users VALUES (1, 'Alice')").unwrap();
rrrdb.execute("test_db", "INSERT INTO users VALUES (2, 'Bob')").unwrap();
let result = rrrdb.execute("test_db", "SELECT name FROM users WHERE id = 2").unwrap();
result == 
    OkDBResult::SelectResult(ResultSet::new(
        vec![
            Record::new(vec![
                FieldValue::Text("Bob".to_string()),
            ]),
        ],
        ResultMetadata::new(vec![
            FieldMetadata::new("name", "varchar")
        ])
    ))

I believe it does make sense on what it does since we can see the SQLs to be executed by a RrrDB instance.

Architecture

rrrdb has three components; SQL, Planner, and Executor, for being able to process SQL to interact with RocksDB.

SQL

First of all, it has to be able to process SQL statement strings that clients send. For that, SQL layer is composed of tokenizer and parser. Tokenizer tokenizes a given SQL string into a set of tokens, and then Parser build an AST from the set of tokens. As the result, it obtains the AST that we can build a plan for how to fetch actual data from the underlying storage layer.

Planner

The second layer is Planner. Planner expects AST to be passed that SQL layer produces, then build a plan to get actual data from RocksDB, which is KVS. For example, when we have a SQL select * from users where id = 1, the Planner generates a plan to fetch the single record which has id = 1. Take an another example, when we have a SQL select * from users, which is to get all of records in the users table, the Planner will produce a plan that iterate all of the key-value pairs of users table. Planner is obviously the place that there are lots of room to optimize the performance of this database system, however, as of now I didn’t do any optimization :)

Executor

Executor is the last but perhaps the most important one.
It feeds the plan and access the underlying storage layer in order to fetch data that the client wants. The storage layer treats data as in the form of array bytes, i.e. [u8] in Rust so that RocksDB can store the data without transforming data since rust-rocksdb expects [u8] for keys as well as values. Executor has many rooms to improve in terms of performance, such as data format, caching, bulk operations, etc. In addition to performance, we have lots of alternatives of how to implement transformation from SQL statement to KVS operations, though depending on the architecture of storage layer. If we’d like to have column-oriented database instead of row-oriented database, plans should change to access values of each column one by one of a single record instead of fetching all of columns at once.

Implementation

This section explains some of implementation details of rrrdb that I guess it’s interesting to share.

ColumnFamily

RocksDB offers ColumnFamily feature to support partitioning database logically. A ColumnFamily has a set of key-value pairs. Additionally, we can use Iterator over a single ColumnFamily so that we can get all of the records within a particular ColumnFamily without accessing unrelated data. In rrrdb, it uses ColumnFamilies to store data of a table. It means that fetching all of the records of a particular table is done by run iteration on a corresponding ColumnFamily. rrrdb’s Storage implements iterator method to obtain an iterator over an underlying ColumnFamily.

fn get_column_family(&self, namespace: &Namespace) -> DBResult<&ColumnFamily> {
    let cf_name = namespace.cf_name();
    self.rocksdb
        .cf_handle(&cf_name)
        .ok_or(DBError::namespace_not_found(namespace))
}

pub fn iterator<'a>(&'a self, namespace: &Namespace) -> DBResult<RecordIterator<'a>> {
    self.get_column_family(namespace).map(|cf| RecordIterator {
        db_iterator: self.rocksdb.iterator_cf(cf, rocksdb::IteratorMode::Start),
    })
}

As the above snippet shows, it calls RocksDB::iterator_cf to get an iterator on a ColumnFamily, then wraps it with an original Iterator type called RecordIterator so that it can apply something needed over each key-value pair.

impl<'a> Iterator for RecordIterator<'a> {
    type Item = (String, Box<[u8]>);

    fn next(&mut self) -> Option<Self::Item> {
        let n = self.db_iterator.next();
        n.map(|(key, value)| {
            let string_key = String::from_utf8(key.to_vec())
                .expect(&format!("Invalid key was found. key: {:?}", key));
            (string_key, value)
        })
    }
}

In this example, RecordIterator::next returns a pair of a stringified key and a raw value based on an assumption that all of the keys’ representations are string.

Macro to define suppoted keywords

I used macro to define the list of supported keywords in SQL as following snippet

macro_rules! define_keywords {
  ($($keyword:ident), *) => {
      #[derive(Debug, Clone, PartialEq, Eq, Hash)]
      pub(crate) enum Keyword {
        $($keyword), *
      }
      impl Keyword {
        pub fn find(s: &str) -> Option<Keyword> {
          match s {
            $(s if s.to_lowercase() == stringify!($keyword).to_lowercase() => { Some(Keyword::$keyword) },)
            *
            _ => None,
          }
        }
      }
      impl std::fmt::Display for Keyword {
        fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
          match self {
            $(Keyword::$keyword => { write!(f, "{}", stringify!($keyword)) }),
            *
          }
        }
      }
      pub(crate) const ALL: &[Keyword] = &[
        $(Keyword::$keyword), *
      ];
  };
}

define_keywords!(Create, Database, Table, Select, From, Where, Insert, Into, Values);

Thanks to Keyword::find, it’s so easy to determine if a given string is keyword or not.

match Keyword::find(s) {
    Some(keyword) => return_ok(Token::Keyword(keyword)),
    None => return_ok(Token::Word(s)),
}

Pattern Matching on Vec

One protip I could share around pattern matching on a Vec<A> is the following snippet, which expect 2 Tokens are in a given Vec<Token> and bind them into temporal values.

let mut v = vec![];

// v.push(token);

if let &[Token::Word(column_name), Token::Word(column_type)] = &&v[..] {
    results.push(ColumnDefinition::new(
        column_name.to_owned(), // Is this to_owned() avoidable...?
        column_type.to_owned(),
    ));
} else {
    return Self::unexpected_token("create table column definitions", &v, self.pos);
}

Serialize/Deserialize

rrrdb stores data within RocksDb in the form of Vec<u8>, but the data are basically instances of a Struct. Thus it’s better to be able to get/put struct instances directly without transforming them into Vec<u8> by rrrdb user side. For that, rrrdb depends on serde_json crate which provides abilities to deal with JSON in Rust so that it can automatically apply serialization and deserialization on given structs. An code snippet for getting objects from RocksDB with serialization;

pub fn get_serialized<T: DeserializeOwned>(
    &self,
    namespace: &Namespace,
    key: &str,
) -> Result<Option<T>, DBError> {
    self.get(namespace, key).and_then(|opt| match opt {
        Some(found) => match String::from_utf8(found) {
            Ok(s) => match serde_json::from_str::<T>(&s) {
                Ok(t) => Ok(Some(t)),
                Err(err) => Err(DBError::from(format!("Failed to deserialize: {:?}", err))),
            },
            Err(err) => Err(DBError::from(format!("Failed to deserialize: {:?}", err))),
        },
        None => Ok(None),
    })
}

Converting an obtained Vec<u8> into String with String::from_utf8, then call serde_json::from_str::<T> to deserialize into a T with using T: DeserializeOwned trait functionality.

As you see the above snippet, in my codes of rrrdb, there are so many deeply nested blocks, which makes code not easy to read…

Thoughts

Building a RDBMS-ish system was really fun for me and this was a quite good time to learn Rust and database even thought it has only limited features as well as very poor performance. If you’d like to dive deeper into RDBMS built on Key-Value database, TiDB could be the best option.