
    Reading Note - Spanner paper

    2020-08-01

    DBReading

    This post is my reading note on the Spanner paper. Distributed SQL databases and NewSQL are a trend in our industry, and I think Spanner, Google's globally-distributed database, marks the beginning of that era.

    The paper is available here.

    Yugabyte's blog posts describe what distributed SQL and NewSQL are. If you are interested, please refer to them.

    Note that this memo may include misunderstandings and may even be wrong in places.

    Motivation

    The main motivation for Google to build Spanner was managing cross-datacenter replicated data. They already had Bigtable as distributed storage infrastructure, but it was not sufficient for complex, evolving schemas and for applications that need strong consistency in the presence of wide-area replication. To solve this, Spanner offers semi-relational tables on top of a Bigtable-like versioned key-value store and uses Paxos as its consensus algorithm.

    Overview

    • Spanner is a database that shards data across many sets of Paxos state machines, which are located in datacenters all over the world.
      • Paxos is a well-known consensus algorithm. Raft, another well-known one, is used in CockroachDB and other NewSQL databases.
    • In order to manage cross-datacenter replicated data
      • Other Google products used Bigtable, but it was difficult to manage complex, evolving schemas with it.
      • Alternatively, some applications chose Megastore, which offers a semi-relational data model and synchronous replication, but with poor write throughput.
    • Bigtable-like versioned key-value store
      • Semi-relational datamodel
      • Each version is automatically timestamped with its commit time
      • Old versions are removed by garbage collection
    • Provides transactions
    • SQL-based query language
    • Replication can be dynamically configured by applications
      • Which datacenter contains which data
      • To control read latency: how far data is from its users
      • To control write latency: how far replicas are from each other
    • Externally consistent reads and writes, and globally-consistent reads across the database at a timestamp
      • Atomic schema updates, all at global scale
    • Timestamps in Spanner reflect serialization order
    • TrueTime API
      • Exposes clock uncertainty
        • How uncertain Spanner's timestamps are
      • The larger the clock uncertainty, the slower Spanner gets, since it waits out the uncertainty
      • Using GPS and atomic clocks

    Terminologies

    • Universe master
      • singleton
      • a console that displays status information about all the zones
    • Placement driver
      • singleton
      • handles automated movement of data across zones
        • timescale of minutes
      • periodically communicates with the spanservers to find data that needs to be moved
    • Universe
      • a Spanner deployment
      • something like separate test and production universes
    • Zone
      • unit of administrative deployment
      • unit of physical isolation
        • one or more zones in a datacenter
      • has one zonemaster
      • has location proxies
      • has between one hundred and several thousand spanservers
    • Zonemaster
      • assigns data to spanservers
    • Spanserver
      • serves data to clients
    • Location proxy
      • used by clients to locate the spanservers assigned to serve their data
    • Tablet
      • data structure
        • similar to Bigtable's tablets, but Spanner assigns timestamps to data while Bigtable doesn't
      • a set of mappings of the form (key, timestamp) -> value (see the sketch after this list)
        • called mappings
      • state is stored in a set of B-tree-like files and a write-ahead log
        • all on a distributed file system called Colossus
      • a spanserver implements a single Paxos state machine on top of each tablet
        • the state machine also stores its metadata and log within its tablet
    • Paxos group
      • a set of replicas
    • Transaction manager
      • to support distributed transactions
      • participant leader, participant slaves
    • Coordinator
      • coordinates two-phase commit
      • one of the participant Paxos groups' leaders becomes the coordinator
      • coordinator leader, coordinator slaves
    • Directory
      • the unit of data placement (geographic-replication properties)
        • the unit at which applications specify replication and locality
        • when data is moved between Paxos groups, it's moved directory by directory
          • a 50 MB directory can be moved in a few seconds
          • online data movement is available
      • a set of contiguous keys that share a common prefix
      • the paper notes that a better term might be bucket; "directory" is a historical artifact
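
    To make the tablet's mapping concrete, here is a minimal Python sketch of a versioned key-value store: each write is stamped with a commit timestamp, a read at a timestamp returns the newest version at or before it, and old versions can be garbage-collected. The class and names are my own illustration, not Spanner's actual interfaces.

        from collections import defaultdict

        class Tablet:
            """Toy sketch of a tablet's data: a mapping (key, timestamp) -> value."""

            def __init__(self):
                self._versions = defaultdict(dict)  # key -> {commit timestamp: value}

            def write(self, key, commit_timestamp, value):
                # Each version of the data is stamped with its commit timestamp.
                self._versions[key][commit_timestamp] = value

            def read(self, key, read_timestamp):
                # Return the newest version at or before the read timestamp.
                older = [ts for ts in self._versions[key] if ts <= read_timestamp]
                return self._versions[key][max(older)] if older else None

            def gc(self, horizon):
                # Garbage-collect versions older than the horizon, keeping the
                # newest one at or before it so reads at the horizon still work.
                for versions in self._versions.values():
                    older = sorted(ts for ts in versions if ts <= horizon)
                    for ts in older[:-1]:
                        del versions[ts]

        tablet = Tablet()
        tablet.write("users/42/albums/7", commit_timestamp=10, value="holiday")
        tablet.write("users/42/albums/7", commit_timestamp=20, value="holiday-v2")
        print(tablet.read("users/42/albums/7", read_timestamp=15))  # -> "holiday"

    In this picture, a directory would be the set of contiguous keys sharing a common prefix, e.g. everything under "users/42/".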

    Implementations

    Spanner uses Paxos to support replication, and currently logs every Paxos write twice: once in the tablet's log and once in the Paxos log. Writes initiate the Paxos protocol at the leader, while reads access state directly from any replica that is sufficiently up-to-date. At each leader replica, a spanserver implements a lock table that uses two-phase locking for concurrency control, and a transaction manager to support distributed transactions. If a transaction involves only one Paxos group, the lock table and Paxos provide transactionality; otherwise the participating leaders coordinate to perform two-phase commit.
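
    As a rough illustration of the lock table described above, here is a minimal two-phase-locking sketch with shared locks for reads and exclusive locks for writes, where all of a transaction's locks are released together at the end. This is generic 2PL with made-up names, not Spanner's actual lock table (which, among other things, is tied to long-lived Paxos leaders).

        class LockTable:
            """Toy lock table: shared locks for reads, exclusive locks for writes."""

            def __init__(self):
                self._shared = {}     # key -> set of transaction ids holding a read lock
                self._exclusive = {}  # key -> transaction id holding the write lock

            def acquire_read(self, txn_id, key):
                if key in self._exclusive and self._exclusive[key] != txn_id:
                    return False  # conflicting writer; caller must wait (or be wounded)
                self._shared.setdefault(key, set()).add(txn_id)
                return True

            def acquire_write(self, txn_id, key):
                other_readers = self._shared.get(key, set()) - {txn_id}
                if other_readers or self._exclusive.get(key, txn_id) != txn_id:
                    return False  # conflicting readers or another writer
                self._exclusive[key] = txn_id
                return True

            def release_all(self, txn_id):
                # Two-phase locking: all locks are released at commit/abort,
                # i.e. the shrinking phase happens all at once.
                for readers in self._shared.values():
                    readers.discard(txn_id)
                for key in [k for k, t in self._exclusive.items() if t == txn_id]:
                    del self._exclusive[key]

        locks = LockTable()
        locks.acquire_read("txn-1", "users/42")
        print(locks.acquire_write("txn-2", "users/42"))  # False: txn-1 holds a read lock
        locks.release_all("txn-1")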

    The unit of data placement is the directory, but Spanner may shard a directory into multiple fragments that may be served from different Paxos groups, i.e. potentially different servers. Movedir, the task that moves directories between groups, actually moves fragments rather than whole directories.

    Spanner's data model looks relational, but it is not purely relational. Every table within a database within a universe is required to have an ordered set of one or more primary-key columns, and a row's primary-key values define its key. This makes Spanner look more like a key-value store than a purely relational database.
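
    To see why the required primary-key columns make the model feel like a key-value store, here is a small sketch where a row's ordered primary-key values form the key and the remaining columns form the value. The table and column names are hypothetical.

        # Hypothetical table with PRIMARY KEY (user_id, album_id).
        PRIMARY_KEY = ("user_id", "album_id")

        def row_to_kv(row: dict) -> tuple:
            """Map a row to (key, value): the key is the ordered primary-key values,
            the value is the remaining non-key columns."""
            key = tuple(row[col] for col in PRIMARY_KEY)
            value = {col: v for col, v in row.items() if col not in PRIMARY_KEY}
            return key, value

        key, value = row_to_kv({"user_id": 1, "album_id": 7, "name": "holiday"})
        print(key, value)  # -> (1, 7) {'name': 'holiday'}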

    TrueTime API

    Spanner uses the TrueTime API

    • to guarantee the correctness properties around concurrency control
    • to implement some features, like externally consistent transactions, lock-free read-only transactions, non-blocking reads, etc.

    The TrueTime API exposes three methods (a sketch follows the list):

    • TT.now(): returns a TTinterval [earliest, latest] that is guaranteed to contain the absolute current time
    • TT.after(t): true if t has definitely passed
    • TT.before(t): true if t has definitely not passed
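
    Below is a minimal sketch of what this interface could look like, assuming a fixed uncertainty bound around the local clock. The real TrueTime derives its bound from the GPS/atomic-clock time masters and worst-case clock drift, so the constant here is purely illustrative.

        import time

        EPSILON = 0.007  # assumed clock uncertainty in seconds (illustrative only)

        class TTInterval:
            def __init__(self, earliest: float, latest: float):
                self.earliest = earliest
                self.latest = latest

        def tt_now() -> TTInterval:
            # The absolute current time is guaranteed to lie within the interval.
            t = time.time()
            return TTInterval(t - EPSILON, t + EPSILON)

        def tt_after(t: float) -> bool:
            # True only if t has definitely passed.
            return tt_now().earliest > t

        def tt_before(t: float) -> bool:
            # True only if t has definitely not arrived yet.
            return tt_now().latest < t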

    TrueTime is implemented with a set of time master machines per datacenter and a timeslave daemon per machine. Most masters have GPS receivers with dedicated antennas; the rest, called Armageddon masters, have atomic clocks.

    Concurrency Control & Transactions

    A list of operations that Spanner supports:

    • Read-Write transaction
      • standalone writes are implemented as read-write transactions
      • retried internally, so clients don't need their own retry loops
    • Read-Only transaction
      • non-snapshot standalone reads are implemented as read-only transactions
      • retried internally
      • must be pre-declared as read-only to distinguish them from read-write transactions
      • executed at a system-chosen timestamp without locking
      • reads can be served by any replica that is sufficiently up-to-date
    • Snapshot Read
      • a read in the past, executed without locking
      • can be served by any replica that is sufficiently up-to-date
      • timestamp (see the sketch below)
        • the client can specify a timestamp
        • or provide an upper bound on the desired timestamp's staleness

    Transactional reads and writes use two-phase locking.
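
    Under the TrueTime sketch above, the timestamp choices for the read-only transaction and snapshot read listed above could look roughly like this. It reuses tt_now() from the earlier sketch and only shows the simplest valid choice; Spanner can actually negotiate older timestamps with the Paxos groups involved to avoid blocking.

        def read_only_timestamp() -> float:
            # Simplest valid choice for a read-only transaction: TT.now().latest.
            # (Spanner can often assign an older timestamp to avoid blocking reads.)
            return tt_now().latest

        def snapshot_read_timestamp(client_timestamp=None, max_staleness=None) -> float:
            # A snapshot read either takes an explicit timestamp from the client,
            # or derives one from an upper bound on the desired staleness.
            if client_timestamp is not None:
                return client_timestamp
            if max_staleness is not None:
                return tt_now().latest - max_staleness
            return read_only_timestamp()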

    Read-Write transaction

    Characteristics:

    • writes are buffered at the client until commit
      • as a result, reads in a transaction cannot see the effects of the transaction's own writes
    • to avoid deadlock, reads use wound-wait (see the sketch after this list)
    • when a client has completed all reads and buffered all its writes, it begins 2-phase commit
      • it chooses a coordinator group
        • and sends a commit message to each participant's leader
          • with the identity of the coordinator
          • with the buffered writes
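
    For reference, a tiny sketch of the wound-wait rule mentioned in the list: transactions are ordered by age (older = smaller timestamp), an older requester wounds (aborts) a younger lock holder, and a younger requester waits for an older holder. This is the generic wound-wait scheme, not Spanner's actual code.

        def on_lock_conflict(requester_ts: float, holder_ts: float) -> str:
            """Wound-wait deadlock avoidance; a smaller timestamp means an older transaction.
            Returns what the requester should do about the current lock holder."""
            if requester_ts < holder_ts:
                # The older transaction wounds (forces a restart of) the younger holder.
                return "wound_holder"
            else:
                # The younger transaction waits for the older holder to finish.
                return "wait"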

    How it works:

    1. Each non-coordinator participant leader acquires write locks, chooses a prepare timestamp, and logs a prepare record through Paxos
    2. Each participant notifies the coordinator of its prepare timestamp
    3. The coordinator leader also acquires write locks, but skips the prepare phase and chooses a commit timestamp for the entire transaction
      • The commit timestamp must be greater than or equal to all prepare timestamps and greater than any timestamps the leader has assigned previously
        • and greater than TT.now().latest at the time the coordinator received its commit message
    4. The coordinator leader logs a commit record through Paxos
      • Or, abort if it timed out while waiting on the other participants
    5. The coordinator leader waits until TT.after(commit timestamp) becomes true
      • In order to guarantee the commit timestamp is in the past
      • called 'commit wait'
    6. The coordinator sends the commit timestamp to the client and all other participant leaders
    7. All participants apply at the same timestamp and release locks
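
    Putting steps 3-7 together, here is a condensed sketch of the coordinator leader's side, reusing tt_now() and tt_after() from the TrueTime sketch earlier. Paxos logging and participant communication are stubbed out, and all names are illustrative.

        import time

        def log_commit_record_through_paxos(commit_ts: float) -> None:
            # Stub: in the real system this record is replicated through Paxos.
            pass

        def coordinator_commit(prepare_timestamps, last_assigned_ts, participants):
            """Sketch of steps 3-7 as executed by the coordinator leader."""
            # 3. Choose a commit timestamp s: at least every prepare timestamp,
            #    greater than TT.now().latest when the commit message was received,
            #    and greater than any timestamp this leader has assigned before.
            s = max(max(prepare_timestamps), tt_now().latest, last_assigned_ts) + 1e-6

            # 4. Log the commit record through Paxos.
            log_commit_record_through_paxos(s)

            # 5. Commit wait: block until s is guaranteed to be in the past,
            #    so the commit timestamp really reflects the commit order.
            while not tt_after(s):
                time.sleep(0.001)

            # 6./7. Send s to the client and all participant leaders, which then
            # apply the transaction's writes at s and release their locks.
            for apply_at_timestamp in participants:
                apply_at_timestamp(s)
            return s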

    Schema-Change transaction

    Spanner supports non-blocking schema-change transactions across datacenters, thanks to the TrueTime API.

    1. Assign the schema-change transaction a timestamp in the future
      • called t here
    2. Reads and writes, which implicitly depend on the schema, synchronize with any registered schema-change timestamp t (see the sketch after this list)
      • if an operation's timestamp is after t, it must block behind the schema-change transaction
      • otherwise, it may proceed
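
    The synchronization in step 2 boils down to a timestamp comparison; here is a sketch with hypothetical numeric timestamps.

        def can_proceed(operation_ts: float, schema_change_ts: float) -> bool:
            # Reads/writes whose timestamps precede the registered schema-change
            # timestamp t may proceed; later ones must block behind the schema change.
            return operation_ts < schema_change_ts

        print(can_proceed(operation_ts=90, schema_change_ts=100))   # True: proceed
        print(can_proceed(operation_ts=110, schema_change_ts=100))  # False: block behind t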

    Summary

    Spanner offers cross-datacenter replication and a semi-relational data model on top of a Bigtable-like versioned key-value store. Transactions are replicated through Paxos, a consensus algorithm, and the TrueTime API, backed by GPS receivers and atomic clocks, guarantees the correctness of Spanner's concurrency control, which even supports schema-change transactions.