Reading Note - Spanner paper
2020-08-01
This post is my reading note for the Spanner paper. Distributed SQL databases and NewSQL are a trend in our industry, and I think Spanner, Google's globally-distributed database, is the beginning of that era.
The paper is available here.
Yugabyte's blog posts describe what distributed SQL and NewSQL are. If you are interested, please refer to them.
- https://blog.yugabyte.com/what-is-distributed-sql/
- https://blog.yugabyte.com/distributedsql-vs-newsql/
Note that this memo may include misunderstandings or even errors.
Motivation
The main motivation for Google inventing Spanner was to manage cross-datacenter replicated data. They had Bigtable as their distributed storage infrastructure, but it was not sufficient for managing complex applications and evolving schemas. To address this, Spanner offers semi-relational tables on top of a Bigtable-like versioned key-value store, using Paxos as its consensus algorithm.
Overview
- Spanner is a database that shards data across many sets of Paxos state machines, which are located in datacenters all over the world.
- Paxos is a well-known consensus algorithm. Raft, another well-known one, is used in CockroachDB and other NewSQL databases.
- In order to manage cross-datacenter replicated data
- Other Google products used Bigtable, but it was difficult to manage complex and evolving schemas with it.
- Alternatively, some applications chose Megastore, which offers a semi-relational data model and synchronous replication support, but with poor write throughput.
- Bigtable-like versioned key-value store
- Semi-relational data model
- Each version is automatically timestamped with its commit time
- Old versions will be removed by GC
- Provides transactions
- SQL-based query language
- Replication configuration can be dynamically configured by applications
- Which datacenter contains which data
- To control read latency: how far data is from its users
- To control write latency: how far replicas are from each other
- Externally consistent reads and writes, and globally-consistent reads at a timestamp
- Atomic schema updates at global scale
- Timestamps in Spanner reflect serialization order
- TrueTime API
- Exposes clock uncertainty
- How uncertain Spanner's timestamps are
- As the uncertainty grows, Spanner slows down to wait it out
- Using GPS and atomic clocks
Terminologies
- Universe master
- singleton
- a console that displays status about all the zones
- Placement driver
- singleton
- handles automated movement of data across zones
- timescale of minutes
- periodically communicates with the spanservers to find data that needs to be moved
- Universe
- Spanner deployment
- something like test/production
- Zone
- unit of administrative deployment
- unit of physical isolation
- one or more zones in a datacenter
- has one zonemaster
- has one location proxy
- has hundreds to thousands of spanservers
- Zonemaster
- assigns data to spanservers
- Spanserver
- serves data to clients
- Location proxy
- used by clients to locate the spanservers assigned to serve their data
- Tablet
- data structure
- Spanner assigns timestamps to data, Bigtable doesn't
- a set of (key, timestamp) -> value mappings (see the sketch after this list)
- its state is stored in a set of B-tree-like files and a write-ahead log
- all on a distributed file system called Colossus
- each spanserver implements a single Paxos state machine on top of each tablet
- the state machine stores its metadata and log in its tablet
- Paxos group
- a set of replicas
- Transaction manager
- to support distributed transactions
- participant leader, participant slaves
- Coordinator
- coordinates the 2-phase commit
- Paxos groups' leaders
- coordinator leader, coordinator slaves
- Directory
- the unit of data placement (geographic-replication properties)
- manage replication and locality
- when data is moved between Paxos groups, it's moved directory by directory
- 50MB directory can be moved in a few seconds
- online data movement is available
- a set of contiguous keys that share a common prefix
- it should really be called a bucket; the name directory remains for historical reasons
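To make the tablet mapping above concrete, here is a minimal Python sketch of a versioned key-value store in the spirit of (key, timestamp) -> value. The class and method names are mine, not Spanner's; the real tablet persists its state in B-tree-like files and a write-ahead log on Colossus.

```python
import bisect

class Tablet:
    """Toy versioned key-value store: (key, timestamp) -> value."""

    def __init__(self):
        # key -> parallel lists: timestamps sorted ascending, and their values
        self._timestamps = {}
        self._values = {}

    def write(self, key, timestamp, value):
        ts_list = self._timestamps.setdefault(key, [])
        val_list = self._values.setdefault(key, [])
        pos = bisect.bisect_left(ts_list, timestamp)
        ts_list.insert(pos, timestamp)
        val_list.insert(pos, value)

    def read(self, key, timestamp):
        """Return the newest value written at or before the given timestamp."""
        ts_list = self._timestamps.get(key, [])
        pos = bisect.bisect_right(ts_list, timestamp)
        return self._values[key][pos - 1] if pos else None


t = Tablet()
t.write("user/1", 10, "alice")
t.write("user/1", 20, "alice-v2")
assert t.read("user/1", 15) == "alice"      # sees the version at timestamp 10
assert t.read("user/1", 25) == "alice-v2"   # sees the newest version
```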
Implementations
Spanner uses Paxos to support replication, and logs every Paxos write twice: once in the tablet's log and once in the Paxos log. Writes initiate the Paxos protocol at the leader, while reads access state directly from any replica that is sufficiently up-to-date. Each spanserver at a leader replica implements a lock table that uses two-phase locking for concurrency control. The transaction manager is used to support distributed transactions. If a transaction involves only one Paxos group, the lock table and Paxos together provide transactionality; otherwise the groups' leaders coordinate to perform a 2-phase commit.
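A rough sketch of that last decision, with illustrative names only (this is not Spanner's actual API):

```python
def choose_commit_protocol(participant_groups):
    """Sketch of how a commit is routed, given the set of Paxos groups
    the transaction touched (names and return values are illustrative)."""
    if len(participant_groups) == 1:
        # Single Paxos group: its lock table plus Paxos already provide
        # transactionality, so no cross-group coordination is needed.
        return "commit via the single group's Paxos leader"
    # Multiple groups: their leaders coordinate a two-phase commit,
    # with one group chosen as the coordinator.
    return "two-phase commit across %d groups" % len(participant_groups)


print(choose_commit_protocol({"group-A"}))             # single-group path
print(choose_commit_protocol({"group-A", "group-B"}))  # 2PC path
```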
The unit of data placement is the directory, but Spanner may shard a directory into multiple fragments, which may be served from different Paxos groups (and therefore different servers). The movedir task that moves directories actually moves fragments, not whole directories.
Spanner's data model looks relational, but it is not purely relational. Every table within a database within a universe is required to have an ordered set of one or more primary-key columns. This makes Spanner look more like a key-value store than a purely relational database.
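In other words, each table behaves roughly like a map from its primary-key columns to the remaining columns. A toy illustration (the table and column names here are made up):

```python
# Toy view of a table as a map from primary-key columns to the other columns.
# A "Users" table with primary key (user_id): the key tuple names the row.
users = {
    (1,): {"name": "alice", "email": "alice@example.com"},
    (2,): {"name": "bob", "email": "bob@example.com"},
}

def lookup(table, *primary_key):
    """Reading a row is just a key-value lookup on the primary-key tuple."""
    return table.get(tuple(primary_key))

print(lookup(users, 1))  # {'name': 'alice', 'email': 'alice@example.com'}
```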
TrueTime API
Spanner uses the TrueTime API
- to guarantee the correctness properties around concurrency control
- to implement some features, like externally consistent transactions, lock-free read-only transactions, non-blocking reads, etc.
The TrueTime API methods:
- TT.now(): returns a TTinterval: [earliest, latest]
- TT.after(t): true if t has definitely passed
- TT.before(t): true if t has definitely not arrived
Spanner deploys a set of time master machines per datacenter and a timeslave daemon per machine to implement the TrueTime API. Some of the masters have GPS receivers with dedicated antennas; the rest, called Armageddon masters, have atomic clocks.
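A minimal Python sketch of these semantics, assuming a fixed uncertainty bound epsilon around the local clock instead of the GPS/atomic-clock machinery described above (the class and the epsilon value are illustrative, not the real implementation):

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    """Toy stand-in for TrueTime: a fixed uncertainty bound around time.time()."""

    def __init__(self, epsilon=0.007):  # ~7 ms, an illustrative value only
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return TTInterval(earliest=t - self.epsilon, latest=t + self.epsilon)

    def after(self, t):
        # t has definitely passed only if even the earliest possible "now" is past it.
        return self.now().earliest > t

    def before(self, t):
        # t has definitely not arrived only if even the latest possible "now" is before it.
        return self.now().latest < t
```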
Concurrency Control & Transactions
A list of operations that Spanner supports:
- Read-Write transaction
- standalone writes
- retried internally
- Read-Only transaction
- non-snapshot standalone reads
- retried internally
- must be declared as read-only to distinguish it from a read-write transaction
- executed at a system-chosen (leader-chosen) timestamp without locking
- the reads are performed on sufficiently up-to-date replicas
- Snapshot Read
- executed without locking
- the read is performed on any sufficiently up-to-date replica
- timestamp
- client can specify a timestamp
- client can provide an upper bound on the desired timestamps
Transactional reads and writes use two-phase locking.
Read-Write transaction
Characteristics:
- writes are buffered at the client until commit
- so writes in a transaction can't be seen by reads in the same transaction
- to avoid deadlock, reads use wound-wait (see the sketch after this list)
- when a client has completed all reads and buffered all writes, it begins 2-phase commit
- it chooses a coordinator group
- and sends a commit message to each participant's leader
- with the identity of the coordinator
- with the buffered writes
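The wound-wait rule mentioned above can be illustrated with a toy sketch (not Spanner's actual lock-table code): conflicts are resolved by transaction age, and waits only ever go from younger to older, which prevents deadlock cycles.

```python
def on_lock_conflict(requester_start_ts, holder_start_ts):
    """Toy wound-wait rule: a transaction's age is its start timestamp
    (smaller = older). Names and return values are illustrative."""
    if requester_start_ts < holder_start_ts:
        # The requester is older: "wound" (abort and restart) the younger
        # holder so an older transaction is never blocked by a younger one.
        return "wound the holder"
    # The requester is younger: it waits. Since waits only go from younger
    # to older transactions, no deadlock cycle can form.
    return "requester waits"


assert on_lock_conflict(requester_start_ts=5, holder_start_ts=9) == "wound the holder"
assert on_lock_conflict(requester_start_ts=9, holder_start_ts=5) == "requester waits"
```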
How it works:
- Each non-coordinator participant leader acquires write locks, chooses a prepare timestamp, and logs a prepare record through Paxos
- Each participant notifies the coordinator of its prepare timestamp
- The coordinator leader acquires write locks, skips the prepare phase, and chooses a commit timestamp for the entire transaction
- The commit timestamp must be greater than or equal to all prepare timestamps, any timestamps the leader has assigned previously, and TT.now().latest
- The coordinator leader logs a commit record through Paxos
- or an abort record if it timed out while waiting on the other participants
- The coordinator leader waits until TT.after(commit timestamp) becomes true
- in order to guarantee the commit timestamp is in the past
- called 'commit wait'
- The coordinator sends the commit timestamp to the client and all other participant leaders
- All participants apply at the same timestamp and release locks
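A minimal sketch of the coordinator's timestamp choice and commit wait, reusing the toy TrueTime class from the earlier sketch. The function and argument names are illustrative; Paxos logging, locking, and messaging are omitted.

```python
import time

def coordinator_commit(tt, prepare_timestamps, last_assigned_ts):
    """Toy sketch: pick the commit timestamp, then perform commit wait.
    `tt` is the toy TrueTime object defined above."""
    # The commit timestamp must be >= every participant's prepare timestamp,
    # >= any timestamp this leader assigned previously, and >= TT.now().latest.
    commit_ts = max([tt.now().latest, last_assigned_ts] + list(prepare_timestamps))

    # ... the commit record would be logged through Paxos here ...

    # Commit wait: block until the commit timestamp is guaranteed to be in the
    # past, so the timestamp order matches the order in which commits are seen.
    while not tt.after(commit_ts):
        time.sleep(0.001)

    return commit_ts  # sent to the client and to all participant leaders
```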
Schema-Change transaction
Spanner supports non-blocking schema-change transactions across datacenters thanks to the TrueTime API.
- Assign a timestamp in the future, called t here
- Reads and writes synchronize with any registered schema-change transaction's timestamp t
- if the operation's timestamp > t, it blocks behind the schema-change transaction
- else, it can proceed
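A toy sketch of that synchronization (illustrative only): an operation compares its own timestamp against the registered schema-change timestamp t and either proceeds or blocks behind the schema change.

```python
def may_proceed(op_timestamp, schema_change_ts=None):
    """Toy gate for a registered schema-change timestamp t.
    Returns True if a read/write at op_timestamp can proceed now, or False
    if it must block until the schema change at schema_change_ts is applied."""
    if schema_change_ts is None:
        return True  # no schema-change transaction registered
    if op_timestamp > schema_change_ts:
        return False  # must wait behind the schema change
    return True


assert may_proceed(op_timestamp=100, schema_change_ts=200) is True
assert may_proceed(op_timestamp=300, schema_change_ts=200) is False
```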
Summary
Spanner offers cross-datacenter replication and a semi-relational data model on top of a Bigtable-like versioned key-value store. Transactions are committed through Paxos, a consensus algorithm. The TrueTime API, which depends on GPS receivers and atomic clocks, underpins concurrency control in Spanner, which even supports schema-change transactions.