6.824 Spanner lecture note
Spanner
Introduce
Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated data. Spanner is the first global distributed Database and support external consistency. It is a big framework, but I just want to focus on some particular points – Mainly about external consistency and True-Time mechanism.
You may look for more detailed illustration on Google’s Spanner paper or other’s blog about spanner.
Consistency
Assume that if we have 3 Data Centers, we have to send data across all data centers to keep consistency. To keep a strong consistency, Google uses paxos to send logs and reach consensus. Moreover, google has its True-Time mechanism to reach external consistency, but let’s talk about it later.
R/W transactions
For a transaction both need to read and write,spanner uses 2PC.
- Pick a unique transaction ID to identify
- Do all the reads first then do all writes
- Send read request to all leaders in DCs, then DC lock the resources and reply
- Choose a leader as transaction coordinator
- Send writes request to all leaders, and leader gonna send prepared msg to followers into paxos log to make sure leader isn’t crashed and lost lock
- Once one leader finishes promising to followers, it sends a Yes to the coordinator
- Once coordinate received all Yes signal, it start to commit writes to its followers
- Then tell other leader to commit
R/O transactions
For Read-Only transactions, spanner speeds up this type by no 2PC and no locks.
Start with a question that why we not directly read the latest value of each key needed?
Answer:
For a transaction T3 that print(x,y), if we have the timeline of below:
1
2
3 T1 : wx,wy,commit
T2 : wx,wy,commit
T3 : Rx RyT3 just saw x,y yielded by different transaction which breaks the serializability.
From the example above, we know that our read need to fetch data in the same version. So spanner need to at least reach level of Snapshot Isolation.
<h4> Snapshot Isolation</h4>
Spanner gives each transaction a timestamp, which makes all transactions execute in Timestamp order.
1 | R/W's TS = commit time |
Because Spanner has a multi-versions DB, that stores many versions (Not all version but a transient period’s versions). For R/O Xactions, DB can find the value with latest version less than R/O’s start time.
Example:
1 | T1 @TS10: wx,wy,commit() |
Q: what if local replica is minority, how to get latest version less than TS?
A: Every paxos peer gets log from leader, if one follower’s last log’s Timestamp < TS, it will wait for leader’s msg till last log’s TS exceeds required TS
True-Time mechanism
Because of working volts and inner CPU frequency , it is likely every DC’s time can not keep sync without outside’s help. We have two consequence:
R/O transaction’s TS too large
Answer: correct in practice but slow, it will wait for paxo replicas to catch up
R/O transaction’s TS too small
Answer: it may miss recent writes, and not external consistent
In Spanner’s scheme, Google has a satellites to keep synchronize its official time to each DC. In considering of latency in transportation, Spanner give each TT a range to mark its {earliest, latest} arrival time.
1 | TT interval = [Earliest, Latest] |
And we have a start rule:
TS = TT.now(). latest
For R/O, TS is the latest TT on start
For R/W, TS is the latest TT on commit
Commit Delay strategy
- R/W transaction delays till transaction’s commit time < TS.now().earliest
- To make sure commit time in the past.
Extends
There are many details I haven’t covered, if you are interest in them. You can just search on Google for English Blogs, as for Chinese Blogs, I do recommend Google-Spanner 论文的思考 and Spanner, 真时和CAP理论