Spanner

Introduce

Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated data. Spanner is the first global distributed Database and support external consistency. It is a big framework, but I just want to focus on some particular points – Mainly about external consistency and True-Time mechanism.

You may look for more detailed illustration on Google’s Spanner paper or other’s blog about spanner.

Consistency

spanner1

Assume that if we have 3 Data Centers, we have to send data across all data centers to keep consistency. To keep a strong consistency, Google uses paxos to send logs and reach consensus. Moreover, google has its True-Time mechanism to reach external consistency, but let’s talk about it later.

R/W transactions

For a transaction both need to read and write,spanner uses 2PC.

  • Pick a unique transaction ID to identify
  • Do all the reads first then do all writes
  • Send read request to all leaders in DCs, then DC lock the resources and reply
  • Choose a leader as transaction coordinator
  • Send writes request to all leaders, and leader gonna send prepared msg to followers into paxos log to make sure leader isn’t crashed and lost lock
  • Once one leader finishes promising to followers, it sends a Yes to the coordinator
  • Once coordinate received all Yes signal, it start to commit writes to its followers
  • Then tell other leader to commit

R/O transactions

For Read-Only transactions, spanner speeds up this type by no 2PC and no locks.

Start with a question that why we not directly read the latest value of each key needed?

Answer:

For a transaction T3 that print(x,y), if we have the timeline of below:

1
2
3
T1 :  wx,wy,commit
T2 : wx,wy,commit
T3 : Rx Ry

T3 just saw x,y yielded by different transaction which breaks the serializability.

From the example above, we know that our read need to fetch data in the same version. So spanner need to at least reach level of Snapshot Isolation.

<h4> Snapshot Isolation</h4>

Spanner gives each transaction a timestamp, which makes all transactions execute in Timestamp order.

1
2
R/W's TS = commit time
R/O's TS = start time

Because Spanner has a multi-versions DB, that stores many versions (Not all version but a transient period’s versions). For R/O Xactions, DB can find the value with latest version less than R/O’s start time.

Example:

1
2
3
4
5
6
T1 @TS10:  wx,wy,commit()
⬆x@10 = 9, y@10 = 10
T2 @TS20: wx,wy,commit
⬆x@20 = 8, y@10 = 12
T3 @TS15: Rx Ry
⬆ read the TS of 10

Q: what if local replica is minority, how to get latest version less than TS?

A: Every paxos peer gets log from leader, if one follower’s last log’s Timestamp < TS, it will wait for leader’s msg till last log’s TS exceeds required TS

True-Time mechanism

Because of working volts and inner CPU frequency , it is likely every DC’s time can not keep sync without outside’s help. We have two consequence:

  1. R/O transaction’s TS too large

    Answer: correct in practice but slow, it will wait for paxo replicas to catch up

  2. R/O transaction’s TS too small

    Answer: it may miss recent writes, and not external consistent

In Spanner’s scheme, Google has a satellites to keep synchronize its official time to each DC. In considering of latency in transportation, Spanner give each TT a range to mark its {earliest, latest} arrival time.

1
TT interval = [Earliest, Latest]

And we have a start rule:

  • TS = TT.now(). latest

  • For R/O, TS is the latest TT on start

  • For R/W, TS is the latest TT on commit

  • Commit Delay strategy

    • R/W transaction delays till transaction’s commit time < TS.now().earliest
    • To make sure commit time in the past.

Extends

There are many details I haven’t covered, if you are interest in them. You can just search on Google for English Blogs, as for Chinese Blogs, I do recommend Google-Spanner 论文的思考 and Spanner, 真时和CAP理论