Transcription of Spanner: Google’s Globally-Distributed Database
1 spanner : Google s Globally-Distributed DatabaseJames C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman,Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh,Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura,David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak,Christopher Taylor, Ruth Wang, Dale WoodfordGoogle, is Google s scalable, multi-version, Globally-Distributed , and synchronously-replicated Database . It isthe first system to distribute data at global scale and sup-port externally-consistent distributed transactions. Thispaper describes how spanner is structured, its feature set,the rationale underlying various design decisions, and anovel time API that exposes clock uncertainty.
2 This APIand its implementation are critical to supporting exter-nal consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transac-tions, and atomic schema changes, across all of IntroductionSpanner is a scalable, Globally-Distributed Database de-signed, built, and deployed at Google. At the high-est level of abstraction, it is a Database that shards dataacross many sets of Paxos [21] state machines in data-centers spread all over the world. Replication is used forglobal availability and geographic locality; clients auto-matically failover between replicas. spanner automati-cally reshards data across machines as the amount of dataor the number of servers changes, and it automaticallymigrates data across machines (even across datacenters)to balance load and in response to failures.
3 spanner isdesigned to scale up to millions of machines across hun-dreds of datacenters and trillions of Database can use spanner for high availability,even in the face of wide-area natural disasters, by repli-cating their data within or even across continents. Ourinitial customer was F1 [35], a rewrite of Google s ad-vertising backend. F1 uses five replicas spread acrossthe United States. Most other applications will probablyreplicate their data across 3 to 5 datacenters in one ge-ographic region, but with relatively independent failuremodes. That is, most applications will choose lower la-tency over higher availability, as long as they can survive1 or 2 datacenter s main focus is managing cross-datacenterreplicated data, but we have also spent a great deal oftime in designing and implementing important databasefeatures on top of our distributed -systems though many projects happily use Bigtable [9], wehave also consistently received complaints from usersthat Bigtable can be difficult to use for some kinds of ap-plications: those that have complex, evolving schemas,or those that want strong consistency in the presence ofwide-area replication.
4 (Similar claims have been madeby other authors [37].) Many applications at Googlehave chosen to use Megastore [5] because of its semi-relational data model and support for synchronous repli-cation, despite its relatively poor write throughput. As aconsequence, spanner has evolved from a Bigtable-likeversioned key-value store into a temporal multi-versiondatabase. Data is stored in schematized semi-relationaltables; data is versioned, and each version is automati-cally timestamped with its commit time; old versions ofdata are subject to configurable garbage-collection poli-cies; and applications can read data at old supports general-purpose transactions, and pro-vides a SQL-based query a Globally-Distributed Database , spanner providesseveral interesting features.
5 First, the replication con-figurations for data can be dynamically controlled at afine grain by applications. Applications can specify con-straints to control which datacenters contain which data,how far data is from its users (to control read latency),how far replicas are from each other (to control write la-tency), and how many replicas are maintained (to con-trol durability, availability, and read performance). Datacan also be dynamically and transparently moved be-tween datacenters by the system to balance resource us-age across datacenters. Second, spanner has two featuresthat are difficult to implement in a distributed Database : itPublished in the Proceedings of OSDI 20121provides externally consistent [16] reads and writes, andglobally-consistent reads across the Database at a time-stamp.
6 These features enable spanner to support con-sistent backups, consistent MapReduce executions [12],and atomic schema updates, all at global scale, and evenin the presence of ongoing features are enabled by the fact that spanner as-signs globally-meaningful commit timestamps to trans-actions, even though transactions may be timestamps reflect serialization order. In addition,the serialization order satisfies external consistency (orequivalently, linearizability [20]): if a transactionT1commits before another transactionT2starts, thenT1 scommit timestamp is smaller thanT2 s. spanner is thefirst system to provide such guarantees at global key enabler of these properties is a new TrueTimeAPI and its implementation.
7 The API directly exposesclock uncertainty, and the guarantees on spanner s times-tamps depend on the bounds that the implementation pro-vides. If the uncertainty is large, spanner slows down towait out that uncertainty. Google s cluster-managementsoftware provides an implementation of the TrueTimeAPI. This implementation keeps uncertainty small (gen-erally less than 10ms) by using multiple modern clockreferences (GPS and atomic clocks).Section 2 describes the structure of spanner s imple-mentation, its feature set, and the engineering decisionsthat went into their design. Section 3 describes our newTrueTime API and sketches its implementation. Sec-tion 4 describes how spanner uses TrueTime to imple-ment externally-consistent distributed transactions, lock-free read-only transactions, and atomic schema 5 provides some benchmarks on spanner s per-formance and TrueTime behavior, and discusses the ex-periences of F1.
8 Sections 6, 7, and 8 describe related andfuture work, and summarize our ImplementationThis section describes the structure of and rationale un-derlying spanner s implementation. It then describes thedirectoryabstraction, which is used to manage replica-tion and locality, and is the unit of data movement. Fi-nally, it describes our data model, why spanner lookslike a relational Database instead of a key-value store, andhow applications can control data spanner deployment is called auniverse. Giventhat spanner manages data globally, there will be onlya handful of running currently run atest/playground universe, a development/production uni-verse, and a production-only is organized as a set ofzones, where eachzone is the rough analog of a deployment of BigtableFigure 1: spanner server [9].
9 Zones are the unit of administrative deploy-ment. The set of zones is also the set of locations acrosswhich data can be replicated. Zones can be added to orremoved from a running system as new datacenters arebrought into service and old ones are turned off, respec-tively. Zones are also the unit of physical isolation: theremay be one or more zones in a datacenter, for example,if different applications data must be partitioned acrossdifferent sets of servers in the same 1 illustrates the servers in a spanner zone has onezonemasterand between one hundredand several thousandspanservers. The former assignsdata to spanservers; the latter serve data to clients. Theper-zonelocation proxiesare used by clients to locatethe spanservers assigned to serve their data.
10 Theuni-verse masterand theplacement driverare currently sin-gletons. The universe master is primarily a console thatdisplays status information about all the zones for inter-active debugging. The placement driver handles auto-mated movement of data across zones on the timescaleof minutes. The placement driver periodically commu-nicates with the spanservers to find data that needs to bemoved, either to meet updated replication constraints orto balance load. For space reasons, we will only describethe spanserver in any Spanserver Software StackThis section focuses on the spanserver implementationto illustrate how replication and distributed transactionshave been layered onto our Bigtable-based implementa-tion.