
Big Dictionaries II: versioning

Andy Twigg | March 7, 2011

In Part I, I discussed trade-offs between query and update performance in unversioned dictionaries; the take-home was that it’s possible to construct an external-memory dictionary that has vastly better update performance than a B-tree by sacrificing some query performance. In this post, we’ll consider the case of versioned dictionaries. A versioned dictionary stores keys and their values with an associated version tree, and supports the following operations:

  • update(key, value, version): associate the value with the key in the specified leaf version
  • query(start, end, version): return every key in the range [start, end], together with the value written in the closest ancestor of the given version (including that version itself)
  • clone(version): create a new version as a child of the specified version

A versioned dictionary can be thought of as efficiently implementing the union of many dictionaries: the `live’ keys at version v are the union of all keys written at v and its ancestors, and if a key appears in more than one of those versions, the value from the closest ancestor takes precedence. If the structure supports arbitrary version trees, we call it (fully-)versioned; if it supports only linear version trees, we call it partially-versioned.
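To pin down these semantics, here is a minimal in-memory sketch (Python; the class and method names are my own, and nothing here is IO-efficient; it only models the version tree and the closest-ancestor rule):

```python
class VersionedDict:
    """Naive in-memory model of a versioned dictionary (semantics only)."""

    def __init__(self):
        self.parent = {0: None}    # version -> parent (version 0 is the root)
        self.children = {0: []}    # version -> list of child versions
        self.writes = {0: {}}      # version -> {key: value} written at that version
        self._next = 1

    def clone(self, version):
        """Create a new leaf version as a child of `version`."""
        v, self._next = self._next, self._next + 1
        self.parent[v], self.children[v], self.writes[v] = version, [], {}
        self.children[version].append(v)
        return v

    def update(self, key, value, version):
        """Associate `value` with `key` in the given leaf version."""
        assert not self.children[version], "only leaf versions are writable"
        self.writes[version][key] = value

    def query(self, start, end, version):
        """Return {key: value} for keys in [start, end]; for each key, the
        closest ancestor of `version` that wrote it takes precedence."""
        result, v = {}, version
        while v is not None:                      # walk towards the root
            for k, val in self.writes[v].items():
                if start <= k <= end:
                    result.setdefault(k, val)     # a closer ancestor already won
            v = self.parent[v]
        return result

# Example: two clones diverge from the same base data.
d = VersionedDict()
d.update("k", "base", 0)
v1, v2 = d.clone(0), d.clone(0)
d.update("k", "left", v1)
print(d.query("a", "z", v1))   # {'k': 'left'}
print(d.query("a", "z", v2))   # {'k': 'base'}
```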

Big Data and versioned dictionaries?

The external-memory dictionary is at the heart of any database or file system, whether it’s ext3, ZFS, BerkeleyDB, MySQL, or one of the emerging key-value NoSQL stores (eg Google’s BigTable, Cassandra, Voldemort, etc). File systems (in particular, ZFS and NetApp’s WAFL) and SANs (such as those sold by EMC) have long valued the power that versioned dictionaries offer. But the requirements of emerging `big data’ key-value stores, in particular fast updates, have so far stopped them from offering efficient versioning. Andrew Byde has written a great post about why versioning is likely to become important for these applications.

A natural question is `why would Big Data stores want versioning?’ Imagine having so much data that you can’t replicate it in order to perform processing or analytics, yet you want to transform or modify the data in multiple independent ways, while receiving live ingests and serving it. For example, efficiently offering customers or developers personalised modifiable views of the same base dataset. Or performing analysis without first copying the entire dataset, and all while live data is entering the data store. All these seem extremely valuable, but they need a platform that can scale and offer fast updates. Traditional versioning stores can typically perform 100s of versioned updates/s, but the Acunu store can perform hundreds of thousands per second – something that now allows a new class of applications to be built.

A bit of history

The B-tree of Bayer et al. is the classic external-memory dictionary. Its classic versioned analogue is the copy-on-write (CoW) B-tree, which first appeared in the Episode File System and is based on the `path-copying’ technique originally presented by Driscoll et al. for making internal-memory pointer-based data structures fully-persistent. Unfortunately the technique does not apply efficiently to external-memory structures, where the `batching’ of requests is very important in order to achieve good performance. Over time, variants of the CoW B-tree have appeared, notably in ZFS, WAFL, Btrfs, and more.

The basic idea is to have a B-tree with many roots, one for each version. Nodes in this B-tree are versioned, and an update may only be applied to a node of the same version as the update. When starting an update in version v2, we first ensure that there is a suitable root associated with v2 (by duplicating the root for v1, the parent of v2, if necessary), then follow pointers just as in a normal B-tree; every time a node is encountered whose version is other than v2, it is copied to make a new node of version v2, and the parent node’s pointer is updated to point to the new node, before the update continues down the tree. A lookup proceeds as in a B-tree, starting from the appropriate root.
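As a rough illustration, here is a sketch of that descent (Python; the node layout and names are my own, and node splitting/rebalancing is left out, so it shows only the copy-on-write mechanics):

```python
class Node:
    def __init__(self, version, keys=None, children=None, values=None):
        self.version = version
        self.keys = keys or []          # separator keys (internal) or stored keys (leaf)
        self.children = children or []  # empty list for leaves
        self.values = values or []      # leaf payloads

def dup(node, v2):
    """The physical copy behind copy-on-write: duplicate a node into version v2."""
    return Node(v2, list(node.keys), list(node.children), list(node.values))

def cow_insert(roots, parent_of, key, value, v2):
    """Insert (key, value) into version v2 of a CoW B-tree.

    `roots` maps version -> root node; `parent_of` maps version -> parent
    version.  Node splits are omitted to keep the sketch short.
    """
    # Make sure v2 has its own root, duplicating its parent's root if needed.
    if v2 not in roots:
        roots[v2] = dup(roots[parent_of[v2]], v2)

    node = roots[v2]
    while node.children:                              # descend as in a normal B-tree
        i = sum(1 for k in node.keys if k <= key)     # which child to follow
        child = node.children[i]
        if child.version != v2:
            child = dup(child, v2)                    # copy-on-write the child...
            node.children[i] = child                  # ...and repoint the parent
        node = child

    # Ordinary leaf insert (overflow handling omitted).
    i = sum(1 for k in node.keys if k < key)
    node.keys.insert(i, key)
    node.values.insert(i, value)
```

In the worst case every node on the root-to-leaf path is duplicated by a single update, which is exactly the space and random-IO problem described next.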

The CoW B-tree has two major problems that have forced us to invent fundamentally new data structures:

  • Space efficiency: each update may cause an entire new path to be written (see eg this post by RethinkDB – and their implementation is good);
  • Slow updates: each update may require a set of random reads (to read the old path) as well as a set of writes. Some file systems try to make the writes sequential (eg by embedding the tree into an append-only log, as ZFS does, or with something more involved, as WAFL does), but the random reads are still the killer; and in any case, the space blow-up means that the random IO demanded by garbage collection becomes inevitable on heavily-used systems.
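To get a feel for the numbers, here is a back-of-envelope sketch (Python; the node size, entry size and key count are illustrative assumptions, not measurements):

```python
import math

# Illustrative assumptions, not measurements.
node_size = 64 * 1024                    # 64 KiB on-disk nodes
entry_size = 100                         # ~100 bytes per key/value entry
fanout = node_size // entry_size         # ~650 entries per node
n_keys = 10**9                           # a billion keys in the tree

height = math.ceil(math.log(n_keys, fanout))   # root-to-leaf path length (~4)

# A CoW update rewrites the whole path and must first read the old path:
bytes_written = height * node_size
print(f"path of {height} nodes: ~{bytes_written // 1024} KiB written "
      f"and {height} random reads to change one ~{entry_size}-byte entry")
```

A handful of random reads per update is enough to cap a disk-backed store at the hundreds of versioned updates per second quoted earlier, and the repeatedly rewritten paths are what drive the space blow-up and the eventual garbage-collection IO.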

So we have two major questions:

  • Can we achieve the same query/update bound as the CoW B-tree, but with linear space?
  • Can we achieve much better update bounds if we relax the query bound, using linear space? In particular, can we get a fully-versioned analogue of the fractal tree?

We want a versioned dictionary that offers fast (or, even better, tunable) updates, linear space usage, and great range query performance. Nothing currently known offers faster updates than the CoW tree. Interestingly, the space problem was tackled by Becker et al., who developed the Multiversion B-tree (MVBT) in response. However, it offers only slow updates (although this can be partially addressed with buffering techniques), and is only partially-versioned. To summarise the trade-offs (writing N_v for the number of keys live in version v, as opposed to N, the total number of keys written across all versions): the CoW B-tree is fully-versioned but pays heavily in space and update IO, while the MVBT achieves linear space but offers only slow updates and partial versioning.

Fast forward: two candidate schemes

And now we come to the exciting part: tradeoffs for versioned dictionaries! Let’s start with a simple example: maintaining a chain of partially persistent snapshots. Two obvious ideas come to mind:

  • Idea 1: for each version v, maintain a B-tree that stores every key accessible at v. Then queries and updates to version v are fast: they run in O(log_B N_v) IOs. But the clone operation on version v has to perform O(N_v/B) IOs to copy all the data from the B-tree for v into the newly created B-tree for the cloned version, and, to add insult to injury, the space requirement now grows as something like the sum of N_v over all versions, rather than N. This may not seem like a big deal, but the worst case is terrible: imagine that in the root version we write a billion keys, then in each of 1000 descendant versions write just 1 key. N is about 10^9, but the space used is about 1000 × 10^9 = 10^12 keys. Ouch!
  • Idea 2: instead of copying the entire set of keys on a clone operation, how about we just keep track of the diffs between successive versions? Let T_v be a B-tree on the diff between version v and its parent. Clearly this is pretty much optimal when it comes to space: we store at most N elements in total. Updates are easy, too: find the tree for version v (which is a leaf, recalling that we can only modify leaf versions) and do the B-tree update; this takes O(log_B |T_v|) IOs, where |T_v| is the number of keys written exactly at version v (we call such keys `lead’ keys). What about queries? This is where the idea falls down: one has to check potentially every version to answer a range query, since the latest version for a key is not known a priori. Even with Bloom filters to help you out, range queries would scale horribly as the number of snapshots increased (see the sketch after this list).
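Here is a sketch of Idea 2 (Python; the names are my own, and sorted lists stand in for the on-disk trees T_v). It makes the query problem visible by counting how many versions a range query has to touch:

```python
import bisect

class DiffPerVersionStore:
    """Idea 2: each version keeps only the keys written directly in it."""

    def __init__(self):
        self.parent = {0: None}
        self.diffs = {0: []}          # version -> sorted list of (key, value)

    def clone(self, version):
        v = max(self.diffs) + 1
        self.parent[v], self.diffs[v] = version, []
        return v

    def update(self, key, value, version):
        entries = self.diffs[version]            # O(log_B |T_v|) in the real structure
        i = bisect.bisect_left(entries, (key,))
        if i < len(entries) and entries[i][0] == key:
            entries[i] = (key, value)
        else:
            entries.insert(i, (key, value))

    def query(self, start, end, version):
        """Every ancestor's diff must be consulted: we cannot know in advance
        which version last wrote any given key."""
        result, versions_touched, v = {}, 0, version
        while v is not None:
            versions_touched += 1
            entries = self.diffs[v]
            i = bisect.bisect_left(entries, (start,))
            while i < len(entries) and entries[i][0] <= end:
                result.setdefault(entries[i][0], entries[i][1])
                i += 1
            v = self.parent[v]
        return result, versions_touched

# A chain of 1000 snapshots: even a tiny range query touches all of them.
s = DiffPerVersionStore()
v = 0
for i in range(1000):
    s.update(f"key{i:04d}", i, v)
    v = s.clone(v)
_, touched = s.query("key0500", "key0500", v)
print(touched)   # 1001
```

A thousand snapshots means a thousand diff trees probed per range query, however small the answer, which is exactly the scaling problem described above.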

Going beyond CoW: introducing the SDA

Last time, we saw a curve showing roughly the query/update tradeoff landscape for unversioned dictionaries; the versioned case has a similar tradeoff picture.

The Stratified Doubling Array (SDA) and Adaptive Stratified Doubling Array (ASDA) are data structures we’ve developed here at Acunu that, for the first time, break through the slow-update barrier while offering optimal space and range query performance. Recalling that curve, the SDA can be viewed as a fully-versioned analogue of the Fractal Tree in terms of its update/query performance. The Adaptive SDA takes this a step further by allowing us to vary the tradeoff dynamically, depending on the workload. More details on that later.

If you’re still wanting some big-O notation, then here are the bounds the SDA achieves: for any constant 0 < ε ≤ 1, it offers O(N) space, range queries returning Z elements in O(log_B N_v + Z/B) IOs, and updates in O((log_B N_v) / B^(1−ε)) amortised IOs. Modulo constant factors, it’s impossible to do better than this. Interestingly, it almost always performs sequential IOs without needing a log file system or append-only log, and we can give provable guarantees as to its performance. For users, this means predictable performance and space utilisation, and you can rest assured that you’ve got something that is fundamentally optimal – always good!
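For intuition only (this is the standard B^ε-style amortisation behind fractal-tree-like updates, not a description of the SDA’s internals, which are left for the future posts below), bounds of this shape arise as follows:

```latex
% A node with fanout B^\epsilon buffers \Theta(B) updates and flushes them to
% its children at a cost of O(B^{\epsilon-1}) IOs per element per level; an
% element descends O(\log_{B^\epsilon} N_v) levels before settling.
\[
  O\!\left(\log_{B^{\epsilon}} N_v\right) \text{ levels}
  \;\times\;
  O\!\left(B^{\epsilon-1}\right) \text{ IOs per element per level}
  \;=\;
  O\!\left(\frac{\log_B N_v}{\epsilon\, B^{1-\epsilon}}\right)
  \;=\;
  O\!\left(\frac{\log_B N_v}{B^{1-\epsilon}}\right)
  \text{ amortised IOs, for constant } \epsilon .
\]
```

Compared with the O(log_B N_v) IOs a CoW B-tree pays per update, the B^(1−ε) factor in the denominator is where the orders-of-magnitude update speedup comes from.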

In future posts, I’m planning to delve into some inner workings of the SDA, and to consider the tradeoffs and provable guarantees offered by log file systems.