Concatenating red-black trees

J D picture J D · Jul 5, 2010 · Viewed 8.1k times · Source

The OCaml standard library has a wonderful Set implementation that uses a very efficient divide-and-conquer algorithm to compute the union of two sets. I believe it takes whole subtrees (not just single elements) from one set and inserts them into the other set, rebalancing when necessary.

I'm wondering if this requires the height information that is kept in the AVL tree that OCaml uses or if this is also possible with red-black trees. For example, is it possible to concatenate a pair of red-black trees more efficiently than simply iterating over the second tree appending its elements to the end of the first tree?

Answer

jbapple picture jbapple · Jul 11, 2010

I'm not sure from the wording of your question if you are interested in set union or concatenation or both, or if you're only interested in persistent data structures as are common in OCaml or also in ephemeral structures.

An implementation of red-black trees with fingers is described by Heather D. Booth in a chapter from her thesis. With fingers, a red-black tree of size n can be split into two trees of size p and q in amortized O(lg (min (p,q))) time and two red-black trees of size p and q can be concatenated in the same bound. Additionally, an element can be added or deleted at either end of an rb tree in amortized O(1) time. With these operations, it is possible to achieve amortized O(p lg(q/p)) time set union (for p < q), which is information-theoretically optimal. Perhaps the key idea to get these bounds is the reversal of the child pointers on the left and right spines.

The bounds above are amortized in the traditional sense. For a functional language like OCaml, one might wish to have bounds that apply when a data structure is used persistently. I do not think Booth's description will achieve all of those bounds when the trees are used persistently. For example, insertion at a finger can take ω(1) recolorings. This might be solved via the lazy recolorings discussed in Driscoll et al.'s "Making Data Structures Persistent".

On the other hand, I think Booth's analysis might show that concatenation is still O(lg (max (p,q))) even when used persistently. I'm less optimistic about the set union bound.

Set operations with asymptotically optimal time bounds are possible in a functional setting. Those described by Hinze & Paterson achieve the bounds in an amortized (but persistent) sense, the treaps described by Blandford & Blelloch achieve the bounds in a randomized sense, and those described by Kaplan & Tarjan achieve them in worst-case time. The latter also offer O(lg lg (min(p,q))) concatenation, though Hinze & Paterson are dubious of that claim. These trees are not a direct answer to your question, which is specific to red-black trees, but they hopefully give a flavor of what is possible, and the H&P paper includes code, and has been verified correct using Coq, which can extract to OCaml code.

Two more pointers you might be interested in: Brodal et al. presented search trees with O(lg n) find, insert, and delete and O(1) concat even in a functional setting. Additionally, Atallah et al. claim to describe a red-black tree that has amortized O(1) concat (presumably ephemerally only), but Buchsbaum and Goodrich claim that there are several flaws in that structure.

One final note about the utility of red-black trees: in one of the comments on one of the answers to this question, you say:

The only advantage of a red-black tree is that the auxiliary information (red or black) is only 1-bit per branch. By adding height, you've lost that advantage and might as well just use a height-balanced tree instead.

There are other advantages as well. For instance, some data structures used in computational geometry are based on binary search trees but have a high cost of tree rotation. Red-black trees can be rebalanced in at most 3 rotations per insert and delete, while AVL trees can take Ω(lg n) rotations for these operations. As Ralf Hinze noticed, Okasaki's rebalancing scheme for red-black trees (code available in ML, Haskell, Java, and Ada) does not offer the same bound, and can end up doing Ω(lg n) rotations on insertion. (Okasaki does not present deletion.)

Additionally, height-balanced search trees (and even AVL trees) can be stored so as to use only one bit of balance information per node. Some trees have only two possible balance positions at each node, like one-sided height-balanced trees, but trees with up to four possible balance positions per node can store one bit of balance information in each child, as initially explained by Brown and later expanded upon by Haeupler et al.

Edit:

In answer to your specific query at the end of your question, here is a description of an algorithm for concatenating two red-black trees. It takes O(lg(max(|L|,|R|))) time, which is too long to get the asymptotically optimal union time I describe above. For comparison, I expect that the "join" implementation for AVL sets in OCaml's stdlib gets O(h1-h2) performance, where h1 is the height of the taller tree, though it actually joins two AVL trees given an element that fits between them, while the algorithm below has to find and remove that mortar element from one of its arguments. You could avoid that by only storing elements at the leaves, as in a B+ tree, but that has a space penalty of having to keep a bunch of pointers to elements in the non-leaf nodes to guide search. In any case, it wouldn't make join constant time for trees of the same height like the AVL join code in the OCaml stdlib, since you would still have to calculate the black height of each tree, as explained below.

Given two non-empty red-black trees L and R, we will produce a new red-black tree that is the concatenation of L and R. This will take time proportional to O(lg (max(|L|,|R|))), where |L| denotes the number of nodes in L.

First, remove the largest element from L, c. Next, find the black height of L and R. By "black height", I mean the number of black nodes on any path from the root to a leaf. By the red-black tree invariants, this is constant on all paths of any given tree. Call L's black height p and R's black height q, and assume w.l.o.g. p ≤ q.

From the root of R, follow left children until arriving at a black node R' with height p. Make a new red tree C with root element c, left child L and right child R'. Since L is a red-black tree on its own, its root is black, and the color invariants are not violated at or below C. Furthermore, the black height of C is p.

However, we cannot simply splice C back into R in place of R'. First, if p = q, R' is R, yet C has a red root. In this case, simply recolor the root of C black. This is your new concatenated tree.

Second, if R' is not the root, it may have a red parent. Red parents are not permitted to have red children, so we must rebalance. Here we just apply Okasaki's rebalancing scheme all the way up the spine between R' (now replaced with C) and the root of R.

There are two possible cases. If C has no grandparent, color C's parent black. The tree is now valid.

If C has a grandparent, it must be black and of black height p+1, since C's parent is red. Replace C's grandparent with a new red tree, the root of which is the root of C's parent, the left child of which is C, recolored black, and the right child of which is a black tree that consists of C's sibling, C's grandparent's root, and C's uncle, in that order. This doesn't increase the black height of C's grandparent, but it changes its color to red, which might make it a root or a red child of a red parent, so we have to rebalance again, and so on all the way up the tree

  • Finding the black height of both trees : O(lg |L|) + O(lg |R|)
  • Tracing down R to the right spot: O(lg |R| - lg |L|)
  • Rotations all the way back up to the root: O(lg |R| - lg |L|)

None of these is greater than O(lg |R| + lg |L|) = O(lg (max(|L|,|R|)))

To make this O(lg (min(|L|,|R|))), first reverse the spine pointers. Then you don't need the black height of the larger tree, you only need to count black spine nodes until one tree runs out of spine. Then, use the original (not Okasaki's) rebalancing scheme to make sure you only rebalance O(1) nodes. Finally, mark the rest of the spine that doesn't need rebalancing for lazy recoloring if necessary later.