RedblackImplementation and Proof of Red-Black Trees

Required Reading

(1) General background on red-black trees,
  • Section 3.3 of Algorithms, Fourth Edition, by Sedgewick and Wayne, Addison Wesley 2011; or
  • Chapter 13 of Introduction to Algorithms, 3rd Edition, by Cormen, Leiserson, and Rivest, MIT Press 2009
  • or Wikipedia.
(2) an explanation of the particular implementation we use here. Red-Black Trees in a Functional Setting, by Chris Okasaki. Journal of Functional Programming, 9(4):471-477, July 1999. https://www.cs.tufts.edu/~nr/cs257/archive/chris-okasaki/redblack99.pdf
(3) Optional reading: Efficient Verified Red-Black Trees, by Andrew W. Appel, September 2011. http://www.cs.princeton.edu/~appel/papers/redblack.pdf

Implementation

Red-black trees are a form of binary search tree (BST), but with balance. Recall that the depth of a node in a tree is the distance from the root to that node. The height of a tree is the depth of the deepest node. The insert or lookup function of the BST algorithm (Chapter SearchTree) takes time proportional to the depth of the node that is found (or inserted). To make these functions run fast, we want trees where the worst-case depth (or the average depth) is as small as possible.
In a perfectly balanced tree of N nodes, every node has depth less than or or equal to log N, using logarithms base 2. In an approximately balanced tree, every node has depth less than or equal to 2 log N. That's good enough to make insert and lookup run in time proportional to log N.
The trick is to mark the nodes Red and Black, and by these marks to know when to locally rebalance the tree. For more explanation and pictures, see the Required Reading above.
We will use the same framework as in Extract: keys are Ocaml integers. We don't repeat the Extract commands, because they are imported implicitly from Extract.v

From VFA Require Import Perm.
From VFA Require Import Extract.
From Coq Require Import Logic.FunctionalExtensionality.
Open Scope Z_scope.

Definition key := int.

Inductive color := Red | Black.

Section TREES.
Variable V : Type.
Variable default: V.

Inductive tree : Type :=
 | E : tree
 | T: color tree key V tree tree.

Definition empty_tree := E.
lookup is exactly as in our (unbalanced) search-tree algorithm in Extract, except that the T constructor carries a color component, which we can ignore here.

Fixpoint lookup (x: key) (t : tree) : V :=
  match t with
  | Edefault
  | T _ tl k v trif ltb x k then lookup x tl
                         else if ltb k x then lookup x tr
                         else v
  end.
The balance function is copied directly from Okasaki's paper. Now, the nice thing about machine-checked proof in Coq is that you can prove this correct without actually understanding it! So, do read Okasaki's paper, but don't worry too much about the details of this balance function.
In contrast, Sedgewick has proposed left-leaning red-black trees, which have a simpler balance function (but a more complicated invariant). He does this in order to make the proof of correctness easier: there are fewer cases in the balance function, and therefore fewer cases in the case-analysis of the proof of correctness of balance. But as you will see, our proofs about balance will have automated case analyses, so we don't care how many cases there are!

Definition balance rb t1 k vk t2 :=
 match rb with RedT Red t1 k vk t2
 | _
 match t1 with
 | T Red (T Red a x vx b) y vy c
      T Red (T Black a x vx b) y vy (T Black c k vk t2)
 | T Red a x vx (T Red b y vy c) ⇒
      T Red (T Black a x vx b) y vy (T Black c k vk t2)
 | amatch t2 with
            | T Red (T Red b y vy c) z vz d
                T Red (T Black t1 k vk b) y vy (T Black c z vz d)
            | T Red b y vy (T Red c z vz d) ⇒
                T Red (T Black t1 k vk b) y vy (T Black c z vz d)
            | _T Black t1 k vk t2
            end
  end
 end.
(In contrast, Sedgewick has proposed left-leaning red-black trees, which have a simpler balance function (but a more complicated invariant). He does this in order to make the proof of correctness easier: there are fewer cases in the balance function, and therefore fewer cases in the case-analysis of the proof of correctness of balance. But as you will see, our proofs about balance will have automated case analyses, so we don't care how many cases there are!)
The insert operation uses a helper function ins to do the insertion, and another function makeBlack to color the resulting tree's root black.

Definition makeBlack t :=
  match t with
  | EE
  | T _ a x vx bT Black a x vx b
  end.

Fixpoint ins x vx s :=
 match s with
 | ET Red E x vx E
 | T c a y vy bif ltb x y then balance c (ins x vx a) y vy b
                  else if ltb y x then balance c a y vy (ins x vx b)
                  else T c a x vx b
 end.

Definition insert x vx s := makeBlack (ins x vx s).
The elements function does not change except to ignore the color component of T.

Fixpoint elements' (s: tree) (base: list (key×V)) : list (key × V) :=
 match s with
 | Ebase
 | T c a k v belements' a ((k,v) :: elements' b base)
 end.

Definition elements (s: tree) : list (key × V) := elements' s nil.
Now that the program has been defined, it's time to prove its properties. A red-black tree has two kinds of invariants:
  • BST: the keys in each left subtree are all less than the node's key, and the keys in each right subtree are greater
  • Balanced: there is the same number of black nodes on any path from the root to each leaf; and there are never two red nodes in a row.
First, we'll look at some automation methodology.

Proof Automation for Case-Analysis Proofs.

Let's start by proving a seemingly obvious fact about ins, namely that the result of any ins operation is a non-empty tree. (We don't actually need this fact, but it makes a good example.) The proof turns out to be surprisingly complicated, because of the need for a big case analysis over all the clauses of the balance function. These proofs are very tedious to do "by hand," but are easy to automate.

Proof Automation for Case-Analysis Proofs.

Several of the proofs for red-black trees require a big case analysis over all the clauses of the balance function. These proofs are very tedious to do by hand, but are easy to automate. Let's get some practice by automating a proof that insertion produces a non-empty tree.

Lemma T_neq_E:
   c l k v r, T c l k v r E.
Proof.
  unfold not. intros. discriminate.
Qed.

Lemma ins_not_E: x vx s, ins x vx s E.
Proof.
  intros. destruct s; simpl.
  discriminate. (* This form of the discriminate tactic is shorthand
                   for intro X; discriminate X. *)

  remember (ins x vx s1) as a1.
  unfold balance.
Let's destruct on the topmost case, ltb x k. We can use destruct instead of bdestruct because we don't need to remember whether x<k or xk.

  destruct (ltb x k).
  (* The topmost test is match c with..., so just destruct c *)
  destruct c.
  (* This one is easy. *)
  discriminate.
  (* The topmost test is match a1 with..., so just destruct a1 *)
  destruct a1.
  (* The topmost test is match s2 with..., so just destruct s2 *)
  destruct s2.
  (* This one is easy. *)
  discriminate.
How long will this go on? A long time! But it will terminate. Just keep typing. Better yet, let's automate. The following tactic applies whenever the current goal looks like
match ?c with Red _ | Black _ end _
and what it does in that case is destruct c

  match goal with
  | ⊢ match ?c with Red_ | Black_ end _destruct c
  end.
The following tactic applies whenever the current goal looks like
match ?s with E _ | T _ _ _ _ _ _ end _
and what it does in that case is destruct s

  match goal with
      | ⊢ match ?s with E_ | T _ _ _ _ __ end _destruct s
  end.
Let's apply that tactic again, and then try it on the subgoals, recursively. Recall that the repeat tactical keeps trying the same tactic on subgoals.

  repeat match goal with
         | ⊢ match ?s with E_ | T _ _ _ _ __ end _destruct s
         end.
  match goal with
  | ⊢ T _ _ _ _ _ Ediscriminate
  end.
This seems promising. Let's start the proof all over again and incorporate this automation.

Abort.

Lemma ins_not_E: x vx s, ins x vx s E.
Proof.
  intros. destruct s; simpl.
  discriminate.
  remember (ins x vx s1) as a1.
  unfold balance.
This is the beginning of the big case analysis. This time, let's combine several tactics together:
  repeat match goal with
         | ⊢ (if ?x then _ else _) _destruct x
         | ⊢ match ?c with Red_ | Black_ end _destruct c
         | ⊢ match ?s with E_ | T _ _ _ _ __ end _destruct s
         end.
What we have left is 117 cases, every one of which can be proved the same way:
  discriminate.
  discriminate.
  discriminate.
  discriminate.
  discriminate.
  discriminate.
Only 111 cases to go...
  discriminate.
  discriminate.
  discriminate.
  discriminate.
Only 107 cases to go... There has to be a better way, and there is:
  all: discriminate.
The all: goal selector applies a tactic to all remaining subgoals. Alternatively, we can restart the proof yet again, and add one more clause to the match goal tactic...
Abort.
Let's start over again, adding that apply to the case analysis.

Lemma ins_not_E: x vx s, ins x vx s E.
Proof.
  intros. destruct s; simpl.
  discriminate.
  remember (ins x vx s1) as a1.
  unfold balance.
  repeat match goal with
         | ⊢ (if ?x then _ else _) _destruct x
         | ⊢ match ?c with Red_ | Black_ end _destruct c
         | ⊢ match ?s with E_ | T _ _ _ _ __ end _destruct s
         | ⊢ T _ _ _ _ _ Ediscriminate
         end.
Qed.

The BST Property

The Binary Seach Tree (BST) property for red-black trees is exactly the same as for ordinary searchtrees (we just ignore the color of each node). We copy the relevant definitions from SearchTree, also changing to use int rather than nat.
ForallT P t holds if P k v holds for every (k,v) node of the tree t.

Fixpoint ForallT (P: int V Prop) (t: tree) : Prop :=
 match t with
 | ETrue
 | T c l k v rP k v ForallT P l ForallT P r
 end.

Inductive BST : tree Prop :=
| ST_E : BST E
| ST_T : c l k v r,
    ForallT (fun k' _(int2Z k') < (int2Z k)) l
    ForallT (fun k' _(int2Z k') > (int2Z k)) r
    BST l
    BST r
    BST (T c l k v r).

Exercise: 1 star, standard (empty_tree_BST)

Lemma empty_tree_BST: BST empty_tree.
Proof.
(* FILL IN HERE *) Admitted.
Now, let's proceed to show that insert preserves the BST property. This is a really important result, because it turns out that for red-black trees (unlike the simpler searchtrees in SearchTree) the BST property is required to make insert and lookup interact correctly.
We'll address this task in easy stages. First we prove that if t is a BST, then the rebalanced version of t is also a BST. This turns out to be quite a bit of work, but automation helps a great deal.
Lemma balance_BST:
  c l k v r,
   ForallT (fun k' _(int2Z k') < (int2Z k)) l
   ForallT (fun k' _(int2Z k') > (int2Z k)) r
   BST l
   BST r
   BST (balance c l k v r).
Proof.
  intros c l k v r PL PR BL BR.
  unfold balance.
Use proof automation for this case analysis.

  repeat match goal with
          | ⊢ BST (match ?c with Red_ | Black_ end) ⇒ destruct c
          | ⊢ BST (match ?s with E_ | T _ _ _ _ __ end) ⇒ destruct s
          end.
58 cases to consider!

  - constructor. auto. auto. auto. auto.
  - constructor; auto.
  - constructor; auto.
  - constructor; auto.
To prove this one, we need to destruct a hypothesis above the line.

    + simpl. repeat split.
       destruct PR as [? [? ?]].
       omega.
And for this one, we first need to both invert and destruct above the line.
    + repeat split.
      × inv BR.
        destruct H5 as [? [? ?]].
        omega.
      × inv BR.
        destruct H5 as [? [? ?]].
        auto.
      × inv BR.
        destruct H5 as [? [? ?]].
        auto.
    + constructor; auto.
This time, let's do the inversion and destruction before deconstructing.
    + inv BR.
      inv H7.
      constructor; auto.
  - constructor; auto.
Hmm. Another complicated case. Rather than working through it, let's try the proof again with stronger automation based on what we learned above. Whenever we have a hypothesis above the line that looks like
  • H: BST (T _ _ _ _ _)
  • H: ForallT _ (T _ _ _ _ _)
we should invert or destruct it, and if we have a goal below the line of the form
ForallT _ (T _ _ _ _ _)
we should split it up. Otherwise we try
try constructor; auto; try omega

Abort.

Lemma balance_BST:
  c l k v r,
   ForallT (fun k' _(int2Z k') < (int2Z k)) l
   ForallT (fun k' _(int2Z k') > (int2Z k)) r
   BST l
   BST r
   BST (balance c l k v r).
Proof.
  intros.
  unfold balance.
Use heavier-duty proof automation for this case analysis.

  repeat (match goal with
          | ⊢ BST (match ?c with Red_ | Black_ end) ⇒ destruct c
          | ⊢ BST (match ?s with E_ | T _ _ _ _ __ end) ⇒ destruct s
          | ⊢ ForallT _ (T _ _ _ _ _) ⇒ repeat split
          | H: ForallT _ (T _ _ _ _ _) ⊢ _destruct H as [? [? ?] ]
          | H: BST (T _ _ _ _ _) ⊢ _inv H
          end;
          (try constructor; auto; try omega)).
(Only) 41 cases to consider now! A little disappointing that we didn't clear more cases, but let's look at why are we stuck.
The first goal follows from the hypotheses
[ ForallT (fun (k' : int) (_ : V) int2Z k' > int2Z k0) r2 ]
and
[ int2Z k1 < int2Z k0 ]
but we need a lemma about >, to be proved inductively over r2. The other goals look similar.

Abort.

Lemma ForallT_imp : (P Q : int V Prop) t,
  ForallT P t
  ( k v, P k v Q k v)
  ForallT Q t.
Proof.
  induction t; intros.
  - auto.
  - destruct H as [? [? ?]]. repeat split; auto.
Qed.

Lemma ForallT_greater : t k k0,
    ForallT (fun k' _int2Z k' > int2Z k) t
    int2Z k > int2Z k0
    ForallT (fun k' _int2Z k' > int2Z k0) t.
Proof.
  intros; eapply ForallT_imp; eauto.
  intros. simpl in H1.
  eapply Zgt_trans; eauto.
Qed.
We can predict that we'll need a matching lemma for <
Lemma ForallT_less : t k k0,
    ForallT (fun k' _int2Z k' < int2Z k) t
    int2Z k < int2Z k0
    ForallT (fun k' _int2Z k' < int2Z k0) t.
Proof.
  intros; eapply ForallT_imp; eauto.
  intros. simpl in H1.
  eapply Z.lt_trans; eauto.
Qed.

Lemma balance_BST:
  c l k v r,
   ForallT (fun k' _(int2Z k') < (int2Z k)) l
   ForallT (fun k' _(int2Z k') > (int2Z k)) r
   BST l
   BST r
   BST (balance c l k v r).
Proof.
  intros.
  unfold balance.
Use same heavier-duty proof automation as before.

  repeat (match goal with
          | ⊢ BST (match ?c with Red_ | Black_ end) ⇒ destruct c
          | ⊢ BST (match ?s with E_ | T _ _ _ _ __ end) ⇒ destruct s
          | ⊢ ForallT _ (T _ _ _ _ _) ⇒ repeat split
          | H: ForallT _ (T _ _ _ _ _) ⊢ _destruct H as [? [? ?] ]
          | H: BST (T _ _ _ _ _) ⊢ _inv H
          end;
          (try constructor; auto; try omega)).
Use our lemmas on everything that is left.
  all : try eapply ForallT_greater; try eapply ForallT_less; eauto; try omega.
Qed.
With that fact about balance established, let's take a moment to prove that insertion preserves universal properties of tree nodes. As before, the balance pattern matching will generate lots of cases, so automation is essential.

Exercise: 2 stars, standard (balanceP)

Lemma balanceP : P l c r k v,
    ForallT P l
    ForallT P r
    P k v
    ForallT P (balance c l k v r).
Proof.
  intros.
  unfold balance.
(* Hint: Use proof automation with match goal and/or all: to handle this. *)
(* FILL IN HERE *) Admitted.

Exercise: 2 stars, standard (insP)

Lemma insP : P t k v,
    ForallT P t
    P k v
    ForallT P (ins k v t).
Proof.
(* Hint: Use the last lemma. There should be no need for automated case analysis. *)
(* FILL IN HERE *) Admitted.
Now to the main results on insertion. Again, all the hard work was done in proving balance_BST, so no automated case analysis should be needed here -- just a simple induction. Remember bdall!

Exercise: 3 stars, standard (ins_BST)

Lemma ins_BST : t k v,
    BST t
    BST (ins k v t).
Proof.
(* FILL IN HERE *) Admitted.

Exercise: 2 stars, standard (insert_BST)

Theorem insert_BST : t v k,
    BST t
    BST (insert k v t).
Proof.
  (* FILL IN HERE *) Admitted.

Proving fundamental laws relating insert and lookup.

Next, we want to show the fundamental laws for maps hold on red-black trees. The empty_tree case is trivial.

Lemma lookup_empty : k, lookup k empty_tree = default.
Proof.
  auto.
Qed.
As above, the hard work will be in handling balance. The following intermediate lemma is useful to characterize the interaction of lookup and balance, and is not too hard to prove with automation.

Exercise: 3 stars, standard (balance_lookup)

Lemma balance_lookup: k c k0 v0 l r,
            BST l
            BST r
            ForallT (fun k' _int2Z k' < int2Z k0) l
            ForallT (fun k' _int2Z k' > int2Z k0) r
            lookup k (balance c l k0 v0 r) =
              if (int2Z) k <? (int2Z k0) then
                lookup k l
              else if (int2Z k0) <? (int2Z k) then
                lookup k r
              else v0.
Proof.
  intros.
  unfold balance.
(* FILL IN HERE *) Admitted.
Now we can easily prove the main laws relating lookup and ins. Note that unlike for the search trees in SearchTree, these laws are only true for trees that have the BST property; without it, lookup might fail to find an inserted item moved by a balance operation.
Neither of these lemmas requires any automation; the hard case analysis work has all been done in balance_lookup.

Exercise: 2 stars, standard (lookup_ins_eq)

Lemma lookup_ins_eq: t k v,
    BST t
    lookup k (ins k v t) = v.
Proof.
(* FILL IN HERE *) Admitted.

Exercise: 3 stars, standard (lookup_ins_neq)

Theorem lookup_ins_neq: t k v k0,
    BST t
    k k0
    lookup k0 (ins k v t) = lookup k0 t.
Proof.
(* FILL IN HERE *) Admitted.
Finally, the main theorems relating lookup and insert. The proofs of these two are almost identical.

Exercise: 3 stars, standard (lookup_insert)


Theorem lookup_insert_eq : t k v,
    BST t
    lookup k (insert k v t) = v.
Proof.
(* FILL IN HERE *) Admitted.

Theorem lookup_insert_neq: t k v k0,
    BST t
    k k0
    lookup k0 (insert k v t) = lookup k0 t.
Proof.
(* FILL IN HERE *) Admitted.
OK, we're almost done! We have proved all these main theorems:
We could now proceed to reprove all the facts about elements that we developed in SearchTrees. Since elements does not not pay attention to colors, and does not rebalance the tree, these proofs should be a simple copy-paste from that chapter, with only minor edits. This would not be a very interesting exercise, so we don't pursue it here.

Proving Efficiency

Red-black trees are supposed to be more efficient than ordinary search trees, because they stay balanced. In a perfectly balanced tree, any two leaves have exactly the same depth, or the difference in depth is at most 1. In an approximately balanced tree, no leaf is more than twice as deep as another leaf. Red-black trees are approximately balanced. Consequently, no node is more then 2logN deep, and the run time for insert or lookup is bounded by a constant times 2logN.
We can't prove anything directly about the run time, because we don't have a cost model for Coq functions. But we can prove that the trees stay approximately balanced; this tells us important information about their efficiency.

Exercise: 4 stars, standard (is_redblack_properties)

The relation is_redblack ensures that there are exactly n black nodes in every path from the root to a leaf, and that there are never two red nodes in a row.

 Inductive is_redblack : tree color nat Prop :=
 | IsRB_leaf: c, is_redblack E c 0
 | IsRB_r: tl k kv tr n,
          is_redblack tl Red n
          is_redblack tr Red n
          is_redblack (T Red tl k kv tr) Black n
 | IsRB_b: c tl k kv tr n,
          is_redblack tl Black n
          is_redblack tr Black n
          is_redblack (T Black tl k kv tr) c (S n).

Lemma is_redblack_toblack:
   s n, is_redblack s Red n is_redblack s Black n.
Proof.
  (* FILL IN HERE *) Admitted.

Lemma makeblack_fiddle:
   s n, is_redblack s Black n
             n, is_redblack (makeBlack s) Red n.
Proof.
  (* FILL IN HERE *) Admitted.
nearly_redblack expresses, "the tree is a red-black tree, except that it's nonempty and it is permitted to have two red nodes in a row at the very root (only)."

Inductive nearly_redblack : tree nat Prop :=
| nrRB_r: tl k kv tr n,
         is_redblack tl Black n
         is_redblack tr Black n
         nearly_redblack (T Red tl k kv tr) n
| nrRB_b: tl k kv tr n,
         is_redblack tl Black n
         is_redblack tr Black n
         nearly_redblack (T Black tl k kv tr) (S n).

Lemma ins_is_redblack:
   x vx s n,
    (is_redblack s Black n nearly_redblack (ins x vx s) n)
    (is_redblack s Red n is_redblack (ins x vx s) Black n).
Proof.
  induction s; intro n; simpl; split; intros; inv H; repeat constructor; auto.
  × destruct (IHs1 n); clear IHs1.
    destruct (IHs2 n); clear IHs2.
    specialize (H0 H6).
    specialize (H2 H7).
    clear H H1.
    unfold balance.
You will need proof automation, in a similar style to the proofs of ins_not_E and balance_lookup.

(* FILL IN HERE *) Admitted.

Corollary ins_red :
   t k v n,
    (is_redblack t Red n is_redblack (ins k v t) Black n).
Proof.
  intros. apply ins_is_redblack. assumption.
Qed.

Lemma insert_is_redblack:
   x xv s n, is_redblack s Red n
                     n', is_redblack (insert x xv s) Red n'.
Proof.
  (* Just apply a couple of lemmas: ins_is_redblack and makeblack_fiddle *)
(* FILL IN HERE *) Admitted.

Exercise: 3 stars, standard, optional (redblack_bound)

To confirm that red-black trees really are roughly balanced, define functions to compute the height (maximum depth) and minimum depth of a red-black tree, and prove that the height is bounded by twice the minimum depth + 1.

Fixpoint height (t:tree) : nat
  (* REPLACE THIS LINE WITH ":= _your_definition_ ." *). Admitted.

Fixpoint mindepth (t:tree) : nat
  (* REPLACE THIS LINE WITH ":= _your_definition_ ." *). Admitted.

Lemma redblack_balanced : t c n,
    is_redblack t c n
    (height t 2 × mindepth t + 1)%nat.
Proof.
  (* Hints: 
     (1) Prove two auxiliary lemmas, one about height and the other about mindepth,
         and then combine them to get the result.
         The lemma about height will need a slightly complicated induction 
         hypothesis for the proof to go through. 
     (2) Depending on how you defined height and mindepth, the tactic zify 
         (defined in the standard library Coq.omega.PreOmega) may be useful as a 
         preliminary to using omega when proving these lemmas. *)

  (* FILL IN HERE *) Admitted.

End TREES.

Extracting and Measuring Red-Black Trees


Extraction "redblack.ml" empty_tree insert lookup elements.
You can run this inside the OCaml top level by:
#use "redblack.ml";;
#use "test_searchtree.ml";;
On my machine, in the byte-code interpreter this prints,
Insert and lookup 1000000 random integers in 0.889 seconds.
Insert and lookup 20000 random integers in 0.016 seconds.
Insert and lookup 20000 consecutive integers in 0.015 seconds.
You can compile and run this with the ocaml native-code compiler by:
$ ocamlopt -c redblack.mli redblack.ml
$ ocamlopt redblack.cmx -open Redblack test_searchtree.ml -o test_redblack
$ ./test_redblack
On my machine this prints,
Insert and lookup 1000000 random integers in 0.436 seconds.
Insert and lookup 20000 random integers in 0. seconds.
Insert and lookup 20000 consecutive integers in 0. seconds.

Success!

The benchmark measurements above (and in Extract.v) demonstrate that:
  • On random insertions, red-black trees are slightly faster than ordinary BSTs (red-black 0.436 seconds, vs ordinary 0.468 seconds)
  • On consecutive insertions, red-black trees are much faster than ordinary BSTs (red-black 0. seconds, vs ordinary 0.374 seconds)
In particular, red-black trees are almost exactly as fast on the consecutive insertions (0.015 seconds) as on the random (0.016 seconds).

(* Tue Jan 28 13:58:41 EST 2020 *)