SIAM J. COMPUT.
Vol. 27, No. 6, pp. 1695–1724, December 1998
© 1998 Society for Industrial and Applied Mathematics
COMPUTING THE LOCAL CONSENSUS OF TREES∗
SAMPATH KANNAN† , TANDY WARNOW† , AND SHIBU YOOSEPH†
Abstract. The inference of consensus from a set of evolutionary trees is a fundamental problem
in a number of fields such as biology and historical linguistics, and many models for inferring this
consensus have been proposed. In this paper we present a model for deriving what we call a local
consensus tree T from a set of trees T . The model we propose presumes a function f , called a
total local consensus function, which determines for every triple A of species, the form that the
local consensus tree should take on A. We show that all local consensus trees, when they exist,
can be constructed in polynomial time and that many fundamental problems can be solved in linear
time. We also consider partial local consensus functions and study optimization problems under this
model. We present linear time algorithms for several variations. Finally we point out that the local
consensus approach ties together many previous approaches to constructing consensus trees.
Key words. algorithms, graphs, evolutionary trees
AMS subject classifications. 05C05, 68Q25, 92-08, 92B05
PII. S0097539795287642
1. Introduction. An evolutionary tree (also called a phylogeny or phylogenetic
tree) for a species set S is a rooted tree with |S| = n leaves labeled by distinct elements
in S. Because evolutionary history is difficult to determine (it is both computationally
difficult as most optimization problems in this area are NP-hard and scientifically
difficult as well since a range of approaches appropriate to different types of data exist),
a common approach to solving this problem is to apply many different algorithms to
a given data set, or to different data sets representing the same species set, and then
look for common elements from the set of trees which are returned.
There is extensive literature about inferring consensus from ordered sets of trees,
with much attention paid to the properties of the rules for inferring the consensus. In
this paper, we will make an explicit assumption that the consensus rule be independent
of the ordering of the trees in the input; i.e., we will presume that the input to the
consensus problem is an unordered multiset of evolutionary trees, each leaf-labelled
by the elements in S. We call this input a profile, noting that in this paper the
terminology is restricted in meaning as we have indicated.
Several consensus methods are described in the literature for deriving one tree
from a profile of evolutionary trees. These methods include maximum agreement
subtrees [16, 19, 13, 24, 14], strict consensus trees [4, 9], median trees (also known
as majority trees) [5], compatibility trees [10, 11, 12], the Nelson tree [22], and the
Adams consensus [1].
The algorithms for some of these are implemented in standard packages and are
in use; most common, perhaps, are strict and majority consensus tree approaches.
∗ Received by the editors June 8, 1995; accepted for publication (in revised form) September 12,
1996; published electronically June 3, 1998. The research of the first author was supported in part
by NSF grant CCR-9108969. The research of the second author was supported in part by ARO grant
DAAL03-89-0031PRI, NSF Young Investigator Award, and by generous support from Paul Angello.
The research of the third author was supported in part by ARO grant DAAL03-89-0031PRI, a
fellowship from the Institute for Research in Cognitive Science at the University of Pennsylvania,
and a fellowship from the Program in Mathematics and Molecular Biology at the University of
California at Berkeley, which is supported by NSF grant DMS-9406348.
http://www.siam.org/journals/sicomp/27-6/28764.html
† Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA
19104 (kannan@central.cis.upenn.edu, tandy@central.cis.upenn.edu, yooseph@saul.cis.upenn.edu).
One notion of the information content of an evolutionary tree is the degree of
resolution indicated by the tree; this can be quantified in a number of ways, for
example, by counting the number of internal nodes or the number of resolved triples1
in the tree. This is because the most usual interpretation of an unresolved triple in an
evolutionary tree is that the evolutionary history of that triple cannot be absolutely
inferred from the data. Thus, for example, a completely resolved tree (i.e., a binary
tree) asserts a hypothesis about the evolution of all triples of taxa, while the star
(i.e., root with all taxa children of the root) does not assert any hypothesis about
the evolution of any triple. One of the motivations for proposing this new model of
consensus tree construction is the observation that on some data sets the strict and
majority consensus trees may be fairly uninformative (i.e., be fairly unresolved).
In this paper, we propose a new model, called the local consensus. This model
is based upon functions, called local consensus functions, for inferring the rooted
topology of the homeomorphic subtree induced by triples of species. We will show
that given any local consensus function, we can determine whether a tree (called the
local consensus tree) consistent with the constraints implied by the local consensus
function can be computed in polynomial time and that many of the natural forms of
the local consensus can be computed in linear time. We also analyze optimization
problems based upon partial local consensus rules and show that many of these can
also be solved in polynomial time. We will show that this method unifies many of the
previously favored approaches while providing greater flexibility to the biologists in
the interpretation of the data. Furthermore, the local consensus trees produced are, in
most cases, significantly more informative (in the sense of more refined; see the above
discussion) than trees produced using the strict or majority consensus methods.
2. Preliminaries.
2.1. Trees. Let S = {s1 , s2 , . . . , sn } be a set of species. An evolutionary tree
for S (also known as a phylogenetic tree or, more simply, a phylogeny) is a rooted
tree T with n leaves each labeled by a distinct element from S. The internal nodes
denote ancestors of the species in S. For an arbitrary subset S ′ ⊂ S we denote by
T|S′ the homeomorphic subtree of T induced by the leaves in S ′ . In particular, for a
specified triple {a, b, c} ⊂ S we denote by T|{a, b, c} the homeomorphic subtree of T
induced by the leaves labeled by a, b, and c. This topology is completely determined
by specifying the pair of species among a, b, and c whose least common ancestor (LCA)
lies farthest away from the root. If (a, b) is this pair then we denote this by ((a, b), c),
and T is said to be resolved on the triple a, b, c. If T is not binary it may happen
that all three pairs of species have the same LCA. In this case we will say that a, b, c
is unresolved in T and denote this topology by (a, b, c). In this paper, when we say
a triple a, b, c is resolved, we mean that T |{a, b, c} is one of ((a, b), c), ((a, c), b), or
((b, c), a).
For a profile P , which is defined by a multiset {T1 , T2 , . . . , Tk }, we let P |{a, b, c}
denote the multiset {T1 |{a, b, c}, T2 |{a, b, c}, . . . , Tk |{a, b, c}}.
Given a tree T containing nodes u, v, w, we let lcaT (u, v, w) denote the LCA of
u, v, and w in T . Also, we let u ≤T v denote that v is on the path from u to the root
of T .
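As an illustration of how the form of T |{a, b, c} follows from pairwise LCAs, here is a minimal Python sketch. It assumes a tree stored as a hypothetical parent map (each node maps to its parent, the root maps to None); the function names are ours. It walks ancestor paths and so takes time proportional to the depth per query; it is not the constant-time LCA machinery used later in the paper.

```python
def ancestors(parent, node):
    """Return the path from node up to the root, both inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def lca(parent, x, y):
    """Least common ancestor of x and y, found by intersecting ancestor paths."""
    above_x = set(ancestors(parent, x))
    return next(node for node in ancestors(parent, y) if node in above_x)


def depth(parent, node):
    """Number of edges between node and the root."""
    return len(ancestors(parent, node)) - 1


def triple_topology(parent, a, b, c):
    """Return ((x, y), z) if the triple is resolved, or (a, b, c) if unresolved."""
    candidates = [((a, b), c), ((a, c), b), ((b, c), a)]
    depths = [depth(parent, lca(parent, x, y)) for (x, y), _ in candidates]
    deepest = max(depths)
    if depths.count(deepest) == 3:
        # All three pairs share the same LCA: the triple is a star.
        return (a, b, c)
    # Exactly one pair has its LCA farthest from the root.
    return candidates[depths.index(deepest)]


# Hypothetical example tree: r has children v and e; v has children u and w;
# u has children a and b; w has children c and d.
example_parent = {"a": "u", "b": "u", "c": "w", "d": "w",
                  "u": "v", "w": "v", "v": "r", "e": "r", "r": None}
star_parent = {"x": "r", "y": "r", "z": "r", "r": None}
```

For instance, triple_topology(example_parent, "a", "b", "c") yields (("a", "b"), "c"), since the LCA of a and b lies below the LCA of the other two pairs.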
2.2. Local consensus functions, rules, and trees. Let T (a, b, c) denote the
set of rooted subtrees on the leaf set {a, b, c} ⊆ S; thus |T (a, b, c)| = 4, with three of
the trees being resolved and one being the star (i.e., unresolved) tree on a, b, c.
1 See section 2.1 for definitions of a resolved triple and an unresolved triple.
A local consensus function is a function f which specifies the constraints for certain
(i.e., perhaps not all) triples a, b, c of species. Let A be the set of all three element
subsets of S. We define f : A → ∪{a,b,c}∈A T (a, b, c) ∪ {∗}. When f (X) = ∗, for some
X = {a, b, c} ∈ A, this indicates that the form of the triple a, b, c is unconstrained.
When f (X) ≠ ∗ ∀X ∈ A, i.e., no triple is unconstrained, then f is said to be a total
local consensus function. Otherwise, f is said to be a partial local consensus function.
A rooted tree T (if it exists) which is leaf-labelled by elements from S and which
meets all the constraints implied by the local consensus function f is called an f-local
consensus tree.2 Note that when a triple a, b, c is set to be unconstrained by f , then
T |{a, b, c} can be any of the elements in T (a, b, c). Thus T is a tree such that for all
triples X ∈ A, T |X = f (X), if f (X) ≠ ∗.
A local consensus function can be applied to a profile P . It is also possible for the
local consensus function to define the form of the output triple based upon the forms
the triple takes in the profile. Such local consensus functions are called local consensus
rules. Let M be the set of all multisets of size k, where each element of a multiset
belongs to T (a, b, c). A local consensus rule is a function f : M → T (a, b, c) ∪ {∗}.
If f (X) = ∗, for some X ∈ M, then f is said to be a partial local consensus rule;
otherwise, f is a total local consensus rule.
Given a profile P and a local consensus rule f , the f -local consensus tree (if it
exists) is a rooted tree T such that for all triples X ⊆ S, T |X = f (P |X), if f (P |X)
≠ ∗.3
It is not the case that a local consensus tree necessarily exists for an arbitrary
local consensus function (or rule) applied to an arbitrary input profile. Determining
whether a local consensus tree exists, and constructing it when it does, is the subject
of this paper.
The structure of the paper is as follows. In section 3, we will describe some
general techniques for determining if a local consensus tree exists. In particular, we
will give a polynomial time algorithm (based upon the algorithm in [3]), which can
determine if a local consensus tree exists for an arbitrary local consensus function (or
rule), and construct it when it does. We will also describe a class of natural local
consensus rules and describe general techniques for constructing local consensus trees
from such natural local consensus rules when they exist. In section 4, we then describe
some specific natural local consensus rules and some fast algorithms for constructing
the local consensus trees. In section 5, we consider optimization problems related
to constructing local consensus trees and present efficient algorithms to solve some
of these optimization problems. We conclude in section 6 with a discussion and
suggestions for extensions.
3. Techniques.
3.1. General local consensus functions. For an arbitrary local consensus
function f and an arbitrary profile of trees T = {T1 , T2 , . . . , Tk }, we can compute
the constraint indicated by f for every triple of species a, b, c. This produces a set of
O(n^3) constraints on the consensus tree we wish to construct, where each constraint
is a rooted tree for a triple on a species set a, b, c. This rooted tree may be resolved
(i.e., it may be of the form ((a, b), c)) or it may be unresolved (i.e., of the form (a, b, c)).
If there is a tree T meeting all these constraints, then T is the local consensus tree for
f . Thus, we can reduce the problem of consensus tree construction for an arbitrary
local consensus function to the problem of determining consistency of a set of rooted
triples.
2 We will also sometimes refer to it simply as a local consensus tree.
3 Note that f is defined the same on all triples X ⊆ S. As defined above, the triple labels a, b, c
serve merely as place holders. The definition of a local consensus rule can easily be changed to
accommodate a different rule for each triple.
3.1.1. Rooted triple consistency. We present results related to this general
problem.
Theorem 3.1. Determining if a tree T exists which meets a set of constraints
(and constructing it if it does) can be solved in O(pn log n) time if the constraints
include unresolved triples and otherwise can be solved in O(pn) time, where p is the
number of constraints defined by f .
Proof. In [3], Aho et al. describe algorithms which determine if a family of constraints on LCA relations can be satisfied within a single rooted tree. We describe here
the simple algorithm they give for the case where the constraints are given as rooted
resolved triples ((x, y), z). For such input the algorithm works top-down figuring out
the clusters at the children of the root before recursing. To do this the algorithm
maintains disjoint sets. Initially all leaves are in singleton sets. For each rooted triple
((x, y), z) the algorithm unions the sets containing x and y to indicate that x and y
must lie below the same child of the root. This algorithm never unions sets unless this
is forced. Recursive calls include constraints that are on species entirely contained
in the same component discovered in the previous call. If all the species are seen to
be in the same component (either initially or during a recursive call), the algorithm
determines that the constraints cannot be simultaneously satisfied. This simple algorithm has a worst-case behavior of O(pn), where there are p LCA constraints and the
underlying set S has n elements which will be leaves in the final tree.
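The top-down procedure just described can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the implementation of [3]: the function name build, the nested-tuple output format, and the input format (a list of pairs ((x, y), z), one per resolved triple) are all hypothetical, and no attempt is made to match the stated running time.

```python
def build(leaves, triples):
    """Return a nested-tuple tree satisfying all resolved triples ((x, y), z),
    or None if the triples cannot be satisfied simultaneously."""
    leaves = list(leaves)
    if len(leaves) == 1:
        return leaves[0]

    # Disjoint sets: initially every leaf is in a singleton set.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # x and y must lie below the same child of the root.
    for (x, y), _z in triples:
        parent[find(x)] = find(y)

    groups = {}
    for x in leaves:
        groups.setdefault(find(x), []).append(x)
    if len(groups) == 1:
        # All species ended up in one component: inconsistent.
        return None

    children = []
    for group in groups.values():
        members = set(group)
        # Recurse only on constraints entirely inside this component.
        inside = [t for t in triples if {t[0][0], t[0][1], t[1]} <= members]
        child = build(group, inside)
        if child is None:
            return None
        children.append(child)
    return tuple(children)
```

For example, the triples ((a, b), c), ((a, b), d), and ((c, d), a) on four leaves give the tree ((a, b), (c, d)), while ((a, b), c) together with ((b, c), a) is reported as inconsistent.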
However, when all the triples are resolved, the consistency problem can be solved
faster than with the Aho et al. algorithm: in [21], a faster algorithm is given for
this case of the problem addressed in [3].
Lemma 3.1 (Henzinger, King, and Warnow [21]). Let A be a set of p resolved
rooted triples on a leaf set S with |S| = n. We can determine in min{O(p√n), O(p +
n^2.5)} time whether a tree T exists such that T |{a, b, c} is homeomorphic to the rooted
triple(s) in A on {a, b, c} (if such a triple exists in A).
In the context of the rooted triple consistency problem, we also refer to the work
of [8, 7], where the conditions necessary for a given set of triple constraints to define
a tree are investigated.
3.2. Constructing local consensus trees in polynomial time. As a consequence of the results in the previous section, we can prove the following theorem.
Theorem 3.2. Let f be an arbitrary partial local consensus rule and T a set of
k evolutionary trees on S with |S| = n.
1. If every triple which is not set to ∗ is defined to be resolved by f , then we
can determine if the local consensus tree exists and construct it if it does in
O(kn^3) time.
2. If f defines some triples (which are not set to ∗) to be unresolved, then we
can determine if the local consensus tree exists and construct it if it does in
O(kn^3 + n^4 log n) time.
Proof. Given f , T , and a triple A, we can determine the form of Tf |A (for those
triples A for which Tf |A is not unconstrained) in O(kn^3) time. If all the triples
which are not set to unconstrained are defined to be resolved, then by Lemma 3.1 we
can determine if the partial local consensus tree exists and construct it if it does, in
O(n^2.5 + p) time, where p is the number of constraints. The total time is therefore
bounded by the cost of computing the triples. If some of the triples are unresolved then
we can use Theorem 3.1 to get an O(kn^3 + n^4 log n) algorithm which will determine
if the tree exists and construct it when it does.
3.2.1. Constructing local consensus trees from total local consensus
rules. While local consensus trees can be constructed in O(kn^3) time from partial
local consensus rules, local consensus trees can be computed even faster when the
local consensus rule is total.
Lemma 3.2 (Kannan, Lawler, and Warnow [18]). Given an oracle O which can
answer queries of “What is the form of T |{a, b, c} for a species set {a, b, c}?”, we can
construct in O(n^2) time a tree T consistent with all the oracle queries (if it exists),
and in O(rn log n) time if the tree T has degree bounded by r.
Theorem 3.3. Let f be a total local consensus rule. Then given a set of k rooted
trees on n species, we can construct in O(kn^2) time the f -local consensus tree Tf if it
exists. If f always returns resolved subtrees, then we can compute Tf in O(kn log n)
time.
Proof. We can implement the oracle determining the form of the homeomorphic
subtree of Tf on a triple a, b, c by first preprocessing the trees to answer LCA queries
in constant time using [20]. Then, answering a query needs only O(k) time. By [18],
we need only O(n^2) queries and O(n^2) additional work for a total cost of O(kn^2) in
the general case. When Tf has degree bounded by r, we have total cost O(krn log n).
If f always returns resolved subtrees, then Tf will be binary, so that the total cost is
O(kn log n).
Note, however, that this algorithm does not verify that the tree constructed is the
local consensus tree; that is, it is possible that the constraints are inconsistent, so that
no local consensus tree exists for that local consensus function (or rule). When it does,
however, the tree constructed will equal the local consensus tree. Thus, when it can
be shown that the local consensus tree does exist, then this method will necessarily
produce the local consensus tree. In general, however, it will be necessary to verify
that the constructed tree is the local consensus tree.
We have described two algorithms for inferring whether a local consensus tree
exists for an arbitrary local consensus function (or rule). When the local consensus
function (or rule) is total, if the local consensus tree exists, it can be constructed
in O(kn^2) time, where k is the number of trees in the profile and n is the number
of leaves in each tree. However, the tree that results then needs to be verified to
be the local consensus tree (and the fastest verification algorithm may still require
Ω(kn^3) time). When the local consensus function (or rule) is partial, then a slower
O(kn^3) algorithm can be used, but it simultaneously constructs and verifies that the
constructed tree is the local consensus tree.
3.3. Local consensus rules. A local consensus rule must handle essentially
three types of situations for each pattern of subtrees in the profile for a triple a, b, c
of species: profile constant on a,b,c; profile compatible on a,b,c; profile incompatible
on a,b,c. The profile of trees may agree on that set a, b, c, and thus all reflect the
same evolutionary history, or the trees may differ (in two different ways) on the triple.
Depending upon the pattern of different subtrees, the local consensus rule may elect
to constrain the form of the output or to leave the output unconstrained for that
triple. However, we will only consider a local consensus rule to be natural if it is
conservative, where by conservative we mean the following definition.
Definition 3.1. Let P be a profile of evolutionary trees and f be a local consensus
rule. Then f is said to be conservative iff, for every triple a, b, c, whenever f (P |{a, b, c}) =
((a, b), c), the triple a, b, c is not resolved as ((a, c), b) or ((b, c), a) in any of the trees in
P.
Being conservative is obviously a natural requirement, since to enforce a topological constraint which is contradicted in the profile is clearly unmotivated.
We now describe the three general scenarios that may arise and discuss the possible constraints that may arise under natural local consensus rules.
Profile constant on a, b, c. If all the trees in the profile have the same form on a
triple a, b, c, then we say the profile is constant on a, b, c. In this case, a natural local
consensus rule should either require that the consensus tree have the same form as
the trees in the profile, or it may leave the form unconstrained.
Profile compatible on a, b, c. If all the trees in the profile that have resolved
subtrees for a, b, c have the same resolved form (i.e., no two trees in the profile resolve
a, b, c differently), then the profile is said to be compatible on a, b, c. In this case,
the natural local consensus rule may elect to leave the tree unconstrained for a, b, c;
otherwise, it should constrain the output to either be the unique resolution indicated
by the profile or should constrain it to be unresolved. In the first case, we call the
local consensus rule optimistic, and in the second case we call the local consensus rule
pessimistic.
Profile incompatible on a, b, c. The remaining case is where the profile contains
trees which have different resolutions for a, b, c. In this case, a natural local consensus
rule may elect to require the consensus tree to be unresolved, or it may select one of
the resolutions represented in the profile4 (perhaps selecting the resolution with the
plurality representation), or it may not constrain the output at all.
A local consensus rule can be defined by deciding how it will respond to each of
the different situations that can arise. Thus, for example, a natural local consensus
rule may require that when the profile is constant on a, b, c, then the output tree is
constrained to have that same form, and it may elect to be optimistic in the presence
of compatible forms on a, b, c but may leave unconstrained any triple for which the
profile is incompatible.
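The three situations, and the example rule just described, can be sketched in Python as follows. The encoding is our own assumption: for a fixed triple, each tree's form is given by the label of the species separated from the other two (so "c" stands for ((a, b), c)), with None for the unresolved form, and the string "*" stands for the unconstrained value ∗. The function names are hypothetical.

```python
UNCONSTRAINED = "*"  # stands for the value ∗ of a partial rule


def classify(forms):
    """Classify a profile restricted to one triple.

    forms holds one entry per tree: the outgroup label if the tree resolves
    the triple, or None if the triple is unresolved in that tree.
    Returns the most specific of 'constant', 'compatible', 'incompatible'.
    """
    if len(set(forms)) == 1:
        return "constant"
    resolutions = {form for form in forms if form is not None}
    if len(resolutions) <= 1:
        # All trees that resolve the triple agree on the resolution.
        return "compatible"
    return "incompatible"


def example_rule(forms):
    """Keep a constant form, be optimistic on compatible profiles,
    and leave incompatible profiles unconstrained."""
    kind = classify(forms)
    if kind == "constant":
        return forms[0]
    if kind == "compatible":
        return next(form for form in forms if form is not None)
    return UNCONSTRAINED
```

Note that a constant profile is also compatible in the sense defined above; classify simply reports the more specific case first.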
In all of our following discussions, we restrict ourselves to profiles of two trees.
The techniques and most observations can be generalized.
4. Specific total local consensus rules. As examples of natural local consensus rules, we will define two total local consensus rules: the optimistic local consensus
(OLC) rule and the pessimistic local consensus (PLC) rule. These are not the only
natural local consensus rules that are worthy of study, but the techniques used for
constructing local consensus trees for these rules are indicative of general approaches
for greatly speeding up the construction and verification phases used in the previous
section.
When the trees are not necessarily binary, the local consensus rule may encounter
triples for which the profile is not constant but is nevertheless compatible. Because a
total local consensus rule must constrain the form of each triple for the consensus tree,
it must determine whether to require that the rooted triple be resolved or unresolved.
This decision is based upon the interpretation of an unresolved triple, which can
be made in one of two ways: any resolution of the three-way split is possible or the
unresolved triple indicates a three-way speciation event. If the local consensus rule
chooses to interpret lack of resolution as being consistent with any resolution, then it
will constrain the output to be resolved according to the unique resolution present in
the profile, and otherwise it will constrain the output to be unresolved. The first type
of total local consensus rule is said to be optimistic and the second type pessimistic.
4 In this case the conservative nature of the rule need not be maintained.
We now define these two consensus rules.
Definition 4.1. Let T1 and T2 be two rooted trees on the same leaf set S. A
rooted tree T is called the OLC of T1 and T2 iff for each triple a, b, c, T |{a, b, c} =
((a, b), c) iff Ti |{a, b, c} = ((a, b), c) and Tj |{a, b, c} = ((a, b), c) or (a, b, c) for {i, j} =
{1, 2}.
Definition 4.2. Let T1 and T2 be two rooted trees on the same leaf set S. A
rooted tree T is called the PLC of T1 and T2 iff for each triple a, b, c, T |{a, b, c} =
((a, b), c) iff T1 |{a, b, c} = T2 |{a, b, c} = ((a, b), c).
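Restricted to a single triple, Definitions 4.1 and 4.2 can be written as two small functions. This is only a sketch under an assumed encoding: the form of a triple in a tree is given by the label of the species separated from the other two (so "c" stands for ((a, b), c)), and None stands for the unresolved form (a, b, c). The function names are ours.

```python
def olc_rule(form1, form2):
    """Optimistic rule: resolved iff one tree resolves the triple this way
    and the other tree either agrees or leaves the triple unresolved."""
    if form1 is not None and form2 in (form1, None):
        return form1
    if form1 is None and form2 is not None:
        return form2
    return None  # both unresolved, or conflicting resolutions


def plc_rule(form1, form2):
    """Pessimistic rule: resolved iff both trees resolve the triple identically."""
    return form1 if form1 == form2 else None
```

The two rules differ only on compatible but non-constant profiles: olc_rule("c", None) returns "c", whereas plc_rule("c", None) returns None.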
In the next two subsections we discuss efficient algorithms for these rules. But
first we give some basic and standard definitions.
Definition 4.3. Let T be a rooted tree with leaf set S. Given a node v ∈ V (T ),
we denote by L(Tv ) the set of leaves in the subtree Tv of T rooted at v. This is also
called the cluster at v and is represented by αv . The set C(T ) = {αv : v ∈ V (T )} is
called the cluster encoding of T .
Every rooted tree in which the leaves are labeled by S contains all singletons and
the entire set S in C(T ); these clusters are called the trivial clusters. We define a
maximal cluster to be the cluster defined by the child of the root. (Here we allow for
a maximal cluster to be defined by a leaf also.)
We also define the notion of compatibility of a set of clusters.
Definition 4.4. A set A of clusters is said to be compatible iff there exists a
tree T such that C(T ) = A.
The following proposition can be found in [17].
Proposition 4.1. A set A of clusters is compatible iff ∀αi , αj ∈ A, αi ∩ αj ∈
{αi , αj , ∅}.
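The cluster encoding of Definition 4.3 and the pairwise test of Proposition 4.1 can be sketched in Python as follows. The representation is an assumption on our part: a tree is a nested tuple whose leaves are strings, and a cluster is a frozenset of leaf labels; the function names are hypothetical.

```python
from itertools import combinations


def clusters(tree):
    """Return the cluster encoding C(T) of a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaf_set = frozenset([node])
        else:
            leaf_set = frozenset().union(*(visit(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    visit(tree)
    return found


def compatible(cluster_set):
    """Pairwise test of Proposition 4.1: every two clusters are nested or disjoint."""
    return all(a & b in (a, b, frozenset())
               for a, b in combinations(cluster_set, 2))
```

For the tree ((a, b), c), clusters returns the three singletons together with {a, b} and {a, b, c}; adding the cluster {b, c} makes the set incompatible, since {a, b} ∩ {b, c} = {b}.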
We now state a theorem which will be used in the later sections.
Theorem 4.1. Let T1 and T2 be two rooted trees on the same leaf set S and let
f be a conservative local consensus rule. If the f -local consensus tree T exists, then
C(T ) ∪ C(T1 ) and C(T ) ∪ C(T2 ) are compatible sets.
Proof. Suppose not and suppose without loss of generality that C(T ) ∪ C(T1 ) is
not a compatible set. Then by Proposition 4.1, ∃α ∈ C(T ) and β ∈ C(T1 ) such that
α ∩ β ∉ {α, β, ∅}. Pick a ∈ α ∩ β, b ∈ α − β, and c ∈ β − α. The topology of the
triple a, b, c in T1 is ((a, c), b) while in T it is ((a, b), c). Since f is a conservative local
consensus rule, this is impossible.
4.1. OLC. In this section we look at the problem of finding the OLC tree of two
trees defined in the previous section. Note that the OLC of two trees may not exist.
See Figure 1 for an example.
4.1.1. Characterization of the OLC tree. The following theorem characterizes
the OLC tree when it exists.
Theorem 4.2. Let T1 and T2 be two rooted trees on the same species set S. If
the OLC tree Tolc exists, then C(Tolc ) = A, where A = {α∗ | α∗ = α1 ∩ α2 , where
α1 ∈ C(T1 ) and α2 ∈ C(T2 ), and α∗ is compatible with both C(T1 ) and C(T2 )}.
Proof. Pick any cluster α ∈ A. If we look at any triple x, y, z with x, y ∈ α and
z ∉ α, then this triple will be resolved as ((x, y), z) in one tree and will be either
resolved the same or unresolved in the other tree. In either case, α ∈ C(Tolc ).
[Figure 1: trees T1 and T2 on the leaves a, b, c, d; the OLC is marked DOES NOT EXIST.]
Fig. 1. Example showing that the OLC need not always exist. The trees in the box are possible
candidates, but they each fail to maintain the necessary topology for some triple.
Conversely, pick any cluster α ∉ A. There are two cases here, namely, the case
when α is not compatible with at least one of C(T1 ) and C(T2 ) and the case when α
is compatible with both C(T1 ) and C(T2 ).
Now, when α is not compatible with at least one of C(T1 ) and C(T2 ), using
Theorem 4.1, we observe that α ∉ C(Tolc ).
For the second case, pick those smallest clusters α1 ∈ C(T1 ) and α2 ∈ C(T2 ) such
that α ⊆ α1 and α ⊆ α2 . (Note that the nodes v and u defining the clusters α1 and
α2 , respectively, are the LCAs in T1 and T2 , respectively, of the species in α.) Since α1
and α2 are the smallest clusters in T1 and T2 , respectively, containing α and since α
is compatible with both C(T1 ) and C(T2 ), this implies that α is the union of clusters
of at least two children of v and also the union of clusters of at least two children of
u. Moreover, ∃a, b ∈ α such that v = lcaT1 (a, b) and u = lcaT2 (a, b). Furthermore,
∃β ⊆ S, β ≠ ∅, such that α1 ∩ α2 = α ∪ β. Thus we can pick a c ∈ β and we have that
T1 |{a, b, c} = T2 |{a, b, c} = (a, b, c). But the topology given by having α ∈ C(Tolc ) is
((a, b), c). Thus α ∉ C(Tolc ).
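The set A of Theorem 4.2 can be computed directly from the two cluster encodings. The following Python sketch is a brute-force illustration only: it assumes clusters are given as frozensets of leaf labels, uses hypothetical function names, and is far from the linear-time construction developed below. It also does not check that Tolc exists; it only produces the cluster set that Tolc must have if it does.

```python
EMPTY = frozenset()


def compatible_with(cluster, cluster_set):
    """True iff cluster is nested in or disjoint from every member of cluster_set."""
    return all(cluster & other in (cluster, other, EMPTY)
               for other in cluster_set)


def olc_cluster_candidates(c1, c2):
    """Return A: intersections of a cluster of T1 with a cluster of T2
    that are compatible with both C(T1) and C(T2)."""
    intersections = {a1 & a2 for a1 in c1 for a2 in c2} - {EMPTY}
    return {c for c in intersections
            if compatible_with(c, c1) and compatible_with(c, c2)}
```

For T1 = ((a, b), c) and T2 = (a, (b, c)), the clusters {a, b} and {b, c} are each incompatible with the other tree, so A contains only the trivial clusters and the result is the star on a, b, c.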
4.1.2. Construction phase. Since the OLC rule is conservative, if the tree Tolc
exists, then C(Tolc ) ∪ C(T1 ) is a compatible set of clusters, and hence there exists a
tree T ∗ satisfying C(T ∗ ) = C(T1 ) ∪ C(Tolc ). If we can construct T ∗ by refining T1 ,
we can then reduce T ∗ by contracting all the unnecessary edges and thus obtain Tolc .
This is the approach we will take.
Note that this approach breaks the construction into two stages: refinement and
contraction.
Definition 4.5. We say that a tree T1 is a refinement of tree T2 if T2 can be
obtained from T1 by a sequence of edge contractions.
Refining T1 . The main objective is to refine T1 so as to include all the clusters
from Tolc . Before we explain how we do this precisely, we will introduce some notation
and lemmas from previous works which enable us to do this efficiently.
Definition 4.6. Let v be an arbitrary node in a tree T with children u1 , . . . , uk .
A representative set of v is any set {x1 , x2 , . . . , xk } such that xi ∈ αui . We denote
by rep(v) one such representative set.
Lemma 4.1. If the OLC tree Tolc of trees T1 and T2 exists and v ∈ T1 , then
Tolc |rep(v) is isomorphic to T2 |rep(v).
Proof. The proof follows from the fact that T1 |rep(v) is a star.
Definition 4.7. Let v be a node in a tree T with children u1 , u2 , . . . , uk . Then
N (v) is the subtree induced by {v, u1 , u2 , . . . , uk }.
We will do the refinement as follows. We will modify the tree T1∗ , where T1∗ is
initialized to T1 . In a postorder fashion, for every v ∈ V (T1 ) with representative set
{x1 , x2 , . . . , xk }, identify v ∗ = lcaT1∗ (αv ). It can be seen that v ∗ also has the same
number of children as v (since the processing is done in a postorder fashion). Say
these are u1 , u2 , . . . , uk . Replace the subtree T (v ∗ ), rooted at v ∗ in the following
manner: we replace N (v ∗ ) by an isomorphic copy of T2 |rep(v). Next, we replace xi
by the subtree of T1∗ rooted at ui .
Let T ∗ be the tree that is produced after considering all the nodes in T1 .
Theorem 4.3. Let T1 , T2 be given and suppose Tolc exists. Then the tree T ∗ that
is produced from the algorithm described in the previous paragraph satisfies C(T ∗ ) =
C(T1 ) ∪ C(Tolc ).
Proof. Since C(Tolc ) ∪ C(T1 ) is compatible, all we need to show is that Tolc |rep(v)
cannot be a proper refinement of T2 |rep(v). If it were, then for some {a, b, c} ⊆ rep(v),
Tolc |{a, b, c} would be resolved while T2 |{a, b, c} is unresolved. Since {a, b, c} ⊆ rep(v),
T1 |{a, b, c} is also unresolved, forcing Tolc to be also unresolved.
Note that we have reduced the problem of constructing T ∗ to the problem of
discovering T2 |rep(v) for each v ∈ T1 .
To have a linear time algorithm, however, we need to be able to compute T2 |rep(v)
quickly. We cite the following result from [18] which will be useful to us in this case.
Lemma 4.2 (see [18]). Given a left-to-right ordering of the leaves of a tree and
the ability to determine the topology of any triple of leaves a, b, c in constant time, we
can construct the tree in linear time.
To use this lemma we need two things:
(1) we must be able to determine the topology of any triple in T2 in O(1) time
and
(2) we must have for each node in T1 an ordered representative set, where the
ordering is consistent with the left-to-right ordering of the leaves in T2 .
To accomplish (1), we first preprocess T2 for LCA queries. Then, to determine
the topology for the triple a, b, c, we simply compare the LCAs of (a, b), (b, c), and
(a, c). The second requirement is more challenging but can also be handled, as we
now show.
Computing all ordered representative sets in O(n) time.
• Initially all nodes in T1 have empty labelings.
• For each s ∈ S, taken in the left-to-right ordering of the leaves in T2 , do the
following steps:
1. trace a path in T1 from the leaf for s toward the root, until encountering
either the root or a node which has already been labeled;
2. append s to the ordered set for each such node in the path traced (including the first node encountered which has already been labeled).
Figure 2 shows an example of the computation just described.
[Figure 2: trees T1 and T2; left-to-right ordering of the leaves of T2: a c d b e. Panels (i)-(iv) show T1 during the computation: a is added to the rep sets of u, v, and r; c is added to the rep sets of w and v. After completion: rep(u) = {a,b}, rep(v) = {a,c}, rep(r) = {a,e}, rep(w) = {c,d}.]
Fig. 2. Example showing the computation of the representative sets of nodes in T1 based on
the left-to-right ordering of species in T2 .
Note that this computation takes O(n) time since each node v is visited O(deg(v))
times and that the order produced is exactly as required. Thus, for each node v ∈
V (T1 ), we have defined a set of leaves such that each leaf is in a different subtree of
v, every subtree of v is represented, and the order in which these leaves appear is the
same as the left-to-right ordering in T2 .
We have thus proved Lemma 4.3.
Lemma 4.3. We can compute T2 |rep(u) in O(|rep(u)|) time.
We therefore have the following theorem.
Theorem 4.4. Given T1 and T2 , we can construct in O(n) time a tree T ∗ such that
C(T ∗ ) = C(T1 ) ∪ C(Tolc ), whenever Tolc exists.
The rest of the task of constructing Tolc is in the contraction of unneeded edges.
Contracting T ∗ . Now that T ∗ satisfies C(T ∗ ) = C(T1 ) ∪ C(Tolc ), we can simply
go through each edge in T ∗ and check if it needs to be kept or must be contracted. Note
that edges that were added during the refinement phase are required and do not need
to be checked. Therefore, we need only check the original tree edges. Let (u, v) be
such an edge with v = parent(u). From our representative sets for u and v we can
easily choose three species a, b, c such that lca(a, b) = u and lca(b, c) = v. If the
topology of this triple in T2 is resolved differently than ((a, b), c), then we know that
edge (u, v) will have to be contracted; if on the other hand T2 |{a, b, c} is either (a, b, c)
or ((a, b), c) then (u, v) will have to be retained in any OLC tree.
OLC Construction Algorithm
Phase 0: Preprocessing
Make copies T1′ and T2′ of T1 and T2 , respectively. For each node v in each tree Ti′
(i = 1, 2), compute ordered representative sets ordered by the left-to-right ordering
in the other tree. Preprocess each tree Ti′ to answer lca queries for leaves as well as
internal nodes.
Phase I: Refine T1′
Refine T1′ in a postorder fashion so that at the end C(T1′ ) = C(T1 ) ∪ C(Tolc ) if
Tolc exists.
Phase II: Contract T1′
Contract edges e ∈ E(T1′ ) such that ce , the cluster below e, lies in C(T1 )−C(Tolc ).
We have thus shown the following theorem.
Theorem 4.5. The algorithm stated above constructs the OLC of two trees T1
and T2 if the OLC exists.
Analysis of Running Time
Phase 0: Preprocessing
In [20], Harel and Tarjan give an O(n) time algorithm for preprocessing trees to
answer LCA queries in constant time. We have already shown that computing the
ordered representative sets takes O(n) time. Thus the preprocessing stage takes O(n)
time.
Phase I: Refining T1′
This stage involves local refinements of T1′ , and we have shown that the cost of
refining around node v is O(deg(v)). Summing over all nodes v we obtain O(n) time.
Phase II: Contracting edges
This stage clearly takes only O(n) time.
Theorem 4.6. Construction of the optimistic local consensus tree can be done
in linear time.
4.1.3. Verification phase. We have identified a candidate optimistic local consensus tree. We must now decide whether it really is such a tree or whether no such tree
exists.
Lemma 4.4. Let T be a tree on a leaf set S. Let T ∗ be obtained from T through a
sequence of refinements followed by a sequence of edge contractions. Then there exists
a function f : V (T ) → V (T ∗ ) such that for all v ∈ V (T ), there is a subset Sv of the
children of f (v) in V (T ∗ ) such that αv = ∪v′ ∈Sv αv′ .
Proof. We define f (v) = lcaT ∗ (αv ). Clearly, C(T ∗ ) ∪ C(T ) is a compatible
set of clusters. Therefore, there is a subset Sv of the children of f (v) such that
∪v′ ∈Sv αv′ = αv .
We take a slight detour and examine the verification of the OLC when the two
input trees are both binary. In this case no triple will be unresolved.
Definition 4.8. A caterpillar is a rooted binary tree with only one pair of sibling
leaves.
Given a leaf labeled caterpillar T ′ with root r and height h, there is a natural
ordering induced by T ′ on its leaves. Let g : S → {1, 2, . . . , h} be a function where
g(s) is the distance of s from r.
Then the species in S can be listed in increasing order as a1 , a2 , . . . , an ,
where ai ∈ S and g(a1 ) < g(a2 ) < · · · < g(an−1 ) ≤ g(an ). (Note that the pair of
sibling leaves has been arbitrarily ordered.)
Definition 4.9. Two caterpillars X and Y on the same leaf set are said to be
oppositely oriented iff for all k, the k smallest elements of X are contained among
the k + 1 largest elements of Y and vice versa. See Figure 3.
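Definition 4.9 can be checked literally on small inputs. The sketch below assumes each caterpillar is supplied as its induced leaf ordering (a list from the leaf nearest the root to the deepest leaf, with the sibling pair already ordered arbitrarily); the function name is ours, and the check is quadratic rather than linear, for clarity only.

```python
def oppositely_oriented(x_order, y_order):
    """Test the two containment conditions of the definition directly."""
    n = len(x_order)
    if n != len(y_order) or set(x_order) != set(y_order):
        raise ValueError("both caterpillars must have the same leaf set")

    def contained(p, q):
        # For every k, the k smallest elements of p must lie among the
        # k + 1 largest elements of q (all of q once k + 1 >= n).
        return all(
            set(p[:k]) <= set(q[max(0, n - (k + 1)):])
            for k in range(1, n + 1)
        )

    return contained(x_order, y_order) and contained(y_order, x_order)
```

For instance, a leaf ordering and its reversal are oppositely oriented, whereas a caterpillar is never oppositely oriented to itself once it has at least three leaves.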
[Figure 3: two caterpillars T1 and T2 on the leaf set {a, b, c, d, e, f}.]
Fig. 3. Example of oppositely oriented caterpillars.
Proposition 4.2. Let T1 and T2 be two rooted binary trees on the same leaf set
whose OLC is a star. If a, b is a sibling pair of leaves in T1 , then the LCA of a and
b in T2 must be the root of T2 .
Proof. Suppose Proposition 4.2 is not true. Then there is a species c such that
the LCA of (a, c) is above the LCA of (a, b) in T2 . Then T1 |{a, b, c} = T2 |{a, b, c} and
hence the OLC of T1 and T2 cannot be a star.
Lemma 4.5. Suppose T1 and T2 are binary trees on the same leaf set and suppose
that they each have at least five leaves. If their OLC tree is a star, then T1 and T2
must be caterpillars.
Proof. Suppose for contradiction that T1 is not a caterpillar. Then it has two
pairs of sibling leaves (a, b) and (c, d). By the previous proposition each of these pairs
must have the root as their LCA in T2 . Thus without loss of generality, a and c lie
in the left subtree of the root of T2 , and b and d lie in the right subtree of the root of
T2 .
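[Figure 4: T1 with sibling pairs (a, b) and (c, d); T2 with a, c, and x in the left subtree
of the root and b, d in the right subtree.]
Fig. 4. Topologies of T1 and T2 with respect to a, b, c, d, x.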
Let x be any other species besides a, b, c, and d (see Figure 4). Suppose without
loss of generality that x lies in the left subtree of the root of T2 . We will consider
the following two triples: x, a, d and x, c, b. In T2 the topology of these triples will be
((x, a), d) and ((x, c), b), respectively.
We will show that T1 agrees on at least one of these triples. There are two cases.
If x lies in the left subtree of the root of T1 , then the topology of the triple x, a, d in
T1 is clearly ((x, a), d) and if x lies in the right subtree of the root of T1 , then the
topology of the triple x, c, b in T1 is ((x, c), b). Thus in either case there is a triple in
T1 which agrees with a triple in T2 , and the OLC cannot be a star.
Lemma 4.6. Let T1 and T2 be two caterpillars on the same leaf set. Then the
OLC of T1 and T2 is a star iff T1 and T2 are oppositely oriented caterpillars.
Proof. Suppose the two caterpillars are oppositely oriented, i.e., they satisfy the
two intersection conditions. Let x, y, z be any three leaves and let their indices in the
ordering of the leaves of T1 be i < j < k, respectively. Then the topology of x, y,
and z in T1 is (x, (y, z)). Looking at the n − j smallest elements in T2 , this set must
contain y or z but cannot contain x. Consequently, the topology of the triple in T2 is
not (x, (y, z)) and the star is a valid OLC.
Conversely, suppose that the two caterpillars do not satisfy the intersection conditions. Without loss of generality, suppose that there exists at least one k such that
the k smallest elements of T2 are not contained within the k + 1 largest elements of
T1 . Pick the smallest such k. Say x is the leaf in T2 with rank k and x does not
belong to the set of k + 1 largest elements of T1 . From the pigeonhole principle, there
will exist at least two leaves of T2 which have ranks greater than k but which are contained in the set of k + 1 largest elements of T1 . Suppose the two leaves are y and z.
Then T1 |{x, y, z} = T2 |{x, y, z} = (x, (y, z)). This implies that the OLC cannot be a
star.
Corollary 4.1. The OLC for two binary trees can be verified to be a star in
linear time.
Now we return to the general case of verifying the OLC of two trees.
Lemma 4.7. Suppose T is the OLC of T1 and T2 (on a leaf set S containing at
least five species). Then T is a star iff either one of the following holds:
1. both T1 and T2 are oppositely oriented caterpillars or
2. both T1 and T2 are stars.
Proof. The “if” direction is easy to see. We now assume that the OLC, T , is a
star. If T1 contains a triple a, b, c that is unresolved, T2 must also be unresolved on
a, b, c. Conversely whenever T1 is resolved on a, b, c, T2 must be (differently) resolved
on a, b, c. Thus either both T1 and T2 are binary or both are not.
In the case that both T1 and T2 are binary, we appeal to the proofs of Lemmas
4.5 and 4.6 to argue that T1 and T2 must be oppositely oriented caterpillars.
If T1 and T2 are not binary, we will show that for any node v in T1 with children
{u1 , . . . , uk }, k ≥ 3, there is a node v ′ in T2 with children {u′1 , . . . u′k } such that
αui = αu′i . Pick any three species a, b, c such that a, b, c is unresolved in T1 and let
v = lcaT1 (a, b, c). Then a, b, c must be unresolved in T2 . Let v ′ = lcaT2 (a, b, c). We
claim that αv = αv′ . To see why, suppose αv ≠ αv′ and suppose x ∈ αv , x ∉ αv′ ,
with x being in the same subtree under v as a. Then T1 |{b, c, x} = (b, c, x), whereas
T2 |{b, c, x} = ((b, c), x). This contradicts the assumption that T is a star. Thus
αv = αv′ .
Next, note that if x and y are under the same child of v in T1 but under different
children of v ′ in T2 , then there exists a z such that x, y, z is resolved in T1 but
unresolved in T2 . This would contradict the fact that T is a star. This establishes the
claim.
This implies that if there is a nonbinary node v that is not the root of T1 , we can
find two species a, b (a ≤ v, b ≤ v) and a species c, c ≰ v, such that T1 |{a, b, c} = T2 |{a, b, c}.
Thus the root must have three or more children in this case. But this means that if
any cluster defined by a child of the root contains two or more species, then there is
a triple on which T1 and T2 agree. Thus T1 and T2 must be stars.
The verification proceeds as follows:
Phase 0
Suppose the tree constructed by refining T1 and then contracting the edges in the
resulting tree is T . We will do the same modification on T2 , i.e., refine T2 using the
information from T1 and then contract the edges in the resulting tree as before. Call
this tree T ′ . Clearly, if T ′ is not isomorphic to T , we can terminate and output that
the OLC does not exist. This is because we know that a compatible set of clusters
defines a unique tree and we know that the OLC, if it exists, is uniquely characterized.
Phase 1
If Phase 0 is successful, we then verify further. We compute an ordered representative set for every node w in V (T ). For each node w in T , do the following
steps.
1. Check if the homeomorphic subtrees of T1 and T2 induced by rep(w) are both
stars or they are both oppositely oriented caterpillars. If they are neither of
these, then terminate and output that the OLC does not exist.
2. Identify the parent of w, say w∗ . Look at rep(w∗ ) excluding the representative
element which is below w. Call this set A. Identify the LCAs of rep(w) in
T1 and T2 . Check if there is a species that belongs to A which lies below the
LCA of rep(w) in both T1 and T2 . If so, terminate and output that the OLC
does not exist.
Implementation of step 1 of Phase 1. Using the left-to-right ordering of the species
in T1 , compute the ordered representative set rep at each node in T as shown in the
previous section. For any u ∈ V (T ), to be able to quickly compute the homeomorphic
subtree of T2 induced by the species in rep(u), we need to know the ordering of these
species as they appear in the left-to-right ordering of T2 . We associate with each u, a
new rep set, rep∗ (u), which is the rearranged version of the species in rep(u) according to their ordering in T2 . We define a function, limit : S → V (T ), which specifies
for each s ∈ S the node v ∈ V (T ) closest to the root of T such that s ∈ rep(v).
The function limit together with the left-to-right ordering of the species in T2 help in
filling the rep∗ sets, since s will belong to the rep∗ sets of all nodes in the path from
s to limit(s). We first show how to compute limit(s) for all s ∈ S using algorithm LIMIT
and then we show how the rep∗ sets are filled.
Initialization:
   limit(s) = +∞ for all s ∈ S.
Procedure LIMIT
For each v ∈ V (T ) visited in a top-down traversal of T ,
do {
   Identify rep(v)
   For each s ∈ rep(v) such that limit(s) = +∞
      set limit(s) = v
} enddo
Once limit(s) has been identified for all s ∈ S, we proceed to compute rep∗ (u) for all u ∈
V (T ) as follows. Look at the left-to-right ordering of the species in T2 . Now, for each
species s in the left-to-right order, we trace a path in T from the leaf for s toward
the root of T and add s to the rep∗ set of each node encountered in this path. We
terminate when we reach limit(s).
Note that this process of identifying rep and rep∗ has to be done only once.
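The two steps just described, procedure LIMIT followed by the filling of the rep∗ sets, can be sketched as follows. This is an illustrative sketch, assuming T is given both as a node-to-children map and as a child-to-parent map (root maps to None), and that rep has already been computed; the function names are ours, and a breadth-first traversal is used as one possible top-down traversal.

```python
from collections import deque


def compute_limit(children, root, rep):
    """limit(s): the node closest to the root whose rep set contains s."""
    limit = {}
    queue = deque([root])
    while queue:                      # breadth-first, hence top-down
        v = queue.popleft()
        for s in rep.get(v, []):
            limit.setdefault(s, v)    # only the first (topmost) hit is kept
        queue.extend(children.get(v, []))
    return limit


def compute_rep_star(parent, limit, order_t2):
    """rep*(u): the species of rep(u), rearranged in the leaf order of T2."""
    rep_star = {}
    for s in order_t2:
        node = parent[s]
        while node is not None:
            rep_star.setdefault(node, []).append(s)
            if node == limit[s]:
                break                 # s appears in no rep set above limit(s)
            node = parent[node]
    return rep_star
```

Each species s is appended once per node on the path from its leaf to limit(s), so the total work is proportional to the total size of the rep sets.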
Analysis of running time. The isomorphism test in Phase 0 can be performed in
O(n) using a simple modification of the tree-isomorphism testing algorithm in [2].
There is an O(n) cost for preprocessing of T1 and T2 to answer LCA queries in
Phase 1.
Our implementation of step 1 of Phase 1 involves a one-time O(n) cost in preprocessing to identify rep and rep∗ for each node in T . Then each time step 1 is called
on a node w ∈ V (T ), an additional time of O(deg(rep(w))) is taken.
Exploiting the fact that T1 and T2 have been preprocessed to answer LCA
queries, it can be seen that each step 2 of Phase 1 takes O(deg(w) + deg(w∗ )).
Thus the total time taken in the verification phase is O(n).
Correctness of our verification procedure. See Theorem 4.7.
Theorem 4.7. If T passes the above tests, then T is the OLC of T1 and T2 .
Proof. We need only show that T handles every triple properly. Each of the
following cases is handled assuming T has passed the isomorphism test.
Case 1. If T passes the isomorphism test with T ′ , then any triple a, b, c such that
the two trees resolve a, b, c differently will be unresolved in T . This follows since T
is created by refining and then contracting both T1 and T2 , and these actions cannot
take a resolved triple into a different resolution.
Case 2. This involves a triple a, b, c having the same topology ((a, b), c) in both
T1 and T2 . We claim that the first step of Phase 1 will pass only if the topology of
this triple is ((a, b), c). To see why, suppose a, b, c is unresolved in T . (a, b, c cannot
be resolved as (a, (b, c)) or ((a, c), b) in T .) Look at the nodes u and v, which are the
LCAs of a, b in T1 and T2 , respectively. The node w in T , which is the lca(a, b, c),
is also lca(a, b) (since a, b, c is unresolved). We infer that f (u) = w, where f is the
function as defined in Lemma 4.4. This is because any node above w will contain the
species c and any node below w will not contain either a or b. By a similar argument,
f (v) = w. Now, when we look at rep(w) and compute the homeomorphic subtrees
of T1 and T2 induced by rep(w), in both of these induced trees, there will exist three
species x, y, z such that x, y are both below u (and v) in T1 (and T2 ) and z is not in
the character defined by u (and v). Thus in both the induced trees, the triple x, y, z
will have the same topology ((x, y), z). That is, these induced trees will neither be
both stars nor both oppositely oriented caterpillars. Thus the verification process will
terminate and output that the OLC does not exist.
Case 3. This involves a triple a, b, c which is resolved as ((a, b), c) in one tree and
unresolved in the other. The proof of this case essentially follows the lines of the proof
of Case 2.
Case 4. This involves a triple a, b, c which is unresolved in both trees. We claim
that the second step of Phase 1 will pass only if this triple is unresolved in T . To
see why, suppose a, b, c is resolved as ((a, b), c) in T . Let lcaT (a, b, c) = x and let
lcaT (a, b) = y and also suppose without loss of generality that x is the parent of y.
Let y1 be the child of y such that a ∈ αy1 and let y2 be the child of y such that
b ∈ αy2 . Let z ≠ y be the child of x such that c ∈ αz .
Let u = lcaT1 (a, b, c) and v = lcaT2 (a, b, c).
We will look at functions f1 and f2 defined by Lemma 4.4 from V (T ) to V (T1 )
and V (T2 ), respectively. Clearly f1 (y) = u and f2 (y) = v. Note that the cluster
defined by any child of u can have a nonempty intersection with at most one of αy1
and αy2 . This is similar for v. Thus any representatives chosen from αy1 and αy2 ,
respectively, have their LCA at u in T1 and at v in T2 . However, f1 (z) ≤T1 u and
f2 (z) ≤T2 v. Thus any representative chosen from αz will lie below u and v in T1 and
T2 , respectively, causing us to conclude that the OLC does not exist.
4.2. PLC. Recall the definition of the PLC tree: Let T1 and T2 be two rooted
trees on the same leaf set S. A rooted tree T is called the PLC of T1 and T2 iff for
each triple a, b, c, T |{a, b, c} = ((a, b), c) iff T1 |{a, b, c} = T2 |{a, b, c} = ((a, b), c).
Just like the OLC, the PLC tree need not always exist either.
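On small inputs the PLC definition can be checked triple by triple. The following brute-force sketch assumes each tree is represented by the collection of leaf sets (clusters) of its internal nodes, so that a triple is resolved as ((a, b), c) exactly when some cluster contains a and b but not c; the function names are ours, and the cubic running time makes this suitable only as a reference check.

```python
from itertools import combinations


def resolved_pair(clusters, triple):
    """The two species of the triple grouped by some cluster, else None."""
    members = set(triple)
    for cluster in clusters:
        inside = members & cluster
        if len(inside) == 2:
            return frozenset(inside)
    return None


def is_plc(t, t1, t2, leaves):
    """True iff t resolves a triple exactly when t1 and t2 resolve it identically."""
    for triple in combinations(sorted(leaves), 3):
        wanted = resolved_pair(t1, triple)
        if wanted != resolved_pair(t2, triple):
            wanted = None             # the inputs disagree: must stay unresolved
        if resolved_pair(t, triple) != wanted:
            return False
    return True
```

For example, with T1 having clusters {a, b} and {a, b, c}, and T2 having clusters {a, b} and {a, b, d}, the tree whose only nontrivial cluster is {a, b} passes the check, while the star does not.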
4.2.1. Characterization. The following theorem characterizes the PLC tree of
two trees T1 and T2 .
Theorem 4.8. Let T1 and T2 be two trees on the same leaf set S. If the PLC tree
Tplc of T1 and T2 exists, then it is identically equal to T , where C(T ) = C(T1 )∩C(T2 ).
Proof. Pick any cluster α ∈ C(T ). Since α belongs to both the trees, if we look
at any triple x, y, z with x, y ∈ α and z ∉ α, then this triple will have to be resolved
as ((x, y), z). Thus α ∈ C(Tplc ).
Conversely, pick any cluster α ∉ C(T ). We have two subcases here.
1. α is not compatible with at least one of C(T1 ) or C(T2 ). In this case, from
Theorem 4.1, α ∉ C(Tplc ).
2. α is compatible with both C(T1 ) and C(T2 ). In this case, pick those nodes
from T1 and T2 that define the smallest clusters containing α. We can pick
a triple a, b, c such that a ∈ α, b ∈ α, c ∉ α and this triple is unresolved in
either T1 or T2 . Thus α ∉ C(Tplc ).
4.3. Construction phase. By Theorem 4.8, the PLC tree, if it exists, is identically the strict consensus tree. Thus to construct the PLC tree, it suffices to use the
O(n) algorithm in [9] for the strict consensus tree.
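For illustration, the strict consensus can also be obtained naively by intersecting cluster sets. The sketch below is not the O(n) algorithm of [9]; it assumes each tree is given as a node-to-children map, records the leaf set of every internal node (including the root, which is our convention here), and uses function names of our own choosing.

```python
def clusters(children, root):
    """Leaf sets of all internal nodes of a tree given as node -> children."""
    found = set()

    def leaves_below(v):
        kids = children.get(v, [])
        if not kids:
            return frozenset([v])         # v is a leaf
        below = frozenset().union(*(leaves_below(k) for k in kids))
        found.add(below)
        return below

    leaves_below(root)
    return found


def strict_consensus_clusters(children1, root1, children2, root2):
    """C(T) = C(T1) ∩ C(T2), computed by direct set intersection."""
    return clusters(children1, root1) & clusters(children2, root2)
```

The resulting cluster set is compatible, since it is a subset of C(T1), and therefore defines a unique tree.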
4.3.1. Verification phase. Let T1 and T2 be the input trees, and let T be the
strict consensus tree constructed using the algorithm in [9]. We want to be able to
verify whether T is actually the PLC in the case that T is a star. If T1 or T2 is already
a star then there is nothing to verify since T is the true PLC. So assume that this is
not the case.
There are two cases which we will consider. The first is when either of T1 or T2
(say T1 ) has at least two children of the root which are not leaves. The second case
is when both T1 and T2 have exactly one child of the root which is not a leaf. Having
made observations about these cases, we can apply a divide and conquer strategy as
seen by the following lemma.
Lemma 4.8. Let T1 and T2 be rooted trees on the same leaf set and let α be a
cluster in their intersection. Let T be the strict consensus tree of T1 and T2 . Let
e1 , e2 , and e be the edges in T1 , T2 , and T respectively, that are above the respective
internal nodes which define the cluster α. Let a be a species in α. Then T is a PLC
for T1 and T2 iff
(1) the subtree below e is a PLC for the subtrees below e1 and e2 , and
(2) upon replacing the subtrees below e, e1 , and e2 by a in T, T1 , and T2 , respectively, T is a PLC for T1 and T2 .
Proof. Clearly, if T is the PLC tree for T1 and T2 then conditions (1) and (2) will
hold. Conversely, if (1) and (2) hold, but T is not the PLC tree for T1 and T2 , then
there is some triple a, b, c such that T incorrectly handles this triple. If all of a, b, c
are below e then by condition (1), T handles a, b, c correctly. Similarly if at least two
are above e, then by condition (2), T handles this triple correctly. It remains to show
that T handles all triples where exactly two of a, b, c are below and one is above the
edge e. But then, since the cluster α ∈ C(T1 ) ∩ C(T2 ) = C(T ), in each of T1 , T2 , and
T , we have ((a, b), c), so that T handles this triple properly. Thus T is a PLC for T1
and T2 .
Thus the verification proceeds by traversing T in a postorder fashion and at
the end of each successful verification step replacing the subtree by a single element
belonging to the cluster defined by the root of the subtree. We now discuss the details
of each verification step.
Lemma 4.9. Suppose T1 and T2 are two trees on the same leaf set S with T1
having at least two children of the root which are not leaves. Let α1 , . . . , αl be the
maximal clusters of T1 and β1 , . . . , βm be the maximal clusters of T2 . Then T , their
PLC, is a star iff ∀i, j |αi ∩ βj | ≤ 1.
Proof. Suppose ∀i, j |αi ∩βj | ≤ 1. This means that ∀x, y, if lca(x, y) in T1 is below
the root, then in T2 , lca(x, y) is the root. Thus for any triple x, y, z, their topologies
in T1 and T2 do not agree. Thus T is a star.
Suppose ∃i, j |αi ∩ βj | > 1. Thus αi is defined by a node which is not a leaf. Look
at an αk , k ≠ i, such that the node in T1 defining αk is not a leaf node. There are
two cases to handle here. Either at least one species in αk is not in βj or all species
in αk are in βj (i.e., αk ⊂ βj ).
In the former case, pick that species z that is in αk but not in βj . Also pick those
two species x, y that are in αi ∩ βj . Both T1 and T2 agree on the triple x, y, z; namely
this triple has topology ((x, y), z) in both the trees. Thus T cannot be a star.
In the latter case, since we know that βj ≠ S, we can pick two species x, y from
αk and another species z from S − βj . In both T1 and T2 , the topology of this triple
is ((x, y), z). Thus T cannot be a star.
Since each species belongs to at most one of these maximal clusters in each tree,
this test can be done in linear time.
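The linear-time test can be sketched as follows: label each species with the index of its maximal cluster in each tree and look for a repeated pair of indices. The sketch assumes the maximal clusters of each tree are supplied as lists of sets that partition S (leaf children of the root appear as singletons); the function name is ours, and the precondition of Lemma 4.9 on T1 is not checked.

```python
def pairwise_intersections_at_most_one(max_clusters1, max_clusters2):
    """True iff |alpha_i ∩ beta_j| <= 1 for all maximal clusters alpha_i, beta_j."""
    index1 = {s: i for i, cluster in enumerate(max_clusters1) for s in cluster}
    index2 = {s: j for j, cluster in enumerate(max_clusters2) for s in cluster}
    seen = set()
    for s, i in index1.items():
        key = (i, index2[s])
        if key in seen:
            return False      # two species share both maximal clusters
        seen.add(key)
    return True
```

Each species is examined a constant number of times, so the running time is O(n) in expectation with hash-based sets.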
The following lemma handles the case when both T1 and T2 have exactly one
child of the root which is not a leaf.
Lemma 4.10. Suppose T1 and T2 are two trees on the same leaf set S and T ,
their PLC, is a star. Suppose both T1 and T2 have exactly one child of the root each
which is not a leaf. Let s1 , . . . , sk be leaves in T1 which are children of the root. Let
v be the LCA in T2 of s1 , . . . , sk . Then every child of v contains at most one species
x ∈ S − {s1 , . . . , sk }. Moreover, for any pair of species x, y ∈ S − {s1 , . . . , sk }, the
LCA of x and y in T2 lies on the path from v to the root.
Proof. Suppose ∃ a child of v which contains at least two species from S −
{s1 , . . . , sk }. Then by picking x, y such that they both lie under this child of v in T2
and picking an si out of s1 , . . . sk that lies under a different child of v, we find that
both trees have the same topology for the triple x, y, si . Thus T cannot be a star.
Furthermore, if ∃x, y ∈ S − {s1 , . . . , sk } such that lca(x, y) in T2 does not lie on the
path from v to the root, then the triple x, y, s1 would have identical topologies in both
trees and T would not be a star.
Definition 4.10. A rooted tree T is a millipede if the set of internal nodes of T
defines a single path from the root to a leaf. See Figure 5.
Let S1 = S − {s1 , s2 , . . . , sk }. We have that T2 |S1 is a millipede (say, T2∗ ).
Let u1 , . . . , ul be the children of the root in T2∗ , which are leaves. Look at T1 |S1
(say, T1∗ ). Either, T1∗ has one nonleaf child or it has at least two nonleaf children. In
the former case, we can apply the previous lemma and infer that T1∗ |(S1 −{u1 , . . . , ul })
will be a millipede. In the latter case, we can apply Lemma 4.9 to check if the PLC
is a star.
[Figure 5: a millipede on the leaves a through i.]
Fig. 5. An example of a millipede.
In the following subsection we will show how to verify if T is a star when both
the input trees are millipedes.
4.3.2. Verification when both the input trees are millipedes. The proof
of the following lemma is straightforward.
Lemma 4.11. Suppose T1 and T2 are two millipedes on the same leaf set S.
Then their PLC T is a star iff there exists no triple such that both trees have the same
resolved topologies on the triple.
We now describe a linear time algorithm for verifying that T1 and T2 have no
triple on which they have the same topology.
We define an ordering on the species in T1 using the function f : S → {1, . . . , h},
where f (s) = distance of s from the root of T1 and h is the height of T1 .
In T2 , we can write S as the union of all the sets in the sequence S1 , S2 , . . . , Sk ,
where k is the height of T2 and each Si contains exactly those species which are at a
distance i from the root of T2 . Now, in each Si replace each species s in this set with
f (s). Call this multiset of integers Mi . We thus get a sequence M1 , M2 , . . . , Mk of
multisets.
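The construction of the sequence M1 , . . . , Mk can be sketched as follows, assuming the distance of every species from the root is already known in both millipedes and is supplied as two dictionaries; the function name and the list-of-lists representation of the multisets are our own choices.

```python
def multiset_sequence(depth_t1, depth_t2):
    """Return [M1, ..., Mk]: Mi holds f(s) for every species s at distance i in T2.

    depth_t1: species -> distance from the root of T1 (the function f).
    depth_t2: species -> distance from the root of T2.
    """
    k = max(depth_t2.values())
    sequence = [[] for _ in range(k)]
    for s, d in depth_t2.items():
        sequence[d - 1].append(depth_t1[s])   # distances start at 1
    return sequence
```

Repeated values are kept, which is why each Mi is a multiset rather than a set.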
Definition 4.11. We will say a triple of integers p, q, r is special if
• p < q, p < r;
• p ∈ Mj1 , q ∈ Mj2 , r ∈ Mj3 , with 1 ≤ j1 < j2 ≤ k and 1 ≤ j1 < j3 ≤ k.
We observe that the PLC of T1 and T2 is a star iff no special triple p, q, and r
exists.
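As a reference point, Definition 4.11 can be tested by brute force: a special triple exists exactly when some entry p has at least two strictly larger entries in strictly later multisets. The sketch below assumes the sequence is given as a list of lists of integers; the function name is ours and the running time is quadratic.

```python
def has_special_triple(sequence):
    """Brute-force test for a special triple in the sequence M1, ..., Mk."""
    items = [(j, x) for j, multiset in enumerate(sequence) for x in multiset]
    for jp, p in items:
        later_and_larger = sum(1 for jq, q in items if jq > jp and q > p)
        if later_and_larger >= 2:
            return True       # p with any two such entries forms a special triple
    return False
```

Such a check is useful mainly for validating a linear-time implementation on small random inputs.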
The following CHECK PLC algorithm takes as input the sequence M1 , M2 , . . . , Mk
and returns FAIL if there exists a special triple of integers, and otherwise it returns
PASS.
CHECK PLC works by scanning the multiset Mi in the ith iteration. It makes use
of three variables global min, local min, and temp. At the start of the ith iteration,
global min stores the smallest integer seen in the first i − 1 multisets. The variable
local min is used to store the smallest integer a such that ∃b for which a < b and
a ∈ Mj , b ∈ Ml with 1 ≤ j < l < i. (local min is initialized to +∞.) The variable
temp is initialized to 0. As long as temp remains 0, local min = +∞. If temp is
nonzero, then local min stores a and temp stores some b for which the previously
mentioned relationship between a and b holds. At the ith iteration, CHECK PLC
either returns FAIL (if a special triple exists) or, if necessary, it modifies the variables
global min, local min, and temp to hold their intended values for the first i multisets
of the sequence.
The reasoning for storing these values at the start of the ith iteration is as follows.
If ∃p in some Mj , and q, r ∈ Mi (1 ≤ j < i) such that p, q, r is a special triple, then
global min together with q, r ∈ Mi are also a special triple since global min ≤ p.
Similarly, if ∃p in some Mj , q ∈ Ml , r ∈ Mi (1 ≤ j < l < i), such that p, q, r is a
special triple, then local min, temp, and r ∈ Mi are also a special triple.
We now describe CHECK PLC.
Initialization:
   global min = Min(M1 )
   local min = +∞
   temp = 0.
The procedure outputs FAIL (and terminates) if the PLC is not a star; it outputs
PASS otherwise.
Procedure CHECK PLC
For 2 ≤ i ≤ k,
do {
   If temp = 0, then Step 1, else Step 2.
   Step 1
   do {
      Scan through Mi ;
      Identify A = {y | y ∈ Mi , global min < y};
      If |A| ≥ 2, then output FAIL;
      If |A| = 1, then set temp = y, where y ∈ A
                           local min = global min
                           global min = Min{global min, Min(Mi )};
      If |A| = 0, then set global min = Min(Mi ).
   } enddo
   Step 2
   do {
      Scan through Mi ;
      Identify A = {y | y ∈ Mi , global min < y};
      Identify B = {z | z ∈ Mi , local min < z};
      If either |A| ≥ 2 or |B| ≥ 1, then output FAIL;
      Else
         If |A| = 1 then
            If global min < Min(Mi ), then set local min = global min
                                               temp = Min(Mi );
            If global min ≥ Min(Mi ), then set local min = global min
                                               temp = y, where y ∈ A
                                               global min = Min(Mi );
         If |A| = 0 then set global min = Min(Mi ).
   } enddo
} enddo
Output PASS
Analysis of running time. CHECK PLC runs in linear time since each Mi is
scanned only a constant number of times.
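A compact linear-time scan that maintains the two invariants described above (the smallest value seen so far, and the smallest value a that already has a larger value b in a later multiset) can be sketched as follows. This is our own condensed sketch rather than a line-by-line transcription of CHECK PLC: the variable temp is omitted because only its zero or nonzero status matters, every multiset is assumed to be nonempty, and the names are illustrative.

```python
import math


def check_plc(sequence):
    """Return "FAIL" if a special triple exists in M1, ..., Mk, else "PASS"."""
    global_min = min(sequence[0])   # smallest value in the multisets seen so far
    local_min = math.inf            # smallest a having a larger b in a later multiset
    for multiset in sequence[1:]:
        above_global = sum(1 for y in multiset if y > global_min)
        above_local = any(z > local_min for z in multiset)
        if above_global >= 2 or above_local:
            return "FAIL"
        if above_global == 1:
            # global_min now has a witness; it is the smallest possible a.
            local_min = global_min
        global_min = min(global_min, min(multiset))
    return "PASS"
```

Each multiset is scanned a constant number of times, so the total running time is linear in the number of species.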
Theorem 4.9. Algorithm CHECK PLC is correct.
Proof. By induction, observe that Step 1 is executed at the ith iteration if ∀j, l, x,
where 1 ≤ j < l < i and x ∈ Ml , Min(Mj ) ≥ x. It then follows that if Step
1 is executed at the ith iteration, then at the start of that iteration temp = 0,
global min = Min(Mi−1 ), and local min = +∞. Thus, in this case global min stores
the smallest integer seen in the first i − 1 multisets. Now, in the first i multisets,
if any special triple p, q, r exists such that p ∈ Mj (j < i) and q, r ∈ Mi , then
CHECK PLC correctly outputs FAIL since global min ≤ p. Otherwise we have two
cases, depending upon the value of A. If |A| = 1, then the variables global min, temp,
and local min are updated so that global min holds the smallest value in the first i
multisets. Also, local min now correctly holds the smallest value a for which there
exists a b (stored in temp) for which a < b and a ∈ Mj , b ∈ Ml with 1 ≤ j < l < i. In
the other case |A| = 0, in which case global min is updated to hold Min(Mi ) (which
is the smallest value in the first i multisets).
Observe that once temp is updated to store a nonzero value, it never stores a 0
again. Thus, once temp is set to a nonzero value in iteration i′ , then from iteration
i′ + 1 to iteration k, Step 2 is executed.
Assume that Step 2 is executed in some iteration i′ and assume, inductively, that
at the start of iteration i′ , global min stores the smallest value in the first i′ − 1
multisets and local min stores the smallest value a for which there exists a b (stored
in temp) such that a < b and a ∈ Mj , b ∈ Ml with 1 ≤ j < l < i′ . Then in iteration i′ ,
it can be easily seen that CHECK PLC correctly outputs FAIL if there exists a special
triple p, q, r such that p ∈ Mi1 , q ∈ Mi2 (i1 < i2 < i′ ), r ∈ Mi′ or p ∈ Mi1 , q, r ∈ Mi′
(i1 < i′ ). Otherwise, for both the cases when |A| = 1 and |A| = 0, Step 2 ensures
that after iteration i′ , global min stores the smallest value in the first i′ multisets and
local min stores the smallest value a for which there exists a b (stored in temp) such
that a < b and a ∈ Mj , b ∈ Ml with 1 ≤ j < l ≤ i′ .
Using the above arguments, it can be seen that CHECK PLC gives the correct
output on any sequence of multisets.
Thus we also have the following theorem.
Theorem 4.10. Given two millipedes T1 and T2 , we can check if their PLC is a
star in linear time.
4.4. Summary. We have used three general techniques in constructing local
consensus trees for these two total local consensus rules:
• we characterize the local consensus tree (that is, we define the set C(T ) of
binary characters which encode the consensus tree T );
• we use the character encoding of the consensus tree if possible to construct
the tree efficiently; and
• we verify that the constructed tree is the local consensus tree.
Some comments about the construction phase are in order. When working with
conservative local consensus functions, assuming the local consensus tree T exists,
it is possible to construct the local consensus tree T in two phases: a refinement
phase in which one of the input trees Ti is refined to produce a tree T ∗ satisfying
C(T ∗ ) = C(Ti ) ∪ C(T ) and then edges are contracted in T ∗ to produce a tree T ∗∗
such that C(T ∗∗ ) = C(T ).
5. Optimization problems.
5.1. Introduction. The local consensus rules we have seen so far are such that
the output tree satisfying the constraints of a particular local consensus rule need
not exist. Yet characterizing these rules and developing fast algorithms for them
are important because if the consensus tree exists, then we can say something very
concrete about it. The nonexistence of the consensus tree in all cases does motivate
the need to look at the optimization versions of local consensus, where solutions
always exist. We will now describe some natural optimization problems for local
consensus tree construction. In these problems, which we call relaxed versions, we will
consider certain constraints to be absolutely required and let others be desirable but
not required. Then we seek a tree meeting all the required constraints and as many
of the desirable constraints as possible. We now define some obvious relaxed versions
but note that many other versions are equally desirable.
Recall our discussion in section 3.3 regarding a profile being constant, compatible,
or incompatible on a triple. The first optimization problem we consider is where
we insist that all triples on which the profile is incompatible, or is constant and
unresolved, are left unresolved, and then we seek to leave resolved a maximal set of
triples on which the profile is constant and resolved. This is relaxed version I (RV-I).
The second problem is where we insist that all the triples on which the profile is
constant and resolved be resolved the same way in the consensus tree, and then we seek
to leave unresolved a maximal set of the remaining triples. This is relaxed version
II (RV-II). The third problem is where we insist that all triples on which the profile
is incompatible, or is constant and unresolved, are left unresolved, and we seek to
leave resolved a maximal set of triples on which the profile is constant and resolved
or is compatible. This is relaxed version III (RV-III). In addition, RV-III insists that
all the resolved triples in the consensus tree be compatible with the profile. Finally,
we look at an interesting rule, LCR1, where we insist that all triples on which the
profile is constant and resolved, or is compatible, be left resolved. This tries to capture
the optimistic features of the OLC model. Unfortunately, the consensus tree defined by
LCR1 need not always exist. We give a counterexample to show this.
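The four situations a two-tree profile can be in on a triple can be sketched in code. The following is a minimal illustration, not part of the algorithms of this paper. It assumes trees are written as nested tuples of string labels, and it assumes the reading of section 3.3 suggested by the text above: constant means both trees agree, compatible means exactly one tree resolves the triple, and incompatible means the trees resolve it differently. The names `leaves`, `topology`, and `classify` are hypothetical.

```python
def leaves(t):
    """Leaf set of a tree written as nested tuples of string labels."""
    return {t} if isinstance(t, str) else set().union(*map(leaves, t))

def topology(t, triple):
    """Return frozenset({x, y}) if t resolves the triple as ((x, y), z), else None."""
    if isinstance(t, str):
        return None
    for child in t:
        inside = leaves(child) & triple
        if len(inside) == 3:
            return topology(child, triple)  # all three leaves lie below this child
        if len(inside) == 2:
            return frozenset(inside)        # this child separates a pair from the third leaf
    return None                             # three different children: unresolved

def classify(t1, t2, triple):
    """Classify the profile {t1, t2} on one triple (assumed reading of section 3.3)."""
    a, b = topology(t1, triple), topology(t2, triple)
    if a == b:
        return "constant, resolved" if a is not None else "constant, unresolved"
    if a is None or b is None:
        return "compatible"
    return "incompatible"

T1 = ((("a", "b"), "c"), "d")
T2 = (("a", "b"), "c", "d")
print(classify(T1, T2, {"a", "b", "c"}))
print(classify(T1, T2, {"a", "c", "d"}))
```

This brute-force classification recomputes leaf sets at every node, so it is meant only to make the definitions concrete.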
5.2. Specific relaxed versions.
Definition 5.1. Let T1 and T2 be two rooted trees (not necessarily binary) on
the same leaf set S. A rooted tree T is called an RV-I of T1 and T2 if whenever a
triple a, b, c has differing topologies on T1 and T2 , or both T1 and T2 leave a, b, c as
unresolved, then that triple is unresolved in T and in addition T preserves the topology
of a maximal set of triples which are resolved identically in T1 and T2 .
To prove the existence of an RV-I tree it is sufficient to show that there exists a
tree where every triple on which T1 and T2 disagree is unresolved. The set of trees
with this property can be partially ordered based on the set of triples (on which T1
and T2 agree) whose topology they preserve. Once this partial order is known to be
nonempty, we have proved the existence of an RV-I since any maximal element in this
partial order is such a consensus tree.
We note that if T has the star topology it leaves unresolved all triples on which
T1 and T2 disagree. Hence the partial order is nonempty and the RV-I tree always
exists. In section 5.3 we show that this tree is unique.
Definition 5.2. Let T1 and T2 be two rooted trees (not necessarily binary) on
the same leaf set S. A rooted tree T is called an RV-II of T1 and T2 if T preserves
the topology of all triples which are resolved identically in T1 and T2 . In addition, T
should leave unresolved a maximal set of triples on which T1 and T2 disagree or which
are unresolved in both T1 and T2 .
Using an argument similar to the one used to prove the existence of an RV-I tree
and noting that T1 (or T2 ) itself preserves the topology of all triples on which T1
and T2 agree, we conclude that the RV-II always exists. In section 5.4 we give an
algorithm to construct the RV-II tree.
Definition 5.3. Let T1 and T2 be two rooted trees on the same leaf set S. Let
T be a rooted tree on the same leaf set. Consider the following rules.
Rule 1a. If a triple a, b, c is resolved as ((a, b), c) in one tree and as (a, (b, c)) in
the other, we require that it be unresolved.
Rule 1b. If a triple a, b, c is unresolved in both the trees, then we require that it
be unresolved.
Rule 2. If a triple a, b, c is resolved as ((a, b), c) in one tree and is either resolved as
((a, b), c) or unresolved in the other tree, then we require it to be resolved as ((a, b), c).
The tree T is called the relaxed version III (RV-III) of T1 and T2 if
1. it always satisfies Rules 1a and 1b for triples;
2. it also satisfies Rule 2 for a maximal number of triples;
3. if a triple a, b, c is resolved as ((a, b), c) in T , then it is not resolved as (a, (b, c))
or ((a, c), b) in either T1 or T2 .
In section 5.5 we will show that an RV-III tree also always exists and is unique.
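To make Definition 5.3 concrete, the sketch below checks a candidate tree T against Rules 1a and 1b and the third requirement, and counts the triples on which Rule 2 applies and is satisfied. It is a brute-force illustration over all triples under the assumption that trees are nested tuples of string labels; `check_rv3` and its helpers are hypothetical names, and the function does not itself decide maximality.

```python
from itertools import combinations

def leaves(t):
    """Leaf set of a tree written as nested tuples of string labels."""
    return {t} if isinstance(t, str) else set().union(*map(leaves, t))

def topology(t, triple):
    """Return frozenset({x, y}) if t resolves the triple as ((x, y), z), else None."""
    if isinstance(t, str):
        return None
    for child in t:
        inside = leaves(child) & triple
        if len(inside) == 3:
            return topology(child, triple)
        if len(inside) == 2:
            return frozenset(inside)
    return None

def check_rv3(t1, t2, t):
    """Return (hard_rules_hold, number of triples on which Rule 2 applies and holds)."""
    hard_ok, rule2_hits = True, 0
    for names in combinations(sorted(leaves(t1)), 3):
        s = set(names)
        a, b, c = topology(t1, s), topology(t2, s), topology(t, s)
        conflict = a is not None and b is not None and a != b   # Rule 1a applies
        both_open = a is None and b is None                     # Rule 1b applies
        if (conflict or both_open) and c is not None:
            hard_ok = False
        if c is not None and any(r is not None and r != c for r in (a, b)):
            hard_ok = False                                     # third requirement violated
        if not (conflict or both_open) and c == (a if a is not None else b):
            rule2_hits += 1
    return hard_ok, rule2_hits

T1 = (("a", "b"), "c", "d")
T2 = ("a", "b", ("c", "d"))
print(check_rv3(T1, T2, (("a", "b"), ("c", "d"))))
print(check_rv3(T1, T2, ("a", "b", "c", "d")))
```

On this small profile the star satisfies the hard rules but no instance of Rule 2, while the tree ((a, b), (c, d)) satisfies Rule 2 on all four triples.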
In the next subsections, we will look at the different relaxed versions in greater
detail.
5.3. RV-I. In this subsection we will show that the RV-I of two rooted trees T1
and T2 is actually the strict consensus of these two trees.
Theorem 5.1. If T1 and T2 are two rooted trees, then their RV-I tree T always
exists and is identically the strict consensus of T1 and T2 .
Proof. The existence of the RV-I tree T was shown in section 5.2. Now we show
that this tree is the strict consensus tree. Suppose there exists a triple a, b, c resolved
differently in T1 and T2 as, say, ((a, b), c) and (a, (b, c)) (or (a, b, c)), respectively. Let
lcaT1 (a, b) = u and lcaT2 (b, c) = v. Clearly, neither αu nor αv is in the strict
consensus tree. Thus the strict consensus tree leaves unresolved any triple that has
different topologies in T1 and T2 .
Let T ′ be a tree in which for every triple a, b, c on which T1 and T2 differ, T ′ has
an unresolved topology on this triple. Now suppose that T ′ contains a
cluster that is not in C(T1 ) ∩ C(T2 ). Let α be this cluster and suppose without loss
of generality that α is not a cluster of T1 . In T ′ , for any pair of species x, y ∈ α and
species z ∉ α the topology has to be ((x, y), z). However, if this is also the case in
T1 , then T1 must also possess the cluster α, contradicting our assumption. Thus there
must exist a pair of species x, y ∈ α and a species z ∉ α such that in T1 their topology
is not ((x, y), z). But then T ′ resolves a triple that Definition 5.1 requires to be
unresolved, so T ′ cannot be an RV-I. Hence any candidate T ′
for an RV-I can only contain the clusters in the intersection of the cluster sets of T1
and T2 .
If T ′ contains a proper subset of the clusters in the intersection of the sets of
clusters of T1 and T2 , then there exists a triple a, b, c on which T ′ has an unresolved
topology while the strict consensus tree has a resolved topology that agrees with the
topologies of T1 and T2 . Hence the strict consensus of T1 and T2 is the RV-I tree of
T1 and T2 .
As a consequence, the RV-I can be constructed in O(n) time using the algorithm
in [9], and there is no need to verify that the tree constructed is correct.
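Theorem 5.1 reduces RV-I to intersecting the two cluster sets. The sketch below illustrates that characterization only; it is a simple quadratic construction, not the O(n) algorithm of [9]. Trees are assumed to be nested tuples of string labels, and `clusters`, `tree_from_clusters`, and `strict_consensus` are hypothetical helper names.

```python
def clusters(t):
    """Set of clusters (leaf sets below each node) of a nested-tuple tree."""
    found = set()
    def walk(node):
        below = frozenset([node]) if isinstance(node, str) \
            else frozenset().union(*map(walk, node))
        found.add(below)
        return below
    walk(t)
    return found

def tree_from_clusters(cs):
    """Rebuild a tree from pairwise compatible clusters that include all
    singletons and the full leaf set."""
    def build(c):
        if len(c) == 1:
            return next(iter(c))
        inside = [d for d in cs if d < c]
        # the children of c are the maximal clusters strictly inside c
        kids = [d for d in inside if not any(d < e for e in inside)]
        return tuple(build(d) for d in sorted(kids, key=sorted))
    return build(max(cs, key=len))

def strict_consensus(t1, t2):
    """Tree whose cluster set is C(T1) ∩ C(T2)."""
    return tree_from_clusters(clusters(t1) & clusters(t2))

T1 = ((("a", "b"), "c"), ("d", "e"))
T2 = ((("a", "b"), "d"), ("c", "e"))
print(strict_consensus(T1, T2))   # (('a', 'b'), 'c', 'd', 'e')
```

Only the cluster {a, b} is shared by the two trees here, so every triple on which they disagree is left unresolved.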
5.4. RV-II. In the RV-II problem we require that any triple on which the trees
T1 and T2 agree must have its topology preserved in the consensus tree T . Further T
should leave unresolved a maximal set of triples on which T1 and T2 disagree or both
leave unresolved.
Previously we showed that the RV-II exists. We note that the RV-II tree is
not unique. The construction of the RV-II can be accomplished by defining the set
A = {((a, b), c) : T1 |{a, b, c} = T2 |{a, b, c} = ((a, b), c)}. This set of rooted triples can
then be passed to the algorithm of Aho et al. [3], which computes a tree (if it exists)
having the required form on every triple in the set and also leaving a maximal set of
additional triples outside that set unresolved. The algorithm in [3] takes O(pn) time
where p = |A|. Recall the proof of Theorem 3.1 for a description of the algorithm.
Since in our case p ∈ O(n^3), the use of the algorithm of [3] would result in a running
time of O(n^4). We will obtain a speedup to an O(n^2) algorithm (which includes the
verification) for the construction of the RV-II tree by using the fact that the tree
necessarily exists.
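The direct construction via [3] can be sketched as follows. This is a naive illustration under stated assumptions, not the improved algorithm of section 5.4.1: trees are nested tuples of string labels, `agreed_triples` enumerates the set A by brute force, and `build` follows our reading of the algorithm of [3] (join x and y whenever a triple ((x, y), z) lies inside the current leaf set, then recurse on the components). All function names are hypothetical.

```python
from itertools import combinations

def leaves(t):
    return {t} if isinstance(t, str) else set().union(*map(leaves, t))

def topology(t, triple):
    """Return frozenset({x, y}) if t resolves the triple as ((x, y), z), else None."""
    if isinstance(t, str):
        return None
    for child in t:
        inside = leaves(child) & triple
        if len(inside) == 3:
            return topology(child, triple)
        if len(inside) == 2:
            return frozenset(inside)
    return None

def agreed_triples(t1, t2):
    """The set A: triples resolved identically in both trees, as (pair, third leaf)."""
    out = []
    for names in combinations(sorted(leaves(t1)), 3):
        s = set(names)
        pair = topology(t1, s)
        if pair is not None and pair == topology(t2, s):
            (z,) = s - pair
            out.append((pair, z))
    return out

def build(leafset, triples):
    """Return a tree displaying the triples, or None if no such tree exists."""
    if len(leafset) == 1:
        return next(iter(leafset))
    comp = {x: x for x in leafset}            # naive component labels
    def find(x):
        while comp[x] != x:
            x = comp[x]
        return x
    live = [(p, z) for p, z in triples if p <= leafset and z in leafset]
    for pair, _ in live:
        x, y = pair
        comp[find(x)] = find(y)               # edge x-y forced by ((x, y), z)
    groups = {}
    for x in leafset:
        groups.setdefault(find(x), set()).add(x)
    if len(groups) == 1:
        return None                           # connected graph: no tree
    kids = [build(g, live) for g in sorted(groups.values(), key=sorted)]
    return None if None in kids else tuple(kids)

T1 = ((("a", "b"), "c"), "d")
T2 = (("a", "b"), ("c", "d"))
print(build(leaves(T1), agreed_triples(T1, T2)))   # (('a', 'b'), 'c', 'd')
```

The two trees agree only on the triples ((a, b), c) and ((a, b), d), so the resulting tree keeps the cluster {a, b} and leaves everything else unresolved.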
5.4.1. An improved algorithm for RV-II. We will now describe an O(n^2)
time algorithm to construct an RV-II tree. We start by making a few observations
about the RV-II tree T constructed by the algorithm of [3].
We will use α’s to denote the clusters in T1 and β’s to denote the clusters in
T2 . Suppose α and β are maximal clusters in T1 and T2 , respectively, and suppose
α ∪ β ≠ S. Then we claim that α ∩ β (if nonempty) will be a maximal cluster in T .
This is because ∃a ∈ S − (α ∪ β) such that ∀x, y ∈ (α ∩ β), T1 |{x, y, a} = T2 |{x, y, a} =
((x, y), a) and thus the elements of (α ∩ β) all belong to one component of the graph
which is constructed in the execution of the algorithm of [3]. Furthermore, (α ∩ β) is
exactly equal to one component of this graph since the algorithm never adds an edge
between two nodes in the graph unless it is forced to and it can be seen that no element
x in (α ∩ β) is such that ∃y, a ∈ S − (α ∩ β) with T1 |{x, y, a} = T2 |{x, y, a} = ((x, y), a).
Thus, if α ∪ β ≠ S, then α ∩ β (if nonempty) is a maximal cluster in T . The case
where α ∪ β = S, α ∩ β ≠ ∅, can occur for at most one child of the root of T1 and one
child of the root of T2 as the following lemma shows.
Lemma 5.1. Let T1 and T2 be two trees on the same leaf set S. Let α1 , . . . , αk
be the maximal clusters of T1 and β1 , . . . , βl be the maximal clusters of T2 . Then the
case where αi ∪ βj = S, αi ∩ βj ≠ ∅ can occur for at most one i and one j.
Proof. Suppose not. Let αi ∪ βj = S, αi ∩ βj ≠ ∅, αi∗ ∪ βj∗ = S, and αi∗ ∩ βj∗ ≠ ∅,
perforce with i ≠ i∗ and j ≠ j∗. Since αi ∩ αi∗ = ∅, we have that αi ⊆ βj∗. But since
αi ∩ βj ≠ ∅, this implies that βj ∩ βj∗ ≠ ∅. This is a contradiction since βj and βj∗
are clusters defined by the children of the root and hence should be disjoint.
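Lemma 5.1 is easy to check on small examples. The sketch below lists every pair of maximal clusters (α, β) whose union is S and whose intersection is nonempty; by the lemma the list has at most one entry. It assumes nested-tuple trees whose roots have at least two children, and `covering_pairs` is a hypothetical name.

```python
def leaves(t):
    """Leaf set of a tree written as nested tuples of string labels."""
    return {t} if isinstance(t, str) else set().union(*map(leaves, t))

def covering_pairs(t1, t2):
    """Pairs (alpha, beta) of maximal clusters with alpha ∪ beta = S and alpha ∩ beta ≠ ∅."""
    S = leaves(t1)
    return [(a, b)
            for a in map(leaves, t1)      # maximal clusters of T1
            for b in map(leaves, t2)      # maximal clusters of T2
            if a | b == S and a & b]

T1 = (("a", "b", "c", "d"), "e")
T2 = ("a", ("b", "c", "d", "e"))
print(covering_pairs(T1, T2))
print(covering_pairs((("a", "b"), ("c", "d")), (("a", "c"), ("b", "d"))))
```

In the first profile the single covering pair is ({a, b, c, d}, {b, c, d, e}); in the second no pair of maximal clusters covers S.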
Recall that the maximal clusters form a partition of the species set S (in each of
T1 , T2 , and T ). Also, from the above discussions we have that (i) α ∪ β ≠ S implies
that α ∩ β (if nonempty) is a maximal cluster in T and (ii) there can be at most one
pair α, β for which α ∪ β = S and α ∩ β ≠ ∅. These observations imply that in the
case when α ∪ β = S, α ∩ β is the union of some maximal clusters of T .
With the above characterization a high-level description of the algorithm to construct T can be given as follows.
RV-II Construction Algorithm
1. For each pair of maximal clusters α ∈ C(T1 ) and β ∈ C(T2 ) such that α ∩ β ≠ ∅
and α ∪ β ≠ S, recursively compute the tree on α ∩ β and make its root a
child of the root of T .
2. If there are maximal clusters α and β such that α ∪ β = S and α ∩ β ≠
∅, compute the partition of α ∩ β; recursively compute the tree for each
component of the partition and make the roots of these trees children of the
root of T .
Computing the partition of α ∩ β in step 2 is described together with the implementation details.
Implementation details and running time analysis. Note that this algorithm does
not require an explicit verification of the constructed tree, since in fact we know that
the tree exists and we are simply computing it by mimicking efficiently what the
algorithm in [3] would create.
There are at most n recursive stages. We will show that each stage can be
implemented in O(n) time thereby proving the O(n^2) bound.
To handle case 1 it is important not to waste time on empty intersections. So
we consider each species in turn and label the intersection in which this species
lies. Thus we will identify at most n nonempty intersections. Let α ∩ β be one such
intersection. To recurse, we need to find homeomorphic subtrees of T1 and T2 that
have α ∩ β as the leaf set. We will show how to do this in time proportional to the
number of leaves in α ∩ β.
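The labeling step can be sketched as follows: each species is tagged with the index of its maximal cluster in T1 and in T2, and species with equal tags are grouped, so empty intersections are never touched. The sketch assumes nested-tuple trees; `nonempty_intersections` is a hypothetical name, and the helper `leaves` is written for brevity rather than for the linear-time bound.

```python
def leaves(t):
    """Leaf set of a tree written as nested tuples of string labels."""
    return {t} if isinstance(t, str) else set().union(*map(leaves, t))

def nonempty_intersections(t1, t2):
    """Group species by (index of maximal cluster in T1, index in T2)."""
    def child_index(t):
        return {x: i for i, child in enumerate(t) for x in leaves(child)}
    in1, in2 = child_index(t1), child_index(t2)
    buckets = {}
    for x in in1:                                   # one pass over the species
        buckets.setdefault((in1[x], in2[x]), set()).add(x)
    return buckets

T1 = (("a", "b", "c", "d"), "e")
T2 = ("a", ("b", "c", "d", "e"))
print(nonempty_intersections(T1, T2))
```

Here three of the four possible intersections are nonempty: {a}, {b, c, d}, and {e}.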
Assume that T1 and T2 have been preprocessed for LCA queries. Also note that
we know the left-to-right ordering of all leaves of T1 as well as of T2 . Given the leaves
in α∩β, their left-to-right ordering is also known and is the one induced by the overall
left-to-right ordering. By Lemma 4.2 we can reconstruct the topology of the tree in
linear time.
Thus case 1 can be handled in O(n) time.
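For illustration, the homeomorphic subtree induced by a leaf subset can also be obtained by a plain recursive traversal that drops unwanted leaves and suppresses nodes left with a single child. This is a simpler stand-in, not the LCA-based reconstruction of Lemma 4.2, and it runs in time proportional to the size of the whole tree rather than to |α ∩ β|. Trees are assumed to be nested tuples, and `restrict` is a hypothetical name.

```python
def restrict(t, keep):
    """Homeomorphic subtree of t induced by the leaves in `keep` (None if empty)."""
    if isinstance(t, str):
        return t if t in keep else None
    kids = [r for r in (restrict(c, keep) for c in t) if r is not None]
    if not kids:
        return None
    # a node left with one child is suppressed; otherwise it is kept
    return kids[0] if len(kids) == 1 else tuple(kids)

T = ((("a", "b"), "c"), ("d", "e"))
print(restrict(T, {"a", "c", "e"}))   # (('a', 'c'), 'e')
```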
We now describe how to handle case 2 also in O(n) time. We will construct a
graph G = (V, E) such that V (G) = α ∩ β. The edges will be added so that, finally,
each component in G corresponds to a maximal cluster in the RV-II tree.
Fig. 6. Figure showing nodes v and u. [Figure omitted: T1 with the node defining cluster α and, below it, node v; T2 with the node defining cluster β and, below it, node u.]
Identify the LCA, say, u, of the species in S − α in T2 and similarly the LCA, say,
v, of the species in S − β in T1 . In T2 , u will be a descendant of the node defining β,
and in T1 , v will be a descendant of the node defining α. See Figure 6. In T1 let v1
through vp be the nodes in the path from the root to v, where v1 = root and vp = v.
Similarly, in T2 , let u1 through uq be the nodes in the path from the root to u, where
u1 = root and uq = u. We will say that δ is a special cluster if for some vi , 1 ≤ i ≤ p
(or some uj , 1 ≤ j ≤ q), δ is a cluster defined by a child of vi (or uj ) that is not on
the path from the root to v (or u).
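The definition of special clusters can be read off a single walk from the root toward the LCA of a target set. The sketch below collects, at each node on that path, the clusters of the children that are off the path; at the LCA itself every child is off the path. It assumes nested-tuple trees and a nonempty target set; `special_clusters` is a hypothetical name and the code makes no attempt at the linear-time bound.

```python
def leaves(t):
    """Leaf set of a tree written as nested tuples of string labels."""
    return {t} if isinstance(t, str) else set().union(*map(leaves, t))

def special_clusters(t, targets):
    """Clusters hanging off the path from the root of t to the LCA of `targets`."""
    found, node = [], t
    while not isinstance(node, str):
        on_path = [c for c in node if targets <= leaves(c)]    # at most one child
        found += [leaves(c) for c in node if c not in on_path]
        if not on_path:
            break                      # node is the LCA of the targets
        node = on_path[0]
    return found

T = ((("a", "b"), ("c", "d"), "e"), "f")
print(special_clusters(T, {"a", "b"}))
```

In the setting of the text, `targets` would be S − β for T1 (giving the δ clusters) and S − α for T2 (giving the γ clusters).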
Let δ1 , . . . , δl be the special clusters in T1 and let γ1 , . . . , γm be the special clusters
in T2 . A pair of species x, y ∈ (α ∩ β) will be in the same component of the graph G
if ∃z such that T1 |{x, y, z} = T2 |{x, y, z} = ((x, y), z). There are two cases depending
on whether z ∈ (α ∩ β) or not. We will now describe how to handle these two cases:
Cases 2a and 2b.
Case 2a [z ∉ (α ∩ β)]. In this case, it suffices to look at all α ∩ γi and β ∩ δj , and
for each intersection put its elements in the same component of G. This is evident
from the following lemma.
Lemma 5.2. Let α, β be maximal clusters of T1 and T2 , respectively, such that
α ∪ β = S and let x, y ∈ (α ∩ β). Then ∃z ∈ S − (α ∩ β) such that T1 |{x, y, z} =
T2 |{x, y, z} = ((x, y), z) iff both x and y belong to some α ∩ γi or β ∩ δj .
Proof. Suppose ∃z ∈ S − (α ∩ β) such that T1 |{x, y, z} = T2 |{x, y, z} = ((x, y), z).
Since α ∪ β = S, the only cases we have to consider are when z is in exactly one of α
or β. So suppose z ∈ α, z ∈ S − β (the other case can be handled similarly). Then
z belongs to a special cluster δi , which is defined by some child of the node v in T1 .
(Recall that node v is the LCA of S − β in T1 .) Since T1 |{x, y, z} = ((x, y), z), we
have that either both x, y belong to δi or neither belongs to δi . If both x, y ∈ δi , then
clearly x, y ∈ (β ∩ δi ). For the case when neither x nor y is in δi , we can conclude
that both x, y are in some special cluster δj (since T1 |{x, y, z} = ((x, y), z)). Thus we
have that x, y ∈ (β ∩ δj ).
Suppose x, y belong to some α ∩ γi or β ∩ δj ; specifically, say x, y belong to some
β ∩ δj . There are two cases to handle. The first case is if the node v ′ defining the
special cluster is not a child of the node v. In this case, we can pick a species z ∈ S −β
such that T1 |{x, y, z} = T2 |{x, y, z} = ((x, y), z). The second case is when the node
v ′ is a child of the node v. In this case, pick a species z ∈ S − β from the special
cluster which is defined by a node v ′′ (where v ′ ≠ v ′′ ) and v ′′ is below v. We have
that T1 |{x, y, z} = T2 |{x, y, z} = ((x, y), z). Thus in both the cases we have that there
exists such a z with z ∈ S − (α ∩ β).
Thus, for each i, connect all vertices in α ∩ γi (in G) by a path and do the same
for each j and the vertices in β ∩ δj . Note that this can be done in O(n) by using the
same idea as in Case 1.
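Connecting each intersection by a path, rather than by a clique, keeps the number of edges linear while producing the same components. A minimal sketch of that idea, with hypothetical names `path_edges` and `components` and species given as strings:

```python
def path_edges(group):
    """Edges of a path through the members of one intersection (k - 1 edges for k members)."""
    order = sorted(group)
    return list(zip(order, order[1:]))

def components(vertices, edges):
    """Connected components of an undirected graph, by depth-first search."""
    adj = {v: [] for v in vertices}
    for x, y in edges:
        adj[x].append(y)
        adj[y].append(x)
    seen, out = set(), []
    for v in sorted(vertices):
        if v in seen:
            continue
        stack, comp = [v], set()
        seen.add(v)
        while stack:
            x = stack.pop()
            comp.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        out.append(comp)
    return out

edges = path_edges({"b", "c"}) + path_edges({"c", "d"})
print(components({"b", "c", "d", "g"}, edges))
```

Two intersections that share a species, such as {b, c} and {c, d} above, end up in one component.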
Case 2b [z ∈ (α ∩ β)]. Note that we are only interested in identifying x, y such
that lca(x, y) in T1 is a node that is on the path from the root of T1 to the node v,
and the lca(x, y) in T2 is a node that is on the path from the root of T2 to the node
u. To see why, if, say, lcaT1 (x, y) ≠ vi ∀ 1 ≤ i ≤ p, then ∃a ∈ S − β (i.e., a ∉ (α ∩ β))
such that T1 |{x, y, a} = T2 |{x, y, a} = ((x, y), a), and thus x and y will be in the same
component after Case 2a is handled.
From the preceding discussion, it suffices to convert the trees T1 and T2 , both
defined on the leaf set (α ∩ β), into millipedes T1′ and T2′ , respectively. T1′ is obtained
from T1 by contracting all edges above internal nodes not in the set {v1 , v2 , . . . , vp }.
T2′ is obtained from T2 similarly. Thus, we have to solve the following problem now:
we are given two millipedes T1′ and T2′ on the same leaf set S′ = (α ∩ β), where T1′ has
internal nodes labeled v′_1 (root of T1′) through v′_p, and each v′_i has leaves corresponding
to all the species in the special clusters of vi in T1 ; T2′ has internal nodes labeled u′_1
(root of T2′) through u′_q and is defined similarly. Our aim is to construct a graph
G′ = (V′, E′) where V′ = S′ such that if ∃x, y, z ∈ (α ∩ β) such that both T1′ and
T2′ resolve this triple as ((x, y), z) then x and y will be in the same component of
G′ . Once G′ is known, we add the edges of G′ to the edge set of G, and then the
components in G will give the maximal clusters we seek.
We will show how G′ can be constructed in O(n) time. Consider a node v′_{i−1}
in T1′ and let A be the set of leaves of v′_{i−1}. Let u′_{j−1} be the node in T2′ which is
closest to u′_1 and is the parent of some species in A. Then, clearly, in G′ all species in
(α_{v′_i} ∩ β_{u′_j}) need to be in one component. For every v′_{i−1} (2 ≤ i ≤ p), we will denote
this intersection by the pair (v′_i, u′_j). Further, observe that if (v′_i, u′_j) and (v′_{i′}, u′_{j′}) are
such that v′_{i′} is not above v′_i and u′_{j′} is not above u′_j, then (α_{v′_{i′}} ∩ β_{u′_{j′}}) ⊆ (α_{v′_i} ∩ β_{u′_j}).
Thus, when constructing the graph G′, we need only look at all the intersections
of the form (v′_i, u′_j), where for every pair of intersections (v′_{i′}, u′_{j′}) and (v′_i, u′_j), v′_i is
closer to v′_1 than v′_{i′} is, iff u′_{j′} is closer to u′_1 than u′_j is.
Let (v∗_1, u∗_1), (v∗_2, u∗_2), . . . , (v∗_r, u∗_r) be the intersections we are interested in, where
v∗_i is closer to v′_1 than v∗_{i+1} is (1 ≤ i ≤ (r − 1)), and u∗_{j+1} is closer to u′_1 than u∗_j is
(1 ≤ j ≤ (r − 1)). Note that v∗_1 = v′_2. This node and the given T1′ and T2′ uniquely
determine these intersections.
In T1′, we define the nearest parent of a species x to be the first v∗_i to appear on
the path from x to the root of T1′. Similarly, we can define the nearest parent of a
species in T2′. The nearest parents of all the species can be computed in O(n) time by doing
a simple traversal of T1′ and T2′. Using the nearest parents of the species in T1′, we
partition the species set into r sets S_{v∗_1}, . . . , S_{v∗_r}, where S_{v∗_i} contains all species which
have v∗_i as nearest parent.
Observe that if any two intersections (v∗_i, u∗_i) and (v∗_{i′}, u∗_{i′}) contain at least one
species in common, then all the species in the two intersections need to be in the same
component in G′. Inductively, if there are intersections (v∗_i, u∗_i), . . . , (v∗_j, u∗_j) such that
the species in these intersections need to be in one component in G′ and if there is an
intersection (v∗_k, u∗_k) which has a species x in common with one of these intersections,
then all the species in the intersection (v∗_k, u∗_k) need to be in the same component as
the species in the intersections (v∗_i, u∗_i), . . . , (v∗_j, u∗_j). The algorithm CONSTRUCT G′
we present now keeps track of such an x using the variable missing link, which is
initialized to an x ∈ (v∗_r, u∗_r) such that the nearest parent of x in T2′ (say u∗_j) is
farthest from the root (as compared with the nearest parents of the other species in
(v∗_r, u∗_r)). We will also use two additional variables: np missing link, which stores u∗_j,
and upper limit, which stores v∗_j.
Procedure CONSTRUCT G′
For i = r down to 1,
do{
    Identify y ∈ S_{v∗_i} such that the nearest parent of y in T2′ is
        farthest away from u′_1 (i.e., root of T2′)
    Let u∗_k be the nearest parent of y in T2′; Set z = v∗_k
    Connect all x ∈ S_{v∗_i} to y
    If upper limit is not below v∗_i,
    then
        connect y to missing link
    else if (upper limit is below v∗_i) or (upper limit is below v∗_k)
    then
        set missing link = y
            np missing link = u∗_k
            upper limit = z
}enddo
Once we have constructed G′ , we can update G by setting E(G) = E(G) ∪
E(G′ ). The components in G will be the maximal clusters of the RV-II. Finding the
components takes O(n). To recurse, we find the homeomorphic subtrees of T1 and T2
induced by the species in each of the maximal clusters. This can be done in O(n) as
previously described.
Thus the RV-II can be constructed in O(n^2) time.
5.5. RV-III.
Lemma 5.3. The RV-III tree T of two trees T1 and T2 always exists and is unique.
Further C(T ) = A, where A = {γ|γ = α ∩ β, α ∈ C(T1 ), β ∈ C(T2 ), γ compatible with
C(Ti ), i = 1, 2}.
Proof. We will first show that A as defined above is a compatible set. The
uniqueness will then follow from the uniqueness of a set of compatible clusters [17].
Pick two clusters γ1 = α1 ∩ β1 and γ2 = α2 ∩ β2 such that γi ∈ A; α1 , α2 ∈
C(T1 ); β1 , β2 ∈ C(T2 ). We will show that γ1 ∩ γ2 ∈ {∅, γ1 , γ2 }. Now, since γi is
compatible with C(T1 ) and C(T2 ), we have γ1 ∩ α2 ∈ {∅, γ1 , α2 }. Also, we have
γ1 ∩ β2 ∈ {∅, γ1 , β2 }. There are several cases to handle. The first case is when
γ1 ⊆ α2 , γ1 ⊆ β2 . In this case, γ1 ⊆ (α2 ∩ β2 ) or γ1 ∩ γ2 = γ1 . The second case is
when γ1 ⊇ α2 , γ1 ⊇ β2 . In this case, (α2 ∩ β2 ) ⊆ γ1 or γ1 ∩ γ2 = γ2 . The third case is
when γ1 ⊆ α2 , γ1 ⊇ β2 . In this case, (α2 ∩ β2 ) ⊆ γ1 and thus γ1 ∩ γ2 = γ2 . Hence, A
is a compatible set of clusters.
Now we will show that any tree T satisfying the RV-III rules will have its cluster
encoding equal to A. From the third requirement for RV-III (item 3 of Definition 5.3),
all the clusters in C(T ) are compatible with both C(T1 ) and C(T2 ). Now suppose we
can pick a γ ∈ C(T ) − A.
This means that γ ≠ αi ∩ βj ∀αi ∈ C(T1 ), βj ∈ C(T2 ). Let α1 and β1 be the minimal
clusters in T1 and T2 , respectively, containing γ. Clearly, α1 ∩ β1 ⊃ γ. Let u and v be
the nodes in T1 and T2 , respectively, which define the clusters α1 and β1 . Since γ is
compatible with C(T1 ) and C(T2 ), it follows that we can pick three species a, b, c such
that lcaT1 (a, b) = lcaT1 (a, c) = lcaT1 (b, c) = u, lcaT2 (a, b) = lcaT2 (a, c) = lcaT2 (b, c) =
v, and a, b ∈ γ, c ∈ (α1 ∩ β1 ) − γ. In both T1 and T2 , the triple a, b, c is unresolved,
but it is resolved as ((a, b), c) in T , thus contradicting the assumption that T satisfies
the rules defined by RV-III. Thus we have that C(T ) ⊆ A. Now suppose C(T ) ⊂ A.
Then it can be seen that we can pick a triple a, b, c which is resolved in T1 and is
either resolved the same way in T2 or is unresolved in T2 but that a, b, c is unresolved in
T . This contradicts the assumption that T satisfies the rules defined by RV-III since
it does not satisfy the second requirement (see the definition of RV-III) for a maximal
set of triples. Thus C(T ) = A.
Lemma 5.4. The RV-III tree T of two rooted trees can be computed in O(n^3) time.
Proof. We can compute C(T ) in O(n^3) time as follows. The set X = {γ | γ = α ∩ β, α ∈
C(T1 ), β ∈ C(T2 )} can be computed in O(n^3) time, since there are O(n^2) pairs to look
at and each α ∩ β can be computed in O(n) time. The set Y = {γ | γ ∈ X, γ compatible
with C(Ti ), i = 1, 2} can be computed from X in O(n^3) time, since each of the O(n^2)
clusters in X can be checked for compatibility with C(Ti ) in O(n) time. Finally, T can
be constructed from Y using the O(n^2) algorithm mentioned in [17]. Thus the total
time taken is O(n^3).
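The construction in the proof of Lemma 5.4 can be sketched directly: form all nonempty intersections α ∩ β, keep those compatible with both cluster sets, and assemble the tree from the surviving clusters. The sketch assumes nested-tuple trees and uses a simple quadratic assembly step in place of the algorithm of [17]; `clusters`, `compatible`, `rv3_clusters`, and `tree_from_clusters` are hypothetical names.

```python
def clusters(t):
    """Set of clusters (leaf sets below each node) of a nested-tuple tree."""
    found = set()
    def walk(node):
        below = frozenset([node]) if isinstance(node, str) \
            else frozenset().union(*map(walk, node))
        found.add(below)
        return below
    walk(t)
    return found

def compatible(g, cs):
    """g is compatible with cs if it is disjoint from or nested with every cluster in cs."""
    return all(not (g & c) or g <= c or c <= g for c in cs)

def rv3_clusters(t1, t2):
    """The set A of Lemma 5.3: compatible intersections of clusters of T1 and T2."""
    c1, c2 = clusters(t1), clusters(t2)
    candidates = {a & b for a in c1 for b in c2 if a & b}
    return {g for g in candidates if compatible(g, c1) and compatible(g, c2)}

def tree_from_clusters(cs):
    """Rebuild a tree from pairwise compatible clusters (singletons and S included)."""
    def build(c):
        if len(c) == 1:
            return next(iter(c))
        inside = [d for d in cs if d < c]
        kids = [d for d in inside if not any(d < e for e in inside)]
        return tuple(build(d) for d in sorted(kids, key=sorted))
    return build(max(cs, key=len))

T1 = (("a", "b"), "c", "d")
T2 = ("a", "b", ("c", "d"))
print(tree_from_clusters(rv3_clusters(T1, T2)))   # (('a', 'b'), ('c', 'd'))
```

For this profile the strict consensus is a star, whereas the RV-III tree keeps both {a, b} and {c, d}, since each is resolved in one tree and unresolved in the other.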
We now briefly discuss another local consensus rule that looks interesting but
for which, unfortunately, the consensus tree does not always exist. We define LCR1
as a rule which requires that if
a triple a, b, c is resolved as (a, (b, c)) in one tree and is either resolved as (a, (b, c)) or
unresolved in the second tree, then it is resolved as (a, (b, c)) in the consensus tree.
Although the above rule tries to capture the optimistic features of the input
trees and at the same time is not a total local consensus rule, it is the case that
the consensus tree defined by LCR1 need not exist. See Figure 7 for an example
showing that LCR1 need not necessarily produce a tree. Figure 7(iii) shows the
graph constructed by the algorithm in [3]. Since the graph is connected, it follows
that the set of triple constraints does not define a tree.
Fig. 7. Example showing that consensus tree defined by LCR1 need not exist. [Figure omitted: panels (i) and (ii) are trees on the species a, d, e, f, g, h, j; panel (iii) is a graph on the same species.]
6. Discussion and conclusions. Several approaches have been taken to handle
the problem of resolving multiple solutions. One approach has been to find a maximum
subset S0 ⊆ S inducing homeomorphic subtrees; this subtree is then called a maximum
agreement subtree [19, 13, 24, 14]. The primary disadvantage of this approach is that
it does not return an evolutionary tree on the entire species set.
The other approach which we take here requires that the resolution of the inconsistencies be represented in a single evolutionary tree for the entire species set. A
classical problem in this area is the tree compatibility problem (also called the cladistic
character compatibility problem) [10, 11, 12]. In the tree compatibility problem, the
set T of trees is said to be compatible if a tree T exists such that C(T ) = ∪_{Ti ∈ T} C(Ti ).
Equivalently, the set is compatible if a tree T exists such that for every triple A ⊆ S,
T resolves A iff T |A = Ti |A for every Ti ∈ T which resolves A. This problem can be
solved in linear
time [17, 25]. The weakness of this approach is that in practice many data sets are
incompatible, and it is therefore necessary to be able to handle the case where some
pairs of trees resolve triples differently.
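A tree T with C(T ) = ∪ C(Ti ) exists exactly when the clusters in the union are pairwise disjoint or nested, which gives a simple (quadratic, not linear-time) compatibility test. The sketch below assumes nested-tuple trees; `clusters` and `compatible_trees` are hypothetical names.

```python
from itertools import combinations

def clusters(t):
    """Set of clusters (leaf sets below each node) of a nested-tuple tree."""
    found = set()
    def walk(node):
        below = frozenset([node]) if isinstance(node, str) \
            else frozenset().union(*map(walk, node))
        found.add(below)
        return below
    walk(t)
    return found

def compatible_trees(trees):
    """True if the union of all cluster sets is pairwise disjoint-or-nested."""
    cs = set().union(*map(clusters, trees))
    return all(not (a & b) or a <= b or b <= a for a, b in combinations(cs, 2))

print(compatible_trees([(("a", "b"), "c", "d"), ("a", "b", ("c", "d"))]))   # True
print(compatible_trees([(("a", "b"), "c"), ("a", ("b", "c"))]))             # False
```

The second profile fails because the clusters {a, b} and {b, c} overlap without nesting, which is the situation in which some pair of trees resolves a triple differently.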
Some other approaches of this type are the strict consensus [4, 9] and the median
tree [5] problems. These models are stated in terms of unrooted trees, so that instead
of clusters, characters (i.e., bipartitions) on the species set are used to represent the
trees. Using the character encoding of the consensus tree as a measure of fitness to the
input, the strict consensus seeks a tree with only those characters that appear in every
tree in the input. The median tree, on the other hand, is defined by a metric d(T1 , T2 )
between rooted trees which is defined to be the cardinality of the symmetric difference
of the character sets of T1 and T2 . Given input trees T1 , . . . , Tk , T is the median tree
if it minimizes Σ_i d(T, Ti ). The median tree can be computed in polynomial time
and has a nice characterization in terms of the character encoding [5, 23, 9]. Both
the above notions are related to versions of the local consensus problem (for example,
the relaxed versions RV-I and RV-III), and the relevant local consensus trees in many
cases contain at least as much “information” as these trees.
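The symmetric-difference metric and the objective minimized by the median tree are straightforward to evaluate. The sketch below uses rooted clusters in place of characters, which is an assumption made for consistency with the rooted setting of this paper, and it only evaluates the objective for a given candidate tree; it does not search for the median tree. Trees are nested tuples, and `distance` and `total_distance` are hypothetical names.

```python
def clusters(t):
    """Set of clusters (leaf sets below each node) of a nested-tuple tree."""
    found = set()
    def walk(node):
        below = frozenset([node]) if isinstance(node, str) \
            else frozenset().union(*map(walk, node))
        found.add(below)
        return below
    walk(t)
    return found

def distance(t1, t2):
    """d(T1, T2): size of the symmetric difference of the two cluster sets."""
    return len(clusters(t1) ^ clusters(t2))

def total_distance(t, profile):
    """The sum of d(T, Ti) over the input trees Ti."""
    return sum(distance(t, ti) for ti in profile)

T1 = (("a", "b"), "c", "d")
T2 = ("a", "b", ("c", "d"))
print(distance(T1, T2))                               # 2
print(total_distance(("a", "b", "c", "d"), [T1, T2]))  # 2
```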
The work represented in this paper can be extended in several directions. As we
have noted, for all local consensus functions the local consensus tree of a set of k trees
can be computed in time polynomial in k and n = |S|. Many of these local consensus
trees can be constructed in O(kn) time.
REFERENCES
[1] E. Adams III, N-trees as nestings: Complexity, similarity, and consensus, J. Classification, 3
(1986), pp. 299–317.
[2] A. Aho, J. Hopcroft, and J. Ullman, The Design and Analysis of Computer Algorithms,
Addison–Wesley, Reading, MA, 1974.
[3] A. V. Aho, Y. Sagiv, T. G. Szymanski, and J. D. Ullman, Inferring a tree from lowest
common ancestors with an application to the optimization of relational expressions, SIAM
J. Comput., 10 (1981), pp. 405–421.
[4] J. Barthélemy and F. Janowitz, A formal theory of consensus, SIAM J. Discrete Math., 3
(1991), pp. 305–322.
[5] J. Barthélemy and F. McMorris, The median procedure for n-Trees, J. Classification, 3
(1986), pp. 329–334.
[6] W. Brown, E. M. Prager, A. Wang, and A. C. Wilson, Mitochondrial DNA sequences of
primates: Tempo and mode of evolution, J. Mol. Evol., 18 (1982), pp. 225–239.
[7] D. Bryant and M. Steel, Extension operations on sets of leaf-labelled trees, Research report
118, Department of Mathematics and Statistics, University of Canterbury, Christchurch,
New Zealand, 1994.
[8] H. Colonius and H. H. Schulze, Tree structures for proximity data, British J. Math. Statist.
Psych., 34 (1981), pp. 167–180.
[9] W. H. E. Day, Optimal algorithms for comparing trees with labeled leaves, J. Classification, 2
(1985), pp. 7–28.
[10] G. F. Estabrook, C. S. Johnson, Jr., and F. R. McMorris, An idealized concept of the true
cladistic character, Math. Biosci., 23 (1975), pp. 263–272.
[11] G. F. Estabrook, C. S. Johnson, Jr., and F. R. McMorris, An algebraic analysis of cladistic
characters, Discrete Math., 16 (1976), pp. 141–147.
[12] G. F. Estabrook, C. S. Johnson, Jr., and F. R. McMorris, A mathematical foundation for
the analysis of cladistic character compatibility, Math. Biosci., 29 (1976), pp. 181–187.
[13] M. Farach and M. Thorup, Optimal evolutionary tree comparison by sparse dynamic programming, in Proc. 35th Annual Symposium on Foundations of Computer Science, IEEE
Computer Society Press, Piscataway, NJ, November 1994, pp. 770–779.
[14] M. Farach, T. Przytycka, and M. Thorup, On the agreement of many trees, Inform. Process.
Lett., 55 (1995), pp. 297–301.
[15] J. Felsenstein, Numerical methods for inferring evolutionary trees, Quart. Review of Biology,
57 (1982), pp. 379–404.
[16] C. R. Finden and A. D. Gordon, Obtaining common pruned trees, J. Classification, 2 (1985),
pp. 225–276.
[17] D. Gusfield, Efficient algorithms for inferring evolutionary trees, Networks, 21 (1991), pp. 19–
28.
[18] S. Kannan, E. Lawler, and T. Warnow, Determining the evolutionary tree using experiments, J. Algorithms, 21 (1996), pp. 26–50.
[19] D. Keselman and A. Amir, Maximum agreement subtree in a set of evolutionary trees—
Metrics and efficient algorithms, in Proc. 35th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press, Piscataway, NJ, November 1994, pp. 758–769.
[20] D. Harel and R. Tarjan, Fast algorithm for finding nearest common ancestors, SIAM J.
Comput., 13 (1984), pp. 338–355.
[21] M. Henzinger, V. King, and T. Warnow, Constructing a tree from homeomorphic subtrees,
with applications to computational evolutionary biology, in Proc. 7th Annual ACM-SIAM
Symposium on Discrete Algorithms, ACM/SIAM, January 28–30, 1996, pp. 333–340.
[22] G. Nelson, Cladistic analysis and synthesis: Principles and definitions, with a historical note
on Adanson’s Famille des Plantes (1763–1764), Systematic Zoology, 28 (1979), pp. 1–21.
[23] F. McMorris and M. Steel, The complexity of the median procedure for binary trees, in Proc.
4th Conference of the International Federation of Classification Societies, Paris, 1993; Stud.
Classification Data Anal. Knowledge Organ., by Springer-Verlag, to appear.
[24] M. Steel and T. Warnow, Kaikoura tree theorems: Computing the maximum agreement
subtree, Inform. Process. Lett., 48 (1993), pp. 77–82.
[25] T. Warnow, Tree compatibility and inferring evolutionary history, J. Algorithms, 16 (1994),
pp. 388–407.