Most existing measures of distance between phylogenetic trees are based on the geometry or topology of the trees. Instead, we consider distance measures which are based on the underlying probability distributions on genetic sequence data induced by trees. Monte Carlo schemes are necessary to calculate these distances approximately, and we describe efficient sampling procedures. Key features of the distances are the ability to include substitution model parameters and to handle trees with different taxon sets in a principled way. We demonstrate some of the properties of these new distance measures and compare them to existing distances, in particular by applying multidimensional scaling to data sets previously reported as containing phylogenetic islands.
For \(N=2,3,\ldots \) we define \(U_{N}\) to be the set of unrooted phylogenetic trees on N taxa. More specifically, a tree T is an element of \(U_{N}\) if it satisfies the following conditions. First, T contains exactly N vertices with degree 1, which are called leaves, and these are bijectively labelled \(1,\ldots ,N\). Secondly, T must contain no vertices with degree 2. Thirdly, each edge e in T is assigned a length \(\ell ^e\ge 0\) with \(\ell ^e\ne 0\) if e contains a leaf. An edge in a tree is called a pendant edge if it contains a leaf; otherwise it is called an internal edge. Similarly, the vertices which are not leaves are called internal vertices.
The edge lengths \(\ell ^e\) on any given tree \(T\in U_{N}\) define a path length distance between any pair of leaves. The path length on T between \(u,v\in \{1,\ldots ,N\}\) will be denoted \(\ell _{uv}\).
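As a minimal sketch of the path length \(\ell _{uv}\), the following Python function sums edge lengths along the unique path between each pair of leaves. The edge-list representation, the function name `leaf_path_lengths`, and the example tree are illustrative assumptions, not part of the original text.

```python
from collections import defaultdict


def leaf_path_lengths(edges, leaves):
    """Return {(u, v): path length} for all ordered leaf pairs with u != v.

    edges:  list of (vertex_a, vertex_b, length) triples describing a tree.
    leaves: list of leaf labels.
    """
    adjacency = defaultdict(list)
    for a, b, length in edges:
        adjacency[a].append((b, length))
        adjacency[b].append((a, length))

    result = {}
    for start in leaves:
        # Depth-first traversal: in a tree every vertex is reached by
        # exactly one path, so the first distance found is the path length.
        dist = {start: 0.0}
        stack = [start]
        while stack:
            vertex = stack.pop()
            for neighbour, length in adjacency[vertex]:
                if neighbour not in dist:
                    dist[neighbour] = dist[vertex] + length
                    stack.append(neighbour)
        for end in leaves:
            if end != start:
                result[(start, end)] = dist[end]
    return result


# Hypothetical 4-leaf tree: leaves 1..4, internal vertices "a" and "b".
example_edges = [(1, "a", 0.1), (2, "a", 0.2), ("a", "b", 0.5),
                 (3, "b", 0.3), (4, "b", 0.4)]
example_lengths = leaf_path_lengths(example_edges, [1, 2, 3, 4])
```

Here `example_lengths[(1, 3)]` adds the two pendant edges and the internal edge on the path from leaf 1 to leaf 3.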
Each tree \(T\in U_{N}\) contains at most \(2N-3\) edges. A tree with exactly \(2N-3\) edges is called fully resolved or bifurcating, and all of its internal vertices have degree 3. Trees with fewer edges are called unresolved, and for \(N>3\), these contain at least one vertex with degree 4 or more. Trees which contain only the N pendant edges joined at a single degree-N internal vertex are called star trees.
A tree T is rooted when some internal point \(\rho \in T\) is labelled as being the root. This is conveniently achieved by adding an additional taxon labelled 0 to the tree via a pendant edge of length zero. It follows that the set of rooted phylogenetic trees satisfies the same conditions as \(U_{N}\), except the leaves are bijectively labelled \(0,1,\ldots ,N\), and the pendant edge containing taxon 0 has zero length. We will work with unrooted trees, but our results are easily transferred to the space of rooted trees via this relationship.
Every fully resolved tree will correspond to a fully resolved BHV-tree (reviewed in Sect. 2.2) and to a fully resolved wald, as introduced below in Sect. 2.4. In both BHV tree space and in wald space, unresolved trees will be identified with other trees with certain internal edges having zero length, so that conceptually a missing edge is the same as a zero length edge.
Billera et al. (2001) defined a space of phylogenetic trees, subsequently known as BHV tree space, and described its geometry. BHV tree space can be described via an embedding in \(\mathbb {R}^d\) for dimension d which increases exponentially with the number of leaves. However, we have chosen to describe BHV tree space in a way different from the original authors, and we define it as a quotient space. As a result, the wald space introduced in the next section is a superset of BHV tree space when the spaces are regarded simply as sets, clarifying the relationship between the two spaces. Importantly, we allow internal edges on trees to have length zero, and under the quotient these are equivalent to trees with those edges missing. A second difference is that while Billera et al. (2001) worked with rooted trees, we work with unrooted trees. As described in Sect. 2.3, the distribution on binary characters determined by a tree does not depend on the root position under the two-state symmetric model, and so unrooted trees are more natural to work with.
BHV tree space is defined using the notion of splits, where a split is a bipartition of the leaf labels \(1,\ldots ,N\) into two disjoint sets. Cutting an edge of a tree induces such a bipartition of the leaves, and so each edge on a tree corresponds to a split, and the terms split and edge can be used interchangeably. The set of splits represented by a tree is called its topology.
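The correspondence between edges and splits can be sketched in Python by cutting each edge in turn and collecting the leaves on either side. The unweighted edge-list input and the name `tree_splits` are assumptions made for illustration.

```python
from collections import defaultdict


def tree_splits(edges, leaves):
    """Return the set of splits of a tree, one per edge.

    edges:  list of (vertex_a, vertex_b) pairs describing a tree.
    leaves: iterable of leaf labels.
    Each split is a frozenset containing the two sides as frozensets.
    """
    leaves = frozenset(leaves)
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    splits = set()
    for a, b in edges:
        # Collect every vertex reachable from a without crossing edge (a, b).
        seen = {a}
        stack = [a]
        while stack:
            vertex = stack.pop()
            for neighbour in adjacency[vertex]:
                if neighbour in seen:
                    continue
                if vertex == a and neighbour == b:
                    continue  # this is the edge being cut
                seen.add(neighbour)
                stack.append(neighbour)
        side = frozenset(seen & leaves)
        splits.add(frozenset({side, leaves - side}))
    return splits


# Hypothetical 4-leaf tree with internal vertices "a" and "b".
example_topology = tree_splits(
    [(1, "a"), (2, "a"), ("a", "b"), (3, "b"), (4, "b")], [1, 2, 3, 4])
```

For this example the result contains four pendant splits and the single internal split separating leaves 1, 2 from leaves 3, 4.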
Two trees in \(U_{N}\) are equivalent under the relation \(\sim \) when they are identical after internal edges with length zero are removed, and the vertices at the end of every such edge are merged. A, B, C, D represent different subtrees joined by edges of length \(\ell ^A,\ell ^B,\ell ^C,\ell ^D\) to an internal edge with length \(\ell =0\) on the left. The Markov process X(t) cannot change state on any edge with length zero, so the distribution on \(X_1,\ldots ,X_N\) is unchanged by removing such edges in this way.
Arbitrary sets of splits do not typically determine valid tree topologies: the splits of a tree must satisfy a compatibility condition. For example, the splits \(\{1,2\},\{3,4,\ldots ,N\}\) and \(\{1,3\},\{2,4,\ldots ,N\}\) are incompatible, since leaf 1 cannot be grouped next to both 2 and 3 on the same tree. For any topology \(\tau \) with k internal edges, \(0\le k\le N-3\), the set of trees in \(U_{N}\) with that topology is bijectively parametrized by \(\mathbb {R}_{> 0}^N\times \mathcal {O}_\tau \) where the first term in the product parametrizes the pendant edge lengths that, by definition, are strictly positive, and \(\mathcal {O}_\tau = \mathbb {R}^{k}_{\ge 0}\) parametrizes the internal edge lengths.
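The compatibility condition can be checked pairwise using the standard criterion that two splits \(A|A'\) and \(B|B'\) are compatible exactly when at least one of the four intersections \(A\cap B\), \(A\cap B'\), \(A'\cap B\), \(A'\cap B'\) is empty. The sketch below applies it to the example above with \(N=5\); the function name and the tuple-of-sets representation are illustrative assumptions.

```python
def splits_compatible(split_1, split_2):
    """Return True if two splits can occur together on one tree.

    Each split is a pair (side, other_side) of sets of leaf labels.
    """
    a1, a2 = split_1
    b1, b2 = split_2
    # Compatible exactly when some pair of sides has no leaf in common.
    return any(not (a & b) for a in (a1, a2) for b in (b1, b2))


# The incompatible pair from the text, written out for N = 5.
split_12 = ({1, 2}, {3, 4, 5})
split_13 = ({1, 3}, {2, 4, 5})
# A split that is compatible with split_12.
split_45 = ({4, 5}, {1, 2, 3})

incompatible_example = splits_compatible(split_12, split_13)
compatible_example = splits_compatible(split_12, split_45)
```

A set of splits forms a valid topology only if every pair passes this test.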
The set \(\mathcal{O}_\tau\) is called the orthant associated with topology \(\tau\), and we identify the set of all trees with topology \(\tau\) with \(\mathbb{R}_{>0}^N\times \mathcal{O}_\tau\). Under this identification, the set of all trees \(U_{N}\), as defined in Sect. 2.1, is the disjoint union
\[U_{N} = \bigsqcup_{\tau} \mathbb{R}_{>0}^N\times \mathcal{O}_\tau\]
taken over all topologies \(\tau\). Identifying equivalent trees under the relation \(\sim\) gives the quotient
\[\mathcal{U}_{N} = U_{N}/\sim \;\cong\; \mathbb{R}_{>0}^N\times \mathrm{BHV}_{N},\]
where the first term parametrizes the lengths of the pendant edges and the space \(\mathrm{BHV}_{N}\) parametrizes the topology and internal edge lengths of the BHV-trees. When \(\tau\) is fully resolved, \(\mathcal{O}_\tau\) is called a maximal orthant. Unresolved trees correspond to points on the boundaries of maximal orthants; they can be obtained from fully resolved trees by shrinking internal edge lengths down to zero.
Fig. 2. When an internal edge from a fully resolved topology is contracted down to length zero (left to centre), there are two fully resolved topologies which can be obtained by expanding out an alternative edge (right). A, B, C, D represent subtrees. The operation of contracting an internal edge and expanding out an alternative edge is called nearest neighbour interchange. It follows that at each codimension-1 boundary, three maximal orthants are glued together.
Since there are \((2N-5)!!\) fully resolved unrooted topologies, \(\mathrm {BHV}_{N}\) can be thought of as being constructed by gluing this number of maximal orthants together along their boundaries, where two points are identified if they correspond to the same tree. For example, when \(N=4\), there are three fully resolved topologies, each of which contains a single internal edge. The space \(\mathrm {BHV}_{4}\) therefore consists of three copies of \(\mathbb {R}_{\ge 0}\) glued together at the origin. The origin corresponds to the star trees, while the location along each of the three copies of \(\mathbb {R}_{\ge 0}\) gives the length of the internal edge in each of the three possible fully resolved topologies. For \(N=5\) there are 15 possible unrooted tree topologies, each of which contains two internal edges. It follows that \(\mathrm {BHV}_{5}\) consists of 15 copies of \(\mathbb {R}^2_{\ge 0}\) glued along their boundaries. At each codimension-1 boundary, three maximal orthants are joined together. This is because when a single internal edge is contracted to length zero, a degree 4 vertex is obtained, and there are 3 possible ways to add in an edge, including the original edge, in order to obtain a fully resolved topology, as illustrated by Fig. 2.
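The counts quoted above can be reproduced with a short Python sketch. The function names are illustrative assumptions, and the subtrees in the nearest neighbour interchange helper are represented only by labels.

```python
def num_resolved_unrooted_topologies(n_leaves):
    """Return (2N-5)!!, the number of fully resolved unrooted topologies."""
    if n_leaves < 3:
        raise ValueError("this sketch assumes at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2N-5 (empty product when N = 3).
    for odd in range(3, 2 * n_leaves - 4, 2):
        count *= odd
    return count


def maximal_orthant_dimension(n_leaves):
    """Number of internal edges of a fully resolved tree: N - 3."""
    return n_leaves - 3


def resolutions_of_degree_4_vertex(a, b, c, d):
    """The three ways to pair four subtrees across a new internal edge."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


topology_counts = {n: num_resolved_unrooted_topologies(n) for n in (4, 5, 6)}
quartet_resolutions = resolutions_of_degree_4_vertex("A", "B", "C", "D")
```

The three pairings returned by `resolutions_of_degree_4_vertex` correspond to the three maximal orthants that meet at each codimension-1 boundary.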
The metric on \(\mathrm {BHV}_{N}\) is constructed as follows. The basic idea is that for trees with the same fully-resolved topology but different vectors of internal edge lengths, say \(\varvec{\ell }_1\) and \(\varvec{\ell }_2\), the distance is the Euclidean distance \(\Vert \varvec{\ell }_1-\varvec{\ell }_2\Vert \), and the corresponding geodesic is the straight line segment in the orthant containing the trees. Billera et al. (2001) showed that there exists a unique shortest path between any two points in \(\mathrm {BHV}_{N}\), for which path length is measured using the Euclidean distance in each orthant, and the length of this path defines a metric on \(\mathrm {BHV}_{N}\) which we denote \(d_\mathrm {BHV}\). A metric on \(\mathcal {U}_{N}\), denoted \(d_{\mathcal {U}_{N}}\), is obtained as the product metric when the metric on pendant edges is taken to be the Euclidean distance. An algorithm has been developed which constructs geodesics and calculates their lengths in \(O(N^4)\) time (Owen and Provan 2011).
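Two simple cases of \(d_\mathrm {BHV}\) can be sketched directly in Python: trees in the same orthant, and the case \(N=4\), where the space is three rays glued at the origin so any path between different topologies must pass through the star tree. This is not the general geodesic algorithm of Owen and Provan (2011); the function names and the topology labels are illustrative assumptions.

```python
import math


def bhv_distance_same_orthant(lengths_1, lengths_2):
    """Euclidean distance between two internal edge length vectors
    belonging to trees with the same fully resolved topology."""
    if len(lengths_1) != len(lengths_2):
        raise ValueError("length vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lengths_1, lengths_2)))


def bhv4_distance(topology_1, length_1, topology_2, length_2):
    """Distance in BHV_4, viewed as three copies of [0, inf) glued at 0.

    topology_*: any hashable label for one of the three quartet topologies.
    length_*:   the non-negative length of the single internal edge.
    """
    if topology_1 == topology_2:
        return abs(length_1 - length_2)
    # Different rays: the shortest path runs through the origin (star tree).
    return length_1 + length_2


same_ray = bhv4_distance("12|34", 0.7, "12|34", 0.2)
different_rays = bhv4_distance("12|34", 0.7, "13|24", 0.2)
same_orthant = bhv_distance_same_orthant([3.0, 0.0], [0.0, 4.0])
```

When one of the internal edge lengths is zero, both trees lie at or on a ray from the star tree, and the two branches of `bhv4_distance` agree.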
Genetic sequence evolution is typically modelled using discrete-valued continuous-time Markov processes defined over the edges of a tree T (Yang 2006; Bryant et al. 2005). DNA sequence evolution is modelled by associating to each point \(t\in T\), a random variable X(t) which takes values in an alphabet \(\{A,C,G,T\}\). In this paper, however, we will consider the two-state symmetric Markov process with alphabet \(\varOmega =\{0,1\}\). This simplification is made in order to make the mathematics more tractable and for computational speed. Nonetheless, some of the calculations using the two-state symmetric model can readily be performed using DNA models. More details are given in the thesis of Garba (2019) in which simulations show similarity of geometries obtained from the two- and the four-state process. The transition probability of the symmetric two-state model is defined in terms of the path length \(\ell _{t_1t_2}\) between any two points \(t_1,t_2\in T\).
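The transition formula itself is not reproduced in this excerpt. As a hedged sketch, the Python function below uses the standard closed form for a symmetric two-state continuous-time Markov chain in which each state switches to the other at a constant rate. The `rate` parameter and its default value are assumptions about normalisation and may differ from the convention used by the authors.

```python
import math


def two_state_transition_probability(state_from, state_to, path_length,
                                     rate=1.0):
    """P(X(t2) = state_to | X(t1) = state_from) for a symmetric two-state chain.

    path_length: path length between t1 and t2 on the tree.
    rate: switching rate per unit length (assumed normalisation, not
          taken from the text).
    """
    if state_from not in (0, 1) or state_to not in (0, 1):
        raise ValueError("states must be 0 or 1")
    if path_length < 0:
        raise ValueError("path length must be non-negative")
    decay = math.exp(-2.0 * rate * path_length)
    if state_from == state_to:
        return 0.5 * (1.0 + decay)
    return 0.5 * (1.0 - decay)


stay_on_zero_length = two_state_transition_probability(0, 0, 0.0)
switch_on_long_path = two_state_transition_probability(0, 1, 50.0)
```

On a path of length zero the state cannot change, and on very long paths the two states become equally likely regardless of the starting state.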