
mixtree
Cyril Geismar
2025-02-25
Introduction
The mixtree package provides a statistical framework for comparing sets of trees. The function tree_test() can apply various hypothesis testing approaches to assess differences between tree sets. The package currently supports transmission trees; future updates will expand functionality to include phylogenetic trees and, more generally, directed acyclic graphs (DAGs).
Methods
The package implements the following testing methods:
- PERMANOVA: Evaluates whether the topological distribution of trees differs between sets.
  Null Hypothesis (H0): Transmission trees in all sets are drawn from the same distribution, implying similar topologies.
  Alternative Hypothesis (H1): At least one set of transmission trees comes from a different distribution.
- Chi-Square or Fisher’s Exact Test: Evaluates whether the distribution of ancestor-descendant pairs differs between sets.
  Null Hypothesis (H0): The frequency of ancestor-descendant pairs is consistent across all sets.
  Alternative Hypothesis (H1): The frequency of ancestor-descendant pairs differs between at least two sets.
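The idea behind the second method can be illustrated in base R. The sketch below is not the package's internal code: pair_table() is a hypothetical helper, the toy sets are made up, and it assumes that only direct infector-infectee pairs are counted. It tabulates how often each pair occurs in each set and passes the resulting contingency table to stats::chisq.test().

```r
# Hypothetical helper (not a mixtree function): count how often each
# "from->to" pair occurs in each set of trees.
pair_table <- function(...) {
  sets <- list(...)
  # One label per edge, concatenated set by set.
  pairs <- unlist(lapply(sets, function(set) {
    unlist(lapply(set, function(tr) paste(tr$from, tr$to, sep = "->")))
  }))
  # Number of edges contributed by each set, used to label the rows above.
  n_edges <- vapply(sets, function(set) {
    sum(vapply(set, nrow, integer(1)))
  }, integer(1))
  set_id <- rep(seq_along(sets), times = n_edges)
  table(pair = pairs, set = set_id)
}

# Toy sets of 30 identical trees each: case 1 always infects case 2,
# but the infector of case 3 differs between the sets.
setA <- replicate(30, data.frame(from = c(1, 1), to = c(2, 3)), simplify = FALSE)
setB <- replicate(30, data.frame(from = c(1, 2), to = c(2, 3)), simplify = FALSE)

tab <- pair_table(setA, setB)
tab
stats::chisq.test(tab)
```

When many cells of the table have small expected counts, stats::fisher.test(tab) is the usual alternative to the Chi-Square approximation.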
Input Requirements
Each input set must be a list of data frames. Every data frame represents a tree and must contain exactly two columns:
- from: The parent node (or infector).
- to: The child node (or infectee).
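As a minimal illustration of this layout, the sketch below builds a set of two hand-written trees in base R. The trees and the is_valid_tree_set() helper are hypothetical and not part of mixtree; the helper only checks the structure described above.

```r
# Two hand-written transmission trees; each row is one infector -> infectee link.
tree1 <- data.frame(from = c(1, 1, 2), to = c(2, 3, 4))
tree2 <- data.frame(from = c(1, 2, 2), to = c(2, 3, 4))

# A set of trees is a plain list of such data frames.
tree_set <- list(tree1, tree2)

# Hypothetical helper (not a mixtree function): TRUE if `set` is a list of
# data frames that each contain exactly the columns "from" and "to".
is_valid_tree_set <- function(set) {
  is.list(set) && !is.data.frame(set) &&
    all(vapply(set, function(tr) {
      is.data.frame(tr) && ncol(tr) == 2 && setequal(names(tr), c("from", "to"))
    }, logical(1)))
}

is_valid_tree_set(tree_set)
```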
make_tree is a helper function that simulates a DAG. When stochastic = TRUE, the number of branches per node is drawn from a Poisson distribution with mean λ = R.
make_tree(20, R = 2, stochastic = TRUE, plot = TRUE)
#> IGRAPH 0586988 D--- 20 19 --
#> + edges from 0586988:
#> [1] 1-> 2 1-> 3 2-> 4 2-> 5 2-> 6 3-> 7 3-> 8 4-> 9 5->10 5->11
#> [11] 6->12 6->13 7->14 7->15 8->16 9->17 9->18 9->19 10->20
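To make the branching mechanism concrete, here is a rough base R sketch of a Poisson branching process. It is an illustration only and not the implementation of make_tree: sketch_tree() is a hypothetical function, it returns a from/to data frame rather than an igraph object, and it simply stops early if every node happens to draw zero children.

```r
# Hypothetical sketch (not mixtree's make_tree): grow a tree breadth-first,
# drawing each node's number of children from a Poisson distribution with mean R.
sketch_tree <- function(n, R) {
  from <- integer(0)
  to <- integer(0)
  next_id <- 2L   # node 1 is the root, so the next new node is 2
  current <- 1L   # node whose children are drawn next
  # Continue while nodes remain to be added and unprocessed nodes exist.
  while (next_id <= n && current < next_id) {
    # Cap the draw so the tree never exceeds n nodes.
    k <- min(stats::rpois(1, lambda = R), n - next_id + 1L)
    if (k > 0) {
      children <- seq.int(next_id, length.out = k)
      from <- c(from, rep(current, k))
      to <- c(to, children)
      next_id <- next_id + k
    }
    current <- current + 1L
  }
  data.frame(from = from, to = to)
}

set.seed(1)
sketch <- sketch_tree(20, R = 2)
head(sketch)
```

Because nodes are numbered in order of creation, every parent has a smaller ID than its children, which matches the pattern visible in the edge list printed above.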
Usage
The unified interface is provided by the tree_test() function. Users can supply two or more sets of trees and select the desired testing method via the method parameter.
PERMANOVA
set.seed(123)
# Generate 100 trees with R₀ = 2
chainA <- lapply(1:100, function(i) {
  make_tree(20, R = 2, stochastic = TRUE) |>
    igraph::as_long_data_frame()
})
# Generate 100 trees with R₀ = 4
chainB <- lapply(1:100, function(i) {
  make_tree(20, R = 4, stochastic = TRUE) |>
    igraph::as_long_data_frame()
})
tree_test(chainA, chainB, method = "permanova")
#> Permutation test for adonis under reduced model
#> Permutation: free
#> Number of permutations: 999
#>
#> (function (formula, data, permutations = 999, method = "bray", sqrt.dist = FALSE, add = FALSE, by = NULL, parallel = getOption("mc.cores"), na.action = na.fail, strata = NULL, ...)
#> Df SumOfSqs R2 F Pr(>F)
#> Model 1 8052 0.14429 33.388 0.001 ***
#> Residual 198 47750 0.85571
#> Total 199 55802 1.00000
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value (0.001) is below the 5% significance level, so we reject the null hypothesis that the two sets of trees are drawn from the same distribution.
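The R2 and F columns of the table above follow from the sums of squares and degrees of freedom. The short check below copies the printed (rounded) values by hand, so it should agree with the table only up to rounding; the variable names are chosen for this illustration.

```r
# Values copied from the PERMANOVA table above (rounded as printed).
ss_model <- 8052
ss_resid <- 47750
ss_total <- 55802
df_model <- 1
df_resid <- 198

# R2: share of the total sum of squares explained by set membership.
r2 <- ss_model / ss_total

# Pseudo-F: ratio of the model mean square to the residual mean square.
pseudo_f <- (ss_model / df_model) / (ss_resid / df_resid)

# Smallest p-value attainable with 999 permutations.
min_p <- 1 / (999 + 1)

round(c(R2 = r2, F = pseudo_f, min_p = min_p), 3)
```

A reported p-value of 0.001 therefore means that none of the 999 permuted datasets produced a pseudo-F at least as large as the observed one.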
Advanced Usage
The tree_test() function accepts additional parameters to customise the testing process:
- within_dist: A function to compute pairwise distances within a tree (used with PERMANOVA). Default is patristic.
- between_dist: A function to compute the distance between two trees (used with PERMANOVA). Default is euclidean.
- test_args: A list of extra arguments passed to the underlying test function (i.e. vegan::adonis2, stats::chisq.test, or stats::fisher.test).
Using Custom Distance Functions
The package supports custom distance functions, such as the MRCI depth measure described in Kendall et al. (2018). See also the vignette from treespace.
library(treespace)
mrciDepth <- function(tree) {
  treespace::findMRCIs(as.matrix(tree))$mrciDepths
}
tree_test(chainA, chainB, within_dist = mrciDepth)
#> Permutation test for adonis under reduced model
#> Permutation: free
#> Number of permutations: 999
#>
#> (function (formula, data, permutations = 999, method = "bray", sqrt.dist = FALSE, add = FALSE, by = NULL, parallel = getOption("mc.cores"), na.action = na.fail, strata = NULL, ...)
#> Df SumOfSqs R2 F Pr(>F)
#> Model 1 3723.5 0.14315 33.078 0.001 ***
#> Residual 198 22288.0 0.85685
#> Total 199 26011.5 1.00000
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note
Randomly shuffling node IDs will not affect the PERMANOVA test results if the distance functions are invariant to node labelling. Since the test focuses on the tree’s topology and branch lengths rather than the specific identifiers, metrics such as patristic distances—derived solely from the tree structure—remain unchanged when node IDs are permuted. However, if a custom function depends on the order or specific labels of nodes, then shuffling could influence the results.
chainA <- lapply(1:50, function(i) {
  make_tree(20, R = 2, stochastic = TRUE)
})
chainB <- lapply(1:50, function(i) {
  df <- mixtree:::shuffle_graph_ids(chainA[[i]]) |>
    igraph::as_long_data_frame()
  subset(df, select = c("from", "to"))
})
chainA <- lapply(chainA, igraph::as_long_data_frame)
tree_test(chainA, chainB, method = "permanova")
#> Permutation test for adonis under reduced model
#> Permutation: free
#> Number of permutations: 999
#>
#> (function (formula, data, permutations = 999, method = "bray", sqrt.dist = FALSE, add = FALSE, by = NULL, parallel = getOption("mc.cores"), na.action = na.fail, strata = NULL, ...)
#> Df SumOfSqs R2 F Pr(>F)
#> Model 1 0 0 0 1
#> Residual 98 29757 1
#> Total 99 29757 1
# In contrast, the Chi-Square test rejects the null because it compares the distribution of ancestries for each case
tree_test(chainA, chainB, method = "chisq")
#>
#> Pearson's Chi-squared test
#>
#> data: count data
#> X-squared = 778.2, df = 207, p-value < 2.2e-16
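The contrast between the two tests can be reproduced on a toy example without the package. In the sketch below, the tree, the relabelling, and the two helper functions are hypothetical illustrations. It compares a label-free summary (the sorted number of offspring per node) with a label-dependent one (the set of from->to pairs) before and after relabelling.

```r
# A small tree and a relabelled copy (hypothetical toy data).
tree <- data.frame(from = c(1, 1, 2, 2), to = c(2, 3, 4, 5))

ids <- sort(unique(c(tree$from, tree$to)))
new_ids <- rev(ids)  # fixed relabelling, chosen by hand for reproducibility
shuffled <- data.frame(
  from = new_ids[match(tree$from, ids)],
  to   = new_ids[match(tree$to, ids)]
)

# Label-free summary: sorted number of offspring per node (leaves count as 0).
offspring_profile <- function(tr) {
  nodes <- unique(c(tr$from, tr$to))
  sort(as.integer(table(factor(tr$from, levels = nodes))))
}

# Label-dependent summary: the sorted set of "from->to" pairs.
edge_labels <- function(tr) sort(paste(tr$from, tr$to, sep = "->"))

# The shape of the tree is unchanged by relabelling ...
identical(offspring_profile(tree), offspring_profile(shuffled))
# ... but the identity of who infected whom is not.
identical(edge_labels(tree), edge_labels(shuffled))
```

A distance built only from label-free quantities behaves like the first comparison, while a test on ancestor-descendant pairs behaves like the second.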