Building a clustering tree
For my PhD I am working on methods for analysing single-cell RNA-sequencing (scRNA-seq) data which measure the expression of genes in individual cells. One of the most common analyses done on this type of data is to cluster the cells, often in an attempt to find out what cell types are present in a sample.
In a recent seminar I showed some images of what I am calling a “clustering tree” (you can see the slides here if you are interested). This is a visualisation I came up with to show the relationship between clusterings as the number of clusters is increased. A few people asked how I had made it so here is a short example.
Setup
First we need to load the libraries we are going to use:
# Simulation
library(splatter)
# Clustering
library(Seurat) # Installed from https://github.com/satijalab/seurat
# Graphs
library(igraph)
# Plotting
library(ggraph)
library(viridis)
# Data manipulation
library(tidyverse)
For this example I am going to simulate some scRNA-seq data with eight different groups with different numbers of cells using Splatter
.
<- splatSimulateGroups(groupCells = c(100, 80, 60, 50, 30, 20, 20, 15),
sim seed = 10, verbose = FALSE)
Let’s take a quick look at this to see if it is anything like we would expect:
plotTSNE(sim, colour_by = "Group")
Here we can see the different groups, so there should be something for our clustering analysis to find.
The clustering pacakge we are going to use is Seurat
which uses it’s own object to store the data. Here is a small function I have written to convert from the SCESet
object produced by Splatter
to the Seurat
object required by Seurat
.
<- function(sce) {
SCESetToSeurat if (!is(sce,'SCESet')) {
stop("sce must be an SCESet object")
}
<- scater::counts(sce)
counts
if (is.null(counts)) {
stop("sce must contain counts to convert to Seurat")
}
<- new("seurat",
seurat raw.data = counts,
is.expr = sce@lowerDetectionLimit,
data.info = Biobase::pData(sce),
cell.names = Biobase::sampleNames(sce))
return(seurat)
}
Now to convert the dataset.
<- SCESetToSeurat(sim) seurat
Clustering
We now have a dataset in the format required by Seurat
. Before we do any clustering we need to run through some setup steps. I’m not going to explain what they are doing here, if you want to now the details refer to the Seurat
tutorials.
<- Setup(seurat, project = "Example", meta.data = seurat@data.info)
seurat <- MeanVarPlot(seurat, fxn.x = expMean, fxn.y = logVarDivMean,
seurat x.low.cutoff = 0.1, x.high.cutoff = 3,
y.cutoff = 0.5, do.contour = FALSE)
<- PCA(seurat, pc.genes = seurat@var.genes, do.print = FALSE) seurat
Now we can do the clustering. The parameter we are interested in is the resolution
parameter which controls how many clusters Seurat
returns. I start by setting resolution = 0
. This will create a cluster containing all cells that will serve as the root of our tree. We also ask Seurat
to store some of the intermediate calculations so we don’t have to do them again when we cluster with different resolutions:
<- FindClusters(seurat, pc.use = 1:20, resolution = 0, algorithm = 3,
seurat print.output = FALSE, save.SNN = TRUE)
We can now loop over a range of resolutions that we are interested in. I have only tried a few values here but if this was a real dataset you might want to try some more.
for (res in c(0.3, 0.6, 0.9, 1.2)) {
<- FindClusters(seurat, resolution = res, algorithm = 3,
seurat print.output = FALSE)
}
Get results
Seurat
stores the cluster labels in the data.info
slot in columns starting with res.
. This is the only part we are interested in so let’s extract just those columns.
<- seurat@data.info %>% select(contains("res."))
clusterings
head(clusterings)
## res.0 res.0.3 res.0.6 res.0.9 res.1.2
## Cell1 0 1 0 0 0
## Cell2 0 1 0 0 0
## Cell3 0 1 0 0 0
## Cell4 0 1 0 0 0
## Cell5 0 1 0 0 0
## Cell6 0 1 0 0 0
We now know which cluster each cell was assigned to at each resolution but to build the tree we need some more information. This next function looks at two neighbouring resolutions and works out how many cells moved from a cluster in the lower resolution to each cluster in the higher resolution. These transitions are going to form the edges of our tree.
<- function(clusterings) {
getEdges
# Loop over the different resolutions
<- lapply(1:(ncol(clusterings) - 1), function(i) {
transitions
# Extract two neighbouring clusterings
<- sort(colnames(clusterings))[i]
from.res <- sort(colnames(clusterings))[i + 1]
to.res
# Get the cluster names
<- sort(unique(clusterings[, from.res]))
from.clusters <- sort(unique(clusterings[, to.res]))
to.clusters
# Get all possible combinations
<- expand.grid(FromClust = from.clusters,
trans.df ToClust = to.clusters)
# Loop over the possible transitions
<- apply(trans.df, 1, function(x) {
trans <- x[1]
from.clust <- x[2]
to.clust
# Find the cells from those clusters
<- clusterings[, from.res] == from.clust
is.from <- clusterings[, to.res] == to.clust
is.to
# Count them up
<- sum(is.from & is.to)
trans.count
# Get the sizes of the two clusters
<- sum(is.from)
from.size <- sum(is.to)
to.size
# Get the proportions of cells moving along this edge
<- trans.count / from.size
trans.prop.from <- trans.count / to.size
trans.prop.to
return(c(trans.count, trans.prop.from, trans.prop.to))
})
# Tidy up the results
$FromRes <- as.numeric(gsub("res.", "", from.res))
trans.df$ToRes <- as.numeric(gsub("res.", "", to.res))
trans.df$TransCount <- trans[1, ]
trans.df$TransPropFrom <- trans[2, ]
trans.df$TransPropTo <- trans[3, ]
trans.df
return(trans.df)
})
# Bind the results from the different resolutions together
<- do.call("rbind", transitions)
transitions
# Tidy everything up
<- sort(as.numeric(levels(transitions$ToClust)))
levs <- transitions %>%
transitions mutate(FromClust = factor(FromClust,
levels = levs)) %>%
mutate(ToClust = factor(ToClust, levels = levs))
return(transitions)
}
<- getEdges(clusterings)
edges head(edges)
## FromClust ToClust FromRes ToRes TransCount TransPropFrom TransPropTo
## 1 0 0 0.0 0.3 135 0.3600000 1
## 2 0 1 0.0 0.3 100 0.2666667 1
## 3 0 2 0.0 0.3 60 0.1600000 1
## 4 0 3 0.0 0.3 50 0.1333333 1
## 5 0 4 0.0 0.3 30 0.0800000 1
## 6 0 0 0.3 0.6 0 0.0000000 0
Some of these columns are pretty obvious but the last three could do with an explanation. TransCount
is the number of cells that move along this edge. TransPropFrom
is the proportion of the cells in the lower resolution cluster that have made this transition and TransPropTo
is the proportion of cells in the higher resolution cluster that came from this edge.
Getting the information about the nodes of the tree is easier as these just represent the clusters. This function summarises the cluster information and converts it to long format.
<- function(clusterings) {
getNodes <- clusterings %>%
nodes gather(key = Res, value = Cluster) %>%
group_by(Res, Cluster) %>%
summarise(Size = n()) %>%
ungroup() %>%
mutate(Res = stringr::str_replace(Res, "res.", "")) %>%
mutate(Res = as.numeric(Res), Cluster = as.numeric(Cluster)) %>%
mutate(Node = paste0("R", Res, "C", Cluster)) %>%
select(Node, everything())
}
<- getNodes(clusterings)
nodes head(nodes)
## # A tibble: 6 x 4
## Node Res Cluster Size
## <chr> <dbl> <dbl> <int>
## 1 R0C0 0.0 0 375
## 2 R0.3C0 0.3 0 135
## 3 R0.3C1 0.3 1 100
## 4 R0.3C2 0.3 2 60
## 5 R0.3C3 0.3 3 50
## 6 R0.3C4 0.3 4 30
Each node needs a unique ID which I have made by combining the resolution and cluster number. We also record the number of cells in each cluster.
Now we can build the graph we will use as the starting point for our plot. Some of the possible edges between clusters will have no cells travelling along them so we filter them out. We also remove edges that correspond to a small proportion (< 2%) of cells in the higher resolution cluster.
<- edges %>%
graph # Remove edges without any cell...
filter(TransCount > 0) %>%
# ...or making up only a small proportion of the new cluster
filter(TransPropTo > 0.02) %>%
# Rename the nodes
mutate(FromNode = paste0("R", FromRes, "C", FromClust)) %>%
mutate(ToNode = paste0("R", ToRes, "C", ToClust)) %>%
# Reorder columns
select(FromNode, ToNode, everything()) %>%
# Build a graph using igraph
graph_from_data_frame(vertices = nodes)
print(graph)
## IGRAPH b1b93c3 DN-- 23 23 --
## + attr: name (v/c), Res (v/n), Cluster (v/n), Size (v/n),
## | FromClust (e/c), ToClust (e/c), FromRes (e/n), ToRes (e/n),
## | TransCount (e/n), TransPropFrom (e/n), TransPropTo (e/n)
## + edges from b1b93c3 (vertex names):
## [1] R0C0 ->R0.3C0 R0C0 ->R0.3C1 R0C0 ->R0.3C2 R0C0 ->R0.3C3
## [5] R0C0 ->R0.3C4 R0.3C1->R0.6C0 R0.3C0->R0.6C1 R0.3C4->R0.6C1
## [9] R0.3C0->R0.6C2 R0.3C2->R0.6C3 R0.3C3->R0.6C4 R0.6C0->R0.9C0
## [13] R0.6C2->R0.9C1 R0.6C3->R0.9C2 R0.6C1->R0.9C3 R0.6C4->R0.9C4
## [17] R0.6C1->R0.9C5 R0.9C0->R1.2C0 R0.9C1->R1.2C1 R0.9C2->R1.2C2
## [21] R0.9C3->R1.2C3 R0.9C4->R1.2C4 R0.9C5->R1.2C5
Plot the tree
The last step is to pass our graph to the ggraph
library for plotting.
# Plot our graph using the `tree` layout
ggraph(graph, layout = "tree") +
# Plot the edges, colour is the number of cells and transparency is the
# proportion contribution to the new cluster
geom_edge_link(arrow = arrow(length = unit(1, 'mm')),
end_cap = circle(3.5, "mm"), edge_width = 1,
aes(colour = log(TransCount), alpha = TransPropTo)) +
# Plot the nodes, size is the number of cells
geom_node_point(aes(colour = factor(Res),
size = Size)) +
geom_node_text(aes(label = Cluster), size = 3) +
# Adjust the scales
scale_size(range = c(4, 15)) +
scale_edge_colour_gradientn(colours = viridis(100)) +
# Add legend labels
guides(size = guide_legend(title = "Cluster Size", title.position = "top"),
colour = guide_legend(title = "Clustering Resolution",
title.position = "top"),
edge_colour = guide_edge_colorbar(title = "Cell Count (log)",
title.position = "top"),
edge_alpha = guide_legend(title = "Cluster Prop",
title.position = "top", nrow = 2)) +
# Remove the axes as they don't really mean anything
theme_void() +
theme(legend.position = "bottom")
And here is the result! We can see that see that Seurat
finds three of the clusters easily and that these don’t change as the resolution increases. A fourth group contains most of the cells and is sub-divided as we increase resolution. Interestingly at the lowest resolution there is a small cluster which is then absorbed into one of the other branches.
This tree is cleaner and has less branches than what we would be likely to see with a real dataset but the process to create it would be the same. I have used Seurat
as the clustering method in this example but it should be easy to adapt the process to any other method that allows you to adjust the number of clusters. I have found this visualisation useful in my analysis particularly for looking at which clusters are very distinct and the relationships between different clusters and clusterings.
Good luck creating your own clustering trees!
Session information
::session_info()
devtools
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.1 (2017-06-30)
## system x86_64, darwin15.6.0
## ui RStudio (1.0.143)
## language (EN)
## collate en_AU.UTF-8
## tz Australia/Melbourne
## date 2017-07-19
## Packages -----------------------------------------------------------------
## package * version date source
## AnnotationDbi 1.38.1 2017-06-01 Bioconductor
## ape 4.1 2017-02-14 cran (@4.1)
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.1 2017-07-07 local
## beeswarm 0.2.3 2016-04-25 CRAN (R 3.4.0)
## bindr 0.1 2016-11-13 CRAN (R 3.4.0)
## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.0)
## Biobase * 2.36.2 2017-05-04 Bioconductor
## BiocGenerics * 0.22.0 2017-04-25 Bioconductor
## BiocParallel 1.10.1 2017-05-03 Bioconductor
## biomaRt 2.32.1 2017-06-09 Bioconductor
## bit 1.1-12 2014-04-09 CRAN (R 3.4.0)
## bit64 0.9-7 2017-05-08 CRAN (R 3.4.0)
## bitops 1.0-6 2013-08-17 CRAN (R 3.4.0)
## blob 1.1.0 2017-06-17 CRAN (R 3.4.0)
## broom 0.4.2 2017-02-13 CRAN (R 3.4.0)
## car 2.1-5 2017-07-04 cran (@2.1-5)
## caret 6.0-76 2017-04-18 cran (@6.0-76)
## caTools 1.17.1 2014-09-10 cran (@1.17.1)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## checkmate 1.8.3 2017-07-03 CRAN (R 3.4.1)
## class 7.3-14 2015-08-30 CRAN (R 3.4.1)
## cluster 2.0.6 2017-03-10 CRAN (R 3.4.1)
## codetools 0.2-15 2016-10-05 CRAN (R 3.4.1)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.1 2017-07-07 local
## cowplot * 0.7.0 2016-10-28 cran (@0.7.0)
## data.table 1.10.4 2017-02-01 CRAN (R 3.4.0)
## datasets * 3.4.1 2017-07-07 local
## DBI 0.7 2017-06-18 CRAN (R 3.4.0)
## DEoptimR 1.0-8 2016-11-19 cran (@1.0-8)
## devtools 1.13.2 2017-06-02 CRAN (R 3.4.0)
## digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
## diptest 0.75-7 2016-12-05 cran (@0.75-7)
## dplyr * 0.7.1 2017-06-22 CRAN (R 3.4.1)
## edgeR 3.18.1 2017-05-06 Bioconductor
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## fastICA 1.2-1 2017-06-12 cran (@1.2-1)
## flexmix 2.3-14 2017-04-28 cran (@2.3-14)
## FNN 1.1 2013-07-31 cran (@1.1)
## forcats 0.2.0 2017-01-23 CRAN (R 3.4.0)
## foreach 1.4.3 2015-10-13 cran (@1.4.3)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.1)
## fpc 2.1-10 2015-08-14 cran (@2.1-10)
## gdata 2.18.0 2017-06-06 cran (@2.18.0)
## ggbeeswarm 0.5.3 2016-12-01 CRAN (R 3.4.0)
## ggforce 0.1.1 2016-11-28 CRAN (R 3.4.0)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
## ggraph * 1.0.0 2017-02-24 CRAN (R 3.4.0)
## ggrepel 0.6.5 2016-11-24 CRAN (R 3.4.0)
## glue 1.1.1 2017-06-21 CRAN (R 3.4.1)
## gplots 3.0.1 2016-03-30 cran (@3.0.1)
## graphics * 3.4.1 2017-07-07 local
## grDevices * 3.4.1 2017-07-07 local
## grid 3.4.1 2017-07-07 local
## gridExtra 2.2.1 2016-02-29 CRAN (R 3.4.0)
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## gtools 3.5.0 2015-05-29 cran (@3.5.0)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.1)
## hms 0.3 2016-11-22 CRAN (R 3.4.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httpuv 1.3.5 2017-07-04 CRAN (R 3.4.1)
## httr 1.2.1 2016-07-03 CRAN (R 3.4.0)
## igraph * 1.1.1 2017-07-16 CRAN (R 3.4.1)
## IRanges 2.10.2 2017-05-25 Bioconductor
## irlba 2.2.1 2017-05-17 cran (@2.2.1)
## iterators 1.0.8 2015-10-13 cran (@1.0.8)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## kernlab 0.9-25 2016-10-03 cran (@0.9-25)
## KernSmooth 2.23-15 2015-06-29 CRAN (R 3.4.1)
## knitr 1.16 2017-05-18 CRAN (R 3.4.1)
## labeling 0.3 2014-08-23 CRAN (R 3.4.0)
## lars 1.2 2013-04-24 cran (@1.2)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.1)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
## limma 3.32.3 2017-07-16 Bioconductor
## lme4 1.1-13 2017-04-19 cran (@1.1-13)
## locfit 1.5-9.1 2013-04-20 CRAN (R 3.4.0)
## lubridate 1.6.0 2016-09-13 CRAN (R 3.4.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## MASS 7.3-47 2017-02-26 CRAN (R 3.4.1)
## Matrix 1.2-10 2017-05-03 CRAN (R 3.4.1)
## MatrixModels 0.4-1 2015-08-22 cran (@0.4-1)
## matrixStats 0.52.2 2017-04-14 CRAN (R 3.4.0)
## mclust 5.3 2017-05-21 cran (@5.3)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.1 2017-07-07 local
## mgcv 1.8-17 2017-02-08 CRAN (R 3.4.1)
## mime 0.5 2016-07-07 CRAN (R 3.4.0)
## minqa 1.2.4 2014-10-09 cran (@1.2.4)
## mixtools 1.1.0 2017-03-10 cran (@1.1.0)
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## ModelMetrics 1.1.0 2016-08-26 cran (@1.1.0)
## modelr 0.1.0 2016-08-31 CRAN (R 3.4.0)
## modeltools 0.2-21 2013-09-02 cran (@0.2-21)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## mvtnorm 1.0-6 2017-03-02 cran (@1.0-6)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.1)
## nloptr 1.0.4 2014-08-04 cran (@1.0.4)
## nnet 7.3-12 2016-02-02 CRAN (R 3.4.1)
## numDeriv 2016.8-1 2016-08-27 cran (@2016.8-)
## parallel * 3.4.1 2017-07-07 local
## pbapply 1.3-3 2017-07-04 cran (@1.3-3)
## pbkrtest 0.4-7 2017-03-15 cran (@0.4-7)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## prabclus 2.2-6 2015-01-14 cran (@2.2-6)
## psych 1.7.5 2017-05-03 CRAN (R 3.4.1)
## purrr * 0.2.2.2 2017-05-11 CRAN (R 3.4.0)
## quantreg 5.33 2017-04-18 cran (@5.33)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## ranger 0.8.0 2017-06-20 cran (@0.8.0)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.0)
## Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.1)
## RCurl 1.95-4.8 2016-03-01 CRAN (R 3.4.0)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
## rhdf5 2.20.0 2017-04-25 Bioconductor
## rjson 0.2.15 2014-11-03 CRAN (R 3.4.0)
## rlang 0.1.1 2017-05-18 CRAN (R 3.4.0)
## rmarkdown 1.6 2017-06-15 CRAN (R 3.4.1)
## robustbase 0.92-7 2016-12-09 cran (@0.92-7)
## ROCR 1.0-7 2015-03-26 cran (@1.0-7)
## rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
## RSQLite 2.0 2017-06-19 CRAN (R 3.4.1)
## Rtsne 0.13 2017-04-14 cran (@0.13)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## S4Vectors 0.14.3 2017-06-03 Bioconductor
## scales 0.4.1 2016-11-09 CRAN (R 3.4.0)
## scater * 1.4.0 2017-04-25 Bioconductor
## segmented 0.5-2.1 2017-06-14 cran (@0.5-2.1)
## Seurat * 1.4.0.16 2017-07-19 Github (satijalab/seurat@3bd092a)
## shiny 1.0.3 2017-04-26 CRAN (R 3.4.0)
## shinydashboard 0.6.1 2017-06-14 CRAN (R 3.4.0)
## sn 1.5-0 2017-02-10 cran (@1.5-0)
## SparseM 1.77 2017-04-23 cran (@1.77)
## splatter * 1.0.3 2017-05-27 Bioconductor
## splines 3.4.1 2017-07-07 local
## stats * 3.4.1 2017-07-07 local
## stats4 3.4.1 2017-07-07 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
## stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
## survival 2.41-3 2017-04-04 CRAN (R 3.4.1)
## tclust 1.2-7 2017-06-30 cran (@1.2-7)
## tibble * 1.3.3 2017-05-28 CRAN (R 3.4.0)
## tidyr * 0.6.3 2017-05-15 CRAN (R 3.4.0)
## tidyverse * 1.1.1 2017-01-27 CRAN (R 3.4.0)
## tools 3.4.1 2017-07-07 local
## trimcluster 0.1-2 2012-10-29 cran (@0.1-2)
## tsne 0.1-3 2016-07-15 cran (@0.1-3)
## tweenr 0.1.5 2016-10-10 CRAN (R 3.4.0)
## tximport 1.4.0 2017-04-25 Bioconductor
## udunits2 0.13 2016-11-17 CRAN (R 3.4.0)
## units 0.4-5 2017-06-15 CRAN (R 3.4.0)
## utils * 3.4.1 2017-07-07 local
## VGAM 1.0-3 2017-01-11 cran (@1.0-3)
## vipor 0.4.5 2017-03-22 CRAN (R 3.4.0)
## viridis * 0.4.0 2017-03-27 CRAN (R 3.4.0)
## viridisLite * 0.2.0 2017-03-24 CRAN (R 3.4.0)
## withr 1.0.2 2016-06-20 CRAN (R 3.4.0)
## XML 3.98-1.9 2017-06-19 CRAN (R 3.4.1)
## xml2 1.1.1 2017-01-24 CRAN (R 3.4.0)
## xtable 1.8-2 2016-02-05 CRAN (R 3.4.0)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
## zlibbioc 1.22.0 2017-04-25 Bioconductor