F Analysis methods

F.1 Pre-processing

The Cell Ranger pipeline (v3.0.1) [34] was used to perform sample demultiplexing, barcode processing and single-cell gene counting. Briefly,samples were demultiplexed to produce a pair of FASTQ files for each sample. Reads containing sequence information were aligned using the reference provided with Cell Ranger (v3.0.0) based on the GRCh38 reference genome and ENSEMBL gene annotation (v93). PCR duplicates were removed by selecting unique combinations of cell barcodes, unique molecular identifiers and gene ids with the final results being a gene expression matrix that was used for further analysis. The three samples were aggregated using Cell Ranger with no normalisation and treated as a single dataset containing 945038 droplets. The R statistical programming language (v3.5.0) [228] was used for further analysis. Count data was read into R and used to construct a SingleCellExperiment object (v1.4.1) [229].

F.1.1 Droplet selection

The droplet selection method in this version of Cell Ranger identified 8291 cell containing droplets. This is more than the traditional Cell Ranger approach based on a simple threshold on total counts which identified 7006 cells. The EmptyDrops method [99] available in the DropletUtils package (v1.2.2) was also applied. Droplets with less than 100 total counts were used to construct the ambient RNA profile and an FDR threshold of less than 0.01 was used to select 9858 droplets. The droplets selected by either the Cell Ranger v3 method or EmptyDrops were kept giving a dataset with 10002 cells. Gene annotation information was added from BioMart [220] using the biomaRt package (v2.38.0) [221] and cells were assigned cell cycle scores using the cyclone [222] function in the scran package (v1.10.2) [109].

F.1.2 Alevin

An alternative quantification was performed using the alevin method [91] available as part of the salmon software package (v0.12.0) [12]. This produced a dataset with 10423 cells and 37247 genes.

F.2 Quality control

The scater package (v1.10.1) [101] was used to produce a series of diagnostic quality control plots and select thresholds for filtering. Cells were removed from the dataset if they had \(\log_{10}\) total counts less than 2.309 (4 MADs from the median), \(\log_{10}\) total features expressed less than 2.3659 (4 MADs) or more than 8.2411 percent of counts (4 MADs) assigned to mitochondrial genes. An automated filtering method method available in scater that detects outliers in a PCA of cells based on technical factors was also performed for comparison but not used for filtering. After filtering doublet scores for each cell were calculated using the simulation-based method available in scran and cells with a score greater than 1.5572 were removed. A minimal filter was used to remove genes that did not have at least 1 count in at least 2 cells. After quality control the dataset had 8059 cells and 22739 genes with a median of 7838 counts per cell and a median of 2506 genes expressed.

F.3 Clustering

F.3.1 Gene selection

Two methods were used to select genes to use for clustering. The default selection method in the Seurat package (v2.3.4) [129] uses thresholds on mean expression (between 0.0125 and 3.5) and dispersion (greater than 1), which selected 2457 genes. The M3Drop (v3.10.3) [104] method was also used to test for genes with significantly more zero counts than expected and 1952 genes with an adjusted p-value less than 0.01 were selected. Following comparisons between these two methods the genes identified by M3Drop were used for clustering.

F.3.2 Graph-based clustering

Seurat was used to perform graph-based clustering. PCA was performed using the selected genes and the first 15 principal components were used to construct a shared nearest neighbour graph using the overlap between the 30 nearest neighbours of each cell. Louvain modularity optimisation [133] was used to partition this graph with a resolution parameter between 0 and 1. Clustering tree visualisations [208] were produced using the clustree package (v0.3.0) showing the expression of previously identified marker genes. By inspecting these trees a resolution of 0.4 which chosen which produced 13 clusters. To compare these cluster to those that had been previously published the Jaccard index was calculated between pairs of clusters from each analysis and visualised as a heatmap.

F.4 Marker genes

Marker genes for each cluster were identified using edgeR (v3.24.3) [15,16]. Additional filtering was performed to remove genes with a \(\log_{10}\) maximum average counts in any cluster less than -1.3505. The edgeR negative binomial model was then fitted to the dataset using a design matrix that included the detection rate (scaled proportion of genes expressed in each cell). The quasi-likelihood F-test was used to test cells in one cluster against all other cells. Genes were considered significant markers for a cluster if they had an FDR less than 0.05 and a log fold change greater than 1. Identities were assigned to each cluster by comparing the detected genes to previously published lists of cell type markers.

F.5 Partition-based graph abstraction (PAGA)

Partition-based graph abstraction (PAGA) [226] was performed using the scanpy library (v1.4) [227] for the Python programming language (v3.6.8). A shared nearest neighbour graph was built using the same parameters as were used by Seurat and connectivity between the Seurat clusters was calculated using the PAGA algorithm. A ForceAtlas2 [230] layout of the cell graph was calculated based on the PAGA cluster graph. The results of the PAGA analysis were read into R for visualisation.

F.6 Cell velocity

The velocyto (v0.17.16) [151] Python program was used to calculate spliced and unspliced count matrices from the aligned reads. The resulting Loom files for each sample were then read into R and merged. The remaining analysis was performed using the velocyto R package (v0.6). Genes were removed from the spliced matrix if they had did not have a mean expression greater than 0.2 in at least one cluster or from the unspliced matrix if they did not have a mean expression greater than 0.05 in at least one cluster. Relative velocity estimates were then calculated for each cell and projected onto reduced dimensional spaces for visualisation.

F.7 Other packages

Visualisations and figures were primarily created using the ggplot2 (v3.1.0) [231] and cowplot (v0.9.4) [232] packages using the viridis colour palette (v0.5.1) for continuous data. UpSet plots [218] were produced using the UpSetR package (v1.3.3) [233] with help from the gridExtra package (v2.3) [234] and SinaPlots [224] using the ggforce package (v0.1.3) [235]. Data manipulation was performed using other packages in the tidyverse (v1.2.1) [236] particularly dplyr (v0.7.8) [237], tidyr (v0.7.8) [238] and purrr (v0.3.0) [239]. Loom files were exported from R using the LoomExperiment package (v1.0.2) [240]. The analysis project was managed using the workflowr (v1.1.1) [241] package which was also used to produce the publicly available website displaying the analysis code, results and output. Reproducible reports were produced using knitr (v1.21) [242–244] and R Markdown (v1.11) [245,246] and converted to HTML using Pandoc (v1.17.2).