Last updated: 2019-04-03
workflowr checks:

✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20190110)

The command set.seed(20190110) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 5870e6e
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated. Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: ._.DS_Store
Ignored: analysis/cache/
Ignored: build-logs/
Ignored: data/alevin/
Ignored: data/cellranger/
Ignored: data/processed/
Ignored: data/published/
Ignored: output/.DS_Store
Ignored: output/._.DS_Store
Ignored: output/03-clustering/selected_genes.csv.zip
Ignored: output/04-marker-genes/de_genes.csv.zip
Ignored: packrat/.DS_Store
Ignored: packrat/._.DS_Store
Ignored: packrat/lib-R/
Ignored: packrat/lib-ext/
Ignored: packrat/lib/
Ignored: packrat/src/
Untracked files:
Untracked: DGEList.Rds
Untracked: output/90-methods/package-versions.json
Untracked: scripts/build.pbs
Unstaged changes:
Modified: analysis/_site.yml
Modified: output/01-preprocessing/droplet-selection.pdf
Modified: output/01-preprocessing/parameters.json
Modified: output/01-preprocessing/selection-comparison.pdf
Modified: output/01B-alevin/alevin-comparison.pdf
Modified: output/01B-alevin/parameters.json
Modified: output/02-quality-control/qc-thresholds.pdf
Modified: output/02-quality-control/qc-validation.pdf
Modified: output/03-clustering/cluster-comparison.pdf
Modified: output/03-clustering/cluster-validation.pdf
Modified: output/04-marker-genes/de-results.pdf
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
This page describes how to download the data and code used in this analysis, set up the project directory and rerun the analysis. I have used the workflowr package to organise the analysis and insert reproducibility information into the output documents. The packrat package has been used to manage R package versions and dependencies, and conda has been used to manage Python environments.
All the code and outputs of the analysis are available from GitHub at https://github.com/lazappi/phd-thesis-analysis. If you want to replicate the analysis you can either fork the repository and clone it, or download the repository as a zipped directory.
Once you have a local copy of the repository you should see the following directory structure:

- analysis/ - Contains the R Markdown documents with the various stages of analysis. These are numbered according to the order they should be run.
- data/ - This directory contains the data files used in the analysis with each dataset in its own sub-directory (see Getting the data for details). Processed intermediate data files will also be placed here.
- output/ - Directory for output files produced by the analysis, each analysis step has its own sub-directory.
- docs/ - This directory contains the analysis website hosted at http://lazappi.github.io/phd-thesis-analysis, including image files.
- R/ - R scripts with custom functions used in some analysis stages.
- scripts/ - Python scripts and examples of how command line tools were run.
- packrat/ - Directory created by packrat that contains details of the R packages and versions used in the analysis.
- env-scanpy.yml - conda environment for scanpy
- env-velocyto.yml - conda environment for velocyto.py
- README.md - README describing the project.
- .Rprofile - Custom R profile for the project including set up for packrat and workflowr.
- .gitignore - Details of files and directories that are excluded from the repository.
- _workflowr.yml - workflowr configuration file.
- phd-thesis-analysis.Rproj - RStudio project file.

R packages and dependencies for this project are managed using packrat. This should allow you to install and use the same package versions as we have used for the analysis. packrat should automatically take care of this process for you the first time that you open R in the project directory. If for some reason this does not happen you may need to run the following commands:
install.packages("packrat")
packrat::restore()
Note that a clean install of all the required packages can take a significant amount of time when the project is first opened.
The PAGA and cell velocity parts of the analysis require the scanpy and velocyto Python packages. I have used conda to manage environments for these packages. If you have conda installed you can set up these environments by running the following commands:
conda env create -f env-scanpy.yml
conda env create -f env-velocyto.yml
The environments can then be activated using conda:
# To use scanpy
conda activate scanpy
# To use velocyto
conda activate velocyto
In this project I have used the first batch of kidney organoid samples included in GEO accession number GSE114802. The GEO entry contains processed expression matrices from Cell Ranger, but for this analysis I started with the raw FASTQ files, which can be downloaded from SRA accession SRP148773. Some pre-processing of the dataset was done on the command line to produce datasets in a form suitable for statistical analysis in R. These steps are described on the methods page, and examples of commands for these steps are provided in the scripts directory. If you don’t want to perform these steps yourself you can download the processed data from this Figshare repository. This repository also contains intermediate files from the statistical analysis.
Once the processed data has been produced or downloaded it needs to be placed in the correct location. The analysis code assumes the following directory structure inside the data/ directory:
- alevin/ - Expression matrices produced by alevin
  - Org1/ - alevin output for Organoid 1
    - quants_mat.gz - Expression matrix
    - quants_mat_cols.txt - Column labels for expression matrix
    - quants_mat_rows.txt - Row labels for expression matrix
  - Org2/ - alevin output for Organoid 2 (same files as Org1)
  - Org3/ - alevin output for Organoid 3 (same files as Org1)
- cellranger/ - Output from the Cell Ranger pipeline
  - barcodes.tsv.gz - Unfiltered list of droplet barcodes
  - features.tsv.gz - List of annotated features in the dataset
  - filtered_barcodes.tsv.gz - Filtered list of droplet barcodes
  - matrix.mtx.gz - Unfiltered expression matrix
- processed/ - Intermediate files produced during the statistical analysis. These are produced as the code in the R Markdown files is run, so they aren’t required to run the analysis. Files are numbered according to the document that produces them.
  - 01-selected.Rds - SingleCellExperiment object containing selected cells
  - 01B-alevin.Rds - SingleCellExperiment object containing the dataset produced by alevin
  - 02-filtered.Rds - SingleCellExperiment object following quality control
  - 03-clustered.Rds - SingleCellExperiment object with cluster labels
  - 03-seurat.Rds - seurat object used during clustering analysis
  - 03-clustered-sel.loom - Loom file containing genes selected for clustering and used for PAGA analysis
  - 04-markers.Rds - SingleCellExperiment object following marker gene detection
  - 04-DGEGLM.Rds - DGEGLM object used for edgeR differential expression analysis
  - 05-paga.loom - Loom file with results from PAGA analysis
  - 06-spliced.Rds - Spliced expression matrix produced by velocyto
  - 06-unspliced.Rds - Unspliced expression matrix produced by velocyto
  - 06-tSNE-embedding.Rds - t-SNE cell velocity embedding from velocyto
  - 06-umap-embedding.Rds - UMAP cell velocity embedding from velocyto
- published/ - Results from the previously published analysis of this dataset
  - cluster_assignments.csv - Cell cluster assignments from published analysis
- references/ - References mentioned during the analysis and on the website
  - references.bib - BibTeX file of references
- velocyto/ - Spliced and unspliced quantification produced by velocyto.py
  - Org1.loom - Loom file produced by velocyto for Organoid 1
  - Org2.loom - Loom file produced by velocyto for Organoid 2
  - Org3.loom - Loom file produced by velocyto for Organoid 3

The analysis directory contains the following analysis files:
- 01-preprocessing.html - Reading of datasets produced using Cell Ranger, comparison of droplet selection methods, annotation of the dataset.
- 01B-alevin.html - Comparison of the alignment-based dataset from Cell Ranger with the same data processed using the alevin method in Salmon.
- 02-quality-control.html - Selection of high-quality cells and removal of uninformative genes.
- 03-clustering.html - Clustering using the Seurat package with a comparison of methods for gene selection.
- 04-marker-genes.html - Marker gene detection for each cluster using edgeR.
- 05-paga.html - Partition-based graph abstraction using the scanpy Python package with visualisation in R.
- 06-velocyto.html - Cell velocity estimates using the velocyto package.
- 90-methods.html - Description of methods used during the analysis.

As indicated by the numbering, they should be run in this order. If you want to rerun the entire analysis this can easily be done using workflowr:
workflowr::wflow_build(republish = TRUE)
It is important to consider the computer and environment you are using before doing this. Running this analysis from scratch requires a considerable amount of time, disk space and memory. Some stages of the analysis also assume that multiple (up to 10) cores are available for processing. If you have fewer cores available you will need to change the following line in the relevant files and provide the number of cores that are available for use.
bpparam <- MulticoreParam(workers = 10)
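One hedged way to adapt this is to cap the worker count at whatever the current machine provides, rather than hard-coding 10. This is a sketch, not part of the original analysis code; the `n_workers` variable is my own:

```r
# Sketch: choose a worker count that fits the current machine.
# parallel::detectCores() can return NA on some platforms, hence the fallback.
cores <- parallel::detectCores()
n_workers <- min(10, ifelse(is.na(cores), 1, cores))

# In the relevant analysis files, the hard-coded line would then become:
# bpparam <- BiocParallel::MulticoreParam(workers = n_workers)
```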
It is also possible to run individual stages of the analysis, either by providing the names of the files you want to run to workflowr::wflow_build() or by manually knitting the document (for example using the ‘Knit’ button in RStudio).
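For example, a single stage can be rebuilt by passing the path of its R Markdown source (the .Rmd file behind the corresponding .html page) to wflow_build(). A sketch, assuming you are in the project root directory:

```r
# Rebuild one analysis document rather than the whole site.
rmd <- "analysis/03-clustering.Rmd"  # source behind 03-clustering.html

# Either let workflowr build it (run inside the project directory):
# workflowr::wflow_build(rmd)

# ...or knit it directly:
# rmarkdown::render(rmd)
```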
To avoid having to repeatedly re-run long-running sections of the analysis I have turned on caching in the analysis documents. However, this comes at a cost in disk space, usability and (potentially, although unlikely if you are careful) reproducibility. In most cases this should not be a problem but it is something to be aware of. In particular, there is an incompatibility between caching and workflowr that can cause images to not appear in the resulting HTML files (see this GitHub issue for more details). If you have already run part of the analysis (and therefore have a cache) and want to rerun a document, the safest option is to use the RStudio ‘Knit’ button.
This reproducible R Markdown analysis was created with workflowr 1.1.1