Last updated: 2019-04-03


This page describes how to download the data and code used in this analysis, set up the project directory and rerun the analysis. I have used the workflowr package to organise the analysis and insert reproducibility information into the output documents. The packrat package has also been used to manage R package versions and dependencies, and conda has been used to manage Python environments.

Getting the code

All the code and outputs of the analysis are available from GitHub at https://github.com/lazappi/phd-thesis-analysis. If you want to replicate the analysis you can either fork the repository and clone it, or download the repository as a zipped directory.

Once you have a local copy of the repository you should see the following directory structure:

Installing R packages

R packages and dependencies for this project are managed using packrat. This should allow you to install and use the same package versions that I used for the analysis. packrat should automatically take care of this process for you the first time that you open R in the project directory. If for some reason this does not happen you may need to run the following commands:

install.packages("packrat")
packrat::restore()

Note that a clean install of all the required packages can take a significant amount of time when the project is first opened.

Setting up Python environments

The PAGA and cell velocity parts of the analysis require the scanpy and velocyto Python packages. I have used conda to manage environments for these packages. If you have conda installed you can set up these environments by running the following commands:

conda env create -f env-scanpy.yml
conda env create -f env-velocyto.yml

The environments can then be activated using conda:

# To use scanpy
conda activate scanpy
# To use velocyto
conda activate velocyto

Getting the data

In this project I have used the first batch of kidney organoid samples included in GEO accession number GSE114802. The GEO entry contains processed expression matrices from Cell Ranger but for this analysis I started with the raw FASTQ files which can be downloaded from SRA accession SRP148773. Some pre-processing of the dataset was done on the command line to produce datasets in a form suitable for statistical analysis in R. These steps are described on the methods page and examples of commands for these steps are provided in the scripts directory. If you don’t want to perform these steps yourself you can download the processed data from this Figshare repository. This repository also contains intermediate files from the statistical analysis.

Once the processed data has been produced or downloaded it needs to be placed in the correct location. The analysis code assumes the following directory structure inside the data/ directory:

Running the analysis

The analysis directory contains the following analysis files:

As indicated by the numbering they should be run in this order. If you want to rerun the entire analysis this can be done with a single workflowr command:

workflowr::wflow_build(republish = TRUE)

It is important to consider the computer and environment you are using before doing this. Running this analysis from scratch requires a considerable amount of time, disk space and memory. Some stages of the analysis also assume that multiple (up to 10) cores are available for processing. If you have fewer cores available you will need to change the following line in the relevant files and provide the number of cores that are available for use.

bpparam <- MulticoreParam(workers = 10)
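Finding and editing every file that sets the worker count by hand can be tedious, so the following is a minimal shell sketch of one way to do it. It makes assumptions that are not confirmed by this project: that the analysis documents are .Rmd files directly inside the analysis directory, and that the line is written exactly as shown above. The set_workers function is a hypothetical helper, not part of workflowr or BiocParallel.

```shell
# Sketch: replace the hard-coded worker count in the analysis documents.
# Assumptions: documents are <dir>/*.Rmd and contain the exact text
# "MulticoreParam(workers = 10)". set_workers is a hypothetical helper.

set_workers() {
    dir="$1"      # directory holding the analysis documents
    workers="$2"  # number of cores to use instead of 10
    # grep -l prints only the names of files that contain the line
    grep -l 'MulticoreParam(workers = 10)' "$dir"/*.Rmd 2>/dev/null |
    while IFS= read -r file; do
        # -i.bak edits in place and keeps a backup copy (GNU and BSD sed)
        sed -i.bak "s/MulticoreParam(workers = 10)/MulticoreParam(workers = ${workers})/" "$file"
        echo "updated $file"
    done
}

# Cores reported by the operating system (falls back to 1 if unavailable)
n_cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
# Never ask for more than the 10 workers the analysis was written for
if [ "$n_cores" -gt 10 ]; then n_cores=10; fi
echo "would use $n_cores workers"
```

From the project root this could then be applied with set_workers analysis "$n_cores". The .bak backups let you restore the original files if the result is not what you expected. On a shared machine the number of cores reported may be higher than the number you are allowed to use, so you may prefer to pass a value explicitly.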

It is also possible to run individual stages of the analysis, either by providing the names of the files you want to run to workflowr::wflow_build() or by manually knitting the document (for example using the ‘Knit’ button in RStudio).

Caching

To avoid repeatedly re-running long-running sections of the analysis I have turned on caching in the analysis documents. However, this comes at a cost in disk space, usability and (potentially, though unlikely if you are careful) reproducibility. In most cases this should not be a problem but it is something to be aware of. In particular there is an incompatibility between caching and workflowr that can cause images not to appear in the resulting HTML files (see this GitHub issue for more details). If you have already run part of the analysis (and therefore have a cache) and want to rerun a document the safest option is to use the RStudio ‘Knit’ button.
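To get a feel for the disk space side of this tradeoff, a small shell sketch like the one below may help. It assumes that each document's cache is written to a directory whose name ends in _cache somewhere under the analysis directory, which is the usual R Markdown default but has not been checked against this project. The list_caches function is a hypothetical helper name.

```shell
# Sketch: report the size of each cache directory under a given directory.
# Assumption: caches live in directories named <document>_cache.
# list_caches is a hypothetical helper, not part of knitr or workflowr.

list_caches() {
    # -type d restricts the search to directories; du -sh prints one
    # human-readable total per cache directory that was found
    find "$1" -type d -name '*_cache' -exec du -sh {} +
}
```

Running list_caches analysis from the project root prints one line per cache directory, and nothing at all if no cache exists yet. Removing one of the listed directories forces that document to be recomputed from scratch the next time it is built.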


This reproducible R Markdown analysis was created with workflowr 1.1.1