Tools and techniques for single-cell RNA sequencing data
Luke Zappia
Front matter
Abstract
RNA sequencing of individual cells allows us to take a snapshot of the dynamic processes within a cell and explore differences between cell types. As this technology has developed over the last few years it has been rapidly adopted by researchers in areas such as developmental biology, and many single-cell RNA sequencing datasets are now available. Coinciding with the development of protocols for producing single-cell RNA sequencing data there has been a simultaneous burst in the development of computational analysis methods. My thesis explores the computational tools and techniques for analysing single-cell RNA sequencing data. I present a database that charts the release of analysis software, where it has been published and what it can be used for, as well as a website that makes this information publicly available. I also present two of my own tools and techniques including Splatter, a software package for easily simulating single-cell datasets from multiple models, and clustering trees, a visualisation approach for inspecting clustering at multiple resolutions. In the final part of my thesis I perform analysis of a dataset from kidney organoids to demonstrate and compare some current analysis methods. Taken together, my thesis covers many aspects of the tools and techniques for single-cell RNA sequencing by describing the approaches that are available, presenting software that can help in developing and evaluating methods, introducing an approach for aiding one of the most common analysis tasks, and showing how tools can be used to extract meaning from a real dataset.
Declaration
This is to certify that:
this thesis comprises only my original work towards the degree of Doctor of Philosophy except where indicated in the preface;
due acknowledgement has been made in the text to all other material used; and
this thesis is fewer than the 100,000 words in length, exclusive of tables, maps, bibliographies and appendices
Preface
This preface provides a summary of the chapters in this thesis and describes my contribution to them as well as the contributions of my collaborators and supervisors. This is a thesis with publication and where publications form part of a chapter they are listed here. The following publications are included as part of this thesis:
Zappia L, Phipson B, Oshlack A. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.” PLoS Computational Biology. 2018. DOI: 10.1371/journal.pcbi.1006245
Zappia L, Phipson B, Oshlack A. “Splatter: simulation of single-cell RNA sequencing data.” Genome Biology. 2017. DOI: 10.1186/s13059-017-1305-0
Zappia L, Oshlack A. “Clustering trees: a visualization for evaluating clusterings at multiple resolutions.” GigaScience. 2018. DOI: 10.1093/gigascience/giy083
These publications are included as they appear online and are designed to be read as stand-alone documents. Sections within these publications are not included in the table of contents, and references are available at the end of each publication rather than in the reference list for this thesis. The contributions of authors to these papers are explained below. I am the first author on these publications and contributed more than 50 percent of the work towards them including drafting, editing and revising the manuscripts. My co-authors have provided signed declarations acknowledging and supporting my contributions which have been submitted along with this thesis. Where publicly available datasets have been used these have been appropriately cited.
Chapter 1: Introduction is an original work providing a background and overview relevant to understanding my work in this thesis including an introduction to RNA sequencing, single-cell RNA sequencing and kidney function and development.
Chapter 2: The scRNA-seq tools landscape is an original work describing a database of software tools for analysing single-cell RNA sequencing data which has been published in PLoS Computational Biology as “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database”. In addition to this publication I developed a website displaying the information in this database which is available at https://www.scRNA-tools.org. The database and code for building the website is available on GitHub at https://github.com/Oshlack/scRNA-tools under an MIT license.
Contributions to the work in this chapter:
I compiled and regularly updated the database of tools.
I designed and built the public website used to display the database. Breon Schmidt provided assistance with implementing some of the website functionality. Some of the code for processing the database was based on a script written by Sean Davis.
I performed the analysis of the database presented in the publication.
I wrote the first draft of the manuscript and produced all the figures in the publication.
Alicia Oshlack provided advice on planning the manuscript and edited draft versions.
Belinda Phipson contributed to writing the manuscript.
Chapter 3: Simulating scRNA-seq data is an original work describing a software package for simulating single-cell RNA sequencing expression data. This work was published in Genome Biology as “Splatter: simulation of single-cell RNA sequencing data”. The software package described in this publication is available through Bioconductor at https://bioconductor.org/packages/splatter and the code is shared on GitHub at https://github.com/Oshlack/splatter under a GPL-3.0 license.
Contributions to the work in this chapter:
I designed and implemented the Splatter R package described in this chapter.
Belinda Phipson contributed to the design of the Splat simulation method described in the publication and provided statistical advice.
I conducted the analysis presented in the publication and produced the figures shown.
Belinda Phipson performed pre-processing for some of the public datasets used.
Alicia Oshlack helped to design and plan the analysis presented in the publication.
I wrote the first draft of the manuscript and performed revisions.
Alicia Oshlack assisted with planning the manuscript and edited drafts.
Belinda Phipson helped write sections of the manuscript and edited drafts.
Jovana Maksimovic and Sarah Blood proofread a draft of the manuscript and provided comments.
Chapter 4: Visualising clustering across resolutions is an original work describing a visualisation for showing clustering results across multiple resolutions and helping to select a clustering resolution to use. This work has been published in GigaScience as “Clustering trees: a visualization for evaluating clusterings at multiple resolutions” and a software package implementing the algorithm described is available from CRAN at https://cran.r-project.org/package=clustree. The source code for this package can be found on GitHub at https://github.com/lazappi/clustree under a GPL-3.0 license.
Contributions to the work in this chapter:
I designed the clustering trees algorithm described in this chapter.
I designed and built the clustree R package that implements this algorithm.
I performed the analysis presented in the publication and designed and produced the figures shown.
Alicia Oshlack provided advice on the design and planning of the analysis to present.
I planned and wrote the first draft of the manuscript.
Alicia Oshlack provided advice on the structure of the manuscript and edited draft versions.
I performed revisions and drafted responses to reviewers.
Marek Cmero read and provided comments on a draft of the manuscript.
Chapter 5: Analysis of kidney organoid scRNA-seq data is an original work where I performed a re-analysis of a previously published single-cell RNA sequencing experiment from kidney organoids in order to demonstrate a range of analysis tools and decisions during analysis.
Contributions to the work in this chapter:
The dataset is publicly available from the Gene Expression Omnibus under accession GSE114802.
I performed pre-processing of the dataset
I designed and performed the analysis with input from Alicia Oshlack, Belinda Phipson, Melissa Little and Alex Combes.
Alex Combes helped with interpreting gene lists describing cell types.
I designed and created the figures shown in this chapter.
Chapter 6: Conclusion is an original work summarising the work in this thesis, placing it in the wider context of single-cell RNA sequencing analysis and outlining potential directions of the field.
Other publications that I have contributed to during my candidature but are not presented in this thesis
Phipson B, Zappia L, Oshlack A. “Gene length and detection bias in single cell RNA sequencing protocols.” F1000 Research. DOI: 10.12688/f1000research.11290.1.
Phipson B, Er PX, Combes AN, Forbes TA, Howden SE, Zappia L, Yen HJ, Lawlor KT, Hale LJ, Sun J, Wolvetang E, Takasato M, Oshlack A, Little MH. “Evaluation of variability in human kidney organoids.” Nature Methods. 2019. DOI: 10.1038/s41592-018-0253-2.
Combes AN+, Zappia L+, Er PX, Oshlack A, Little MH. “Single-cell analysis reveals congruence between kidney organoids and human fetal kidney.” Genome Medicine. 2019. DOI: 10.1186/s13073-019-0615-0
Kumar SV, Er PX, Lawlor KT, Motazedian A, Scurr M, Ghobrial I, Combes AN, Zappia L, Oshlack A, Stanley EG, Little MH. “Kidney micro-organoids in suspension culture as a scalable source of human pluripotent stem cell-derived kidney cells.” Development. 2019. DOI: 10.1242/dev.172361
Combes AN, Phipson B, Lawlor KT, Dorison A, Patrick R, Zappia L, Harvey RP, Oshlack, A, Little MH. “Single cell analysis of the developing mouse kidney provides deeper insight into marker gene expression and ligand-receptor crosstalk.” Development. 2019. DOI: 10.1242/dev.178673
+ Authors contributed equally.
Acknowledgements
Thank you to my supervisors Alicia Oshlack and Melissa Little, it has been a pleasure and a privilege to work with you over the last few years. You both provide supportive environments that let people get the most out of themselves and I am fortunate to have had you to guide me through my PhD.
That support extends to the other members of the Oshlack lab. It’s easy to come into work every day when you get on with your colleagues and enjoy spending time with them. A special thanks to Belinda Phipson who as being an amazing source of advice about statistics and analysis but also just life in general. The Little kidney development lab has also been extremely welcoming. Thank you to all of them for making me feel a part of their group and sharing their knowledge, particularly Alex Combes who it has been fantastic to work closely with.
I would also like to thank the member of my advisory committee Andrew Pask, Christine Wells and Edmund Crampin for their advice and encouragement.
Outside my work a big part of my PhD has been my involvement with COMBINE. Thank you to everyone who has contributed to building and continuing to develop the organisation, particularly Leah Roberts who has guided me through everything. The Australian bioinformatics student community is extremely fortunate to have such an organisation and I hope students and supervisors continue to support COMBINE and make it even better.
Thank you to my friends and family for the support you have provided outside my PhD, it’s important to have other things going on in the rest of my life. The biggest thanks to my partner Sarah for her continued love and care (and proofreading expertise).
Lastly I want to thank the wider scientific and programming communities. Everyone who makes their code, software, data and analysis available makes this kind of research possible, easier and more fun.
List of copyright
The copyright of the publication “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database” included in Chapter 2 is held by the authors including myself under a Creative Commons Attribution License (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/). This license permits unrestricted use, distribution and reproduction in any medium, provided the original authors and source are credited.
The copyright of the publication “Splatter: simulation of single-cell RNA sequencing data” included in Chapter 3 is held by the authors including myself under a Creative Commons Attribution License (CC BY 4.0). The additional files for this publication provided in Appendix A are covered by the same license.
The copyright of the Splatter R software package and the documentation included in Appendix B is held by the authors including myself under a GNU General Public License v3.0 (GPL-3) license (https://choosealicense.com/licenses/gpl-3.0/). This license allows me to reproduce the documentation in both print and electronics versions of this thesis.
The copyright of the publication “Clustering trees: a visualization for evaluating clusterings at multiple resolutions” included in Chapter 4 is held by the authors including myself under a Creative Commons Attribution License (CC BY 4.0).
The copyright of the clustree R software package and the documentation included in Appendix D is held by the authors including myself under a GNU General Public License v3.0 (GPL-3) license.
The copyright of the publication “Single-cell analysis reveals congruence between kidney organoids and human fetal kidney” included in Appendix E is held by the authors including myself under a Creative Commons Attribution License (CC BY 4.0).