Huang, Liren: Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. 2019
Contents
- Title page
- Abstract
- Acknowledgement
- 1 Introduction
- 2 Related Work
- 2.1 The Apache Hadoop and Spark frameworks
- 2.2 Sequence alignment and its cloud implementations
- 2.2.1 Short read alignment and fragment recruitment
- 2.2.2 Algorithms for sequence alignment
- 2.2.3 Distributed implementations
- 2.3 De novo assembly and its cloud implementations
- 2.3.1 Algorithms for short read de novo assembly
- 2.3.2 State-of-the-art de Bruijn graph
- 2.3.3 Cloud-based de novo assemblers
- 2.4 Conclusion
- 3 Sparkhit: Distributed sequence alignment
- 3.1 The pipeline for sequence alignment
- 3.1.1 Building reference index
- 3.1.2 Candidate block searching and q-Gram filters
- 3.1.3 Pigeonhole principle
- 3.1.4 Banded alignment
- 3.2 Distributed implementation
- 3.2.1 Reference index serialization and broadcasting
- 3.2.2 Data representation in the Spark RDD
- 3.2.3 Concurrent in-memory searching
- 3.2.4 Memory tuning for Spark native implementation
- 3.3 Using external tools and Docker containers
- 3.4 Integrating Spark's machine learning library (MLlib)
- 3.5 Parallel data preprocessing
- 3.6 Results and Discussion
- 3.6.1 Run time comparison between different mappers
- 3.6.2 Scaling performance of Sparkhit-recruiter
- 3.6.3 Accuracy and sensitivity of natively implemented tools
- 3.6.4 Fragment recruitment comparison with MetaSpark
- 3.6.5 Preprocessing comparison with Crossbow
- 3.6.6 Machine learning library benchmarking and run time performances on different clusters
- 3.6.7 Cluster configurations for the benchmarks
- 3.6.8 NGS data sets for the benchmarks
- 3.6.9 Discussion
- 4 Reflexiv: Parallel de novo genome assembly
- 4.1 Reflexible Distributed K-mer (RDK)
- 4.2 Random k-mer reflecting and recursion
- 4.3 Distributed implementation
- 4.4 Repeat detection and bubble popping
- 4.5 The assembly pipeline
- 4.6 Time complexity
- 4.7 Memory consumption
- 4.8 Results and Discussion
- 5 Large scale genomic data analyses
- 5.1 Cluster deployment and configuration
- 5.2 Data storage and accessibility
- 5.3 Distributed data downloading and decompression
- 5.4 Rapid NGS data analyses on the AWS cloud
- 5.4.1 Processing all WGS data of the Human Microbiome Project
- 5.4.2 Genotyping on 3000 samples of the 3000 Rice Genomes Project
- 5.4.3 Mapping 106 samples of the 1000 Genomes Project
- 5.4.4 Gene expression profiling on prostate cancer RNA-seq data
- 5.5 Metagenomic profiling and functional analysis
- 5.6 Discussion
- 6 Conclusion and outlook
- Bibliography
- Colophon
- Declaration
