Beckstette, Michael: Index-based algorithms for motif search and their integration in a system for differential genome analysis. 2007
Contents
- Introduction
- Modeling concepts for sequence motifs and consensi
- Basic definitions and nomenclature
- Motifs, domains, and sequence families
- Motif finding
- Regular expressions as motif descriptors
- Position specific scoring matrices
- From alignment blocks to PSSMs
- Sequence weighting procedures
- Basic PSSM construction principles
- PSSMs based on odds ratios
- Average score methods
- Explicit log-odd score methods
- Construction of amino acid PSSMs in the BLOCKS database
- Wu's minimal risk scoring matrices
- Construction of nucleotide PSSMs in the TRANSFAC database
- Gribskov's profile model
- Hidden Markov models
- Foundations of hidden Markov model theory
- Profile hidden Markov models
- Profile HMM collections for sequence annotation and classification
- Concluding remarks on sequence motif models
- Fast algorithms for matching position specific scoring matrices
- Introduction
- Pattern matching with PSSMs
- Improved running time through the usage of lookahead scoring
- PSSM searching using suffix trees
- PSSM searching using enhanced suffix arrays: The ESAsearch algorithm
- Further performance improvements via alphabet transformations
- A unifying view on SPsearch, LAsearch, and ESAsearch
- Finding an appropriate threshold for PSSM searching
- Probabilities and expectation values
- Calculation of exact PSSM score distributions
- Evaluation with dynamic programming
- Restricted probability computation
- Lazy evaluation of the permuted matrix
- Threshold independent PSSM matching: The k-best algorithm
- Implementation and computational results
- PoSSuM software distribution
- Discussion and concluding remarks
- PSSM family models for sequence family classification
- Increasing the expressiveness of PSSM-based database searches
- Using multiple ordered PSSMs for sequence classification
- PSSM family models
- Integration of PSSM family models into PoSSuMsearch
- Performance of PSSM family models for protein family classification
- Employed data set and evaluation scenarios
- Model construction and scoring
- Performance evaluation and results
- The significance of PSSM chain scores
- Accelerating HMM based database searches with PSSM family models
- Model specific trusted- and noise cutoffs
- PSfamSearch: Search space reduction with PSSM family models
- Evaluation and computational results
- Cutoff calibration strategies
- Discussion and concluding remarks on performed experiments
- Genlight - a system for interactive, high-throughput, differential genome analysis
- Motivation
- Requirement definitions and design goals
- System architecture and implementation
- Concepts and functionality
- The set oriented concept
- Operations on Seq-sets and Hit-sets
- Integrated sequence analysis methods
- Integrated protein domain and family databases
- Supported protein classification schemes
- Gene ontologies: a unifying vocabulary for cross database queries
- User defined sequence databases
- Asynchronous distributed execution of sequence analysis tasks
- Database schema
- The internal sequence identifier concept
- The handiness of the set oriented concept
- More complex queries using computed sequence attributes
- Genlight as a data warehouse
- The Genlight user interface
- Genlight case studies
- Detection and analysis of the Smh gene family in maize
- Analysis of Xenopus laevis expressed sequence tag clusters
- Identification of potential drug targets in Helicobacter pylori
- Concluding remarks on Genlight
- Conclusions and prospects
- Appendix
- The 20 letter amino acid alphabet
- PROSITE pattern entry
- PoSSuMsearch command line interface: Quick reference
- The PoSSuM software distribution
- File formats
- PoSSuMsearch
- PoSSuMdist
- PoSSuMfreqs
- PSSM converters
- Using the PoSSuM software distribution
- Messages and warnings
- Predefined Hit-set filters in the Genlight system
- Bibliography
