Algorithms on Strings, Trees, and Sequences
CSCI 381, Spring 2023
This course presents deterministic algorithms that operate on strings, trees, and sequences. String algorithms are used in many areas, including information retrieval, text editing, and computational linguistics. In this course we discuss the algorithms in the context of their application to biological sequence analysis.
Prerequisites: Data Structures, Algorithms.
Below you will find the course materials, arranged chronologically. For up-to-date course announcements and information, please refer to the Blackboard course page.
Calendar
Week 1 | Feb 1 | Introduction and Setup. Course mechanics. Lecture 0.
Strings of life I. Historical perspective on the definition of life. Pre-genomic period. Lecture 1.1. READ Modern definition of life. |
Feb 3 | Strings of life II.
Genomic period. Genome assembly problem. Lecture 1.2.
Post-genomic period. Some speculations. Lecture 1.3. READ Molecular Biology. TASK Reading Assignment. Paper to read: link. Due: Feb 8. TASK Quiz 1. Biosequences. |
|
Week 2 | Feb 8 | Pattern Matching I. Task of exact pattern matching.
Algorithm by Knuth, Morris and Pratt (KMP). Complexity. Lecture 2.1.
DEMO Shifting Heuristics: LINK. READ G*: Chapter 2.3. |
Feb 10 | Pattern Matching II.
Overlap function in linear time. Lecture 2.2.
READ G*: Chapter 2.3.2. TASK Assignment 1: implementing KMP. Due: Feb 15. |
|
Week 3 | Feb 15 | Pattern Matching review. Quiz and activities.
TASK Assignment 2: pattern search. Due: Feb 22. |
Feb 17 | Suffix trees I.
Introduction to Suffix Trees. Pattern search. Lecture 3.1.
READ G*: Chapter 5. |
|
Week 4 | Feb 23 | Suffix Trees II. Applications of Suffix Trees. Finding repeats.
Lecture 3.2.
The Longest Common Substring in linear time. Lecture 3.3.
READ G*: Chapters 7.1-7.6 and 7.11-7.12. TASK Assignment 3: suffix trees. Due: March 1. |
Feb 25 | Suffix Arrays.
Introduction to Suffix Arrays - a space-efficient alternative to suffix trees.
Lecture 4.1.
READ Chapter about Suffix Arrays from this book. |
|
Week 5 | Mar 1 | Suffix Array construction. Building suffix arrays in time O(n log n). Algorithm by Larsson and Sadakane.
Lecture 4.2.
READ the original paper. TASK Assignment 4: suffix arrays. Due: March 8. |
Mar 3 | FM-indexes.
Burrows-Wheeler transform. Compressed self-indexes. FM-index.
Lecture 4.3.
READ blog. |
|
Week 6 | Mar 8 | Dynamic Programming. Shortest paths in a grid graph. Recursion vs. Dynamic Programming.
Edit distance between two strings.
Lecture 5.1.
READ G*: Chapters 11.1 - 11.5. TASK Assignment 5: Dynamic Programming. Due: March 15. |
Mar 10 | String similarity. Edit graph.
Edit distance vs. Longest Common Subsequence. Global and Local alignment.
Lecture 5.2.
READ G*: Chapters 11.6 - 11.9. |
|
Week 7 | Mar 15 | Edit Distance in linear space. Algorithm by Hirschberg.
Lecture 5.3.
READ G*: Chapter 12.1. |
Mar 17 | Faster Edit Distance. Algorithm by Miller and Myers.
Lecture 5.4.
Original paper: link.
READ G*: Chapter 12.2. |
|
Week 8 | Mar 22,24 | Spring break: no classes |
Week 9 | Mar 29 | Applications of String Searching algorithms. Bio-sequence databases and their uses.
String searching algorithms: summary. Lecture 5.5.
READ G*: Chapter 15. TASK Quiz 8. Due: Mar 31. TASK Assignment 6: Miller-Myers in linear space. Due: April 12. |
Mar 31 | Multiple Sequence Alignment.
Motivation for comparing multiple strings. Molecular evolution.
Discovering common biological functions. Multiple sequence alignment problem. Dynamic programming solution. Intractability.
Approximation algorithm: SP-star.
Lecture 6.1.
READ G*: Chapter 14, D*: Chapters 6.1 - 6.4. TASK Quiz 9. Due: Apr 4. |
|
Week 10 | Apr 5 | Parsimony and perfect phylogeny. Change through evolution. Phylogenetic trees.
Parsimony principle. Algorithm for building perfect phylogenies (Gusfield Chapter 17.3).
Lecture 6.2.
READ G*: Chapter 17. TASK Quiz 10. Due: Apr 6. |
Apr 7 | Character-based Phylogenies.
Parsimony of mutational events. Small and large parsimony problems. The Fitch algorithm for the Small Parsimony Problem.
Optimization for the Large Parsimony Problem: branch-and-bound.
Lecture 6.3.
READ D*: Chapter 7. |
|
Week 11 | Apr 12 | Distance-based Phylogenies. Hierarchical clustering. UPGMA.
Additivity. Ultrametric trees. Molecular clock.
Lecture 6.4.
READ D*: Chapter 7. TASK Assignment 7: Phylogenetic trees. Due: Apr 21. |
Apr 14 | Statistics: primers.
Conditional probabilities. Bayesian reasoning.
Lecture 7.1.
Markov models. Markov chains. Lecture 7.2. DEMO Markov models and equilibrium: markov.py. DEMO Casino sequences: casino.py. |
|
Week 12 | Apr 19 | Hidden Markov Models (HMM). Honest and dishonest casino. Bayes method for discrimination between two model states.
Occasionally dishonest casino. Viterbi algorithm for computing most probable path through states.
HMM parameter estimation.
Lecture 7.3.
READ D*: Chapter 3. TASK Quiz 11: HMM. Due: Apr 20. |
Apr 21 | Applications of HMM.
Bio-sequence applications: gene hunting (CpG islands) and profile alignments.
Lecture 7.4.
READ D*: Chapter 3. |
|
Week 13 | Apr 26 | Artificial Neural Networks (ANN). ANN primer. Multi-layer Perceptron.
Importance of non-linearity.
Lecture 8.1.
DEMO Perceptron, Multi-layer Perceptron, and a sample application: Link. READ The 100 Page ML book: Chapter 6. Grokking Deep Learning: Chapters 1-6. |
Apr 28 | Applications of Neural Networks to sequential data.
Main ideas behind Convolutional Neural Networks (CNNs). Sample applications: image recognition and sequence classification.
Basics of Recurrent Neural Networks (RNNs).
Lecture 8.2.
DEMO Sample applications of CNNs: Link. READ The 100 Page ML book: Chapter 6.2.1. Recurrent neural networks: Chapter 6.2.2. Explanation of recurrent neural networks on a real-life example plus demo: Link. |
|
G* refers to the Gusfield book.
D* refers to the Durbin book. |