
Dynamic Programming

COS 320 - Algorithm Design

Dynamic Programming

Introduction

algorithmic paradigms

Greedy
Build up a solution incrementally, myopically optimizing some local criterion.
Divide-and-Conquer
Break up a problem into independent subproblems, solve each subproblem, and combine solutions to the subproblems to form a solution to the original problem.
Dynamic Programming
Break up problem into a series of overlapping subproblems, and build up solutions to larger and larger subproblems.

Note

"Dynamic Programming" is a fancy name for caching away intermediate results in a table for later reuse.

dynamic programming history

Bellman: Pioneered the systematic study of dynamic programming in the 1950s

Etymology:

dynamic programming applications

Areas:

dynamic programming applications

Some famous dynamic programming algorithms:

Dynamic Programming

Fibonacci Sequence

Fibonacci sequence

The Fibonacci Sequence is defined recursively as

\[ \mathrm{Fib}(n) = \begin{cases} 0 & n = 0 \\ 1 & n = 1 \\ \mathrm{Fib}(n - 1) + \mathrm{Fib}(n - 2) & n \geq 2 \end{cases} \]

The sequence starts with

\[ 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, \ldots \]

Notes:

How to implement this?

Fibonacci sequence

A brute-force, naïve implementation might look like

Fib(n):
    If n == 0
        Return 0
    Else If n == 1
        Return 1
    Else
        Return Fib(n - 1) + Fib(n - 2)

What is the runtime?

Can we improve the runtime?

Fibonacci sequence

\[\mathrm{Fib}(n) = \mathrm{Fib}(n-1) + \mathrm{Fib}(n-2)\]

If we expand this out... \[\begin{eqnarray} \mathrm{Fib}(n) & = & \mathrm{Fib}(n-1) + \mathrm{Fib}(n-2) \\ & = & \overbrace{\mathrm{Fib}(n-2) + \mathrm{Fib}(n-3)}^{\mathrm{Fib}(n-1)} + \overbrace{\mathrm{Fib}(n-3) + \mathrm{Fib}(n-4)}^{\mathrm{Fib}(n-2)} \\ & = & \overbrace{\mathrm{Fib}(n-3) + \mathrm{Fib}(n-4)}^{\mathrm{Fib}(n-2)} + \overbrace{\mathrm{Fib}(n-4) + \mathrm{Fib}(n-5)}^{\mathrm{Fib}(n-3)} \\ & & + \overbrace{\mathrm{Fib}(n-4) + \mathrm{Fib}(n-5)}^{\mathrm{Fib}(n-3)} + \overbrace{\mathrm{Fib}(n-5) + \mathrm{Fib}(n-6)}^{\mathrm{Fib}(n-4)} \\ \end{eqnarray}\]

Observation: Each time we expand a term using \(\mathrm{Fib}(m) = \mathrm{Fib}(m-1) + \mathrm{Fib}(m-2)\), the number of terms doubles. However, we are recomputing the same terms multiple times! (ex: \(\mathrm{Fib}(n-4)\) appears three times on the last line)

Fibonacci sequence

If we compute the value of \(\mathrm{Fib}(m)\) for some \(m\) and then remember it, then we can simply return the previously computed value the next time we are asked for \(\mathrm{Fib}(m)\) again.

There are two ways to do this: top-down and bottom-up.

Note

This is similar to the top-down and bottom-up versions of Merge Sort!

Fibonacci sequence

A top-down method still uses recursion but remembers answers to \(\mathrm{Fib}(m)\).

Top-Down-Fib(n):
    For j = 1 to n
        M[j] <- empty
    M[0] <- 0
    M[1] <- 1
    Return Compute-Top-Down-Fib(n)

Compute-Top-Down-Fib(n):
    If M[n] is empty
        M[n] = Compute-Top-Down-Fib(n - 1) + Compute-Top-Down-Fib(n - 2)
    Return M[n]
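In Python, the same top-down scheme can be sketched as follows (function names are mine; a dict plays the role of the table M):

```python
def fib_top_down(n):
    """Memoized (top-down) Fibonacci: M caches m -> Fib(m)."""
    M = {0: 0, 1: 1}

    def compute(m):
        # Recurse only if the value has not been cached yet.
        if m not in M:
            M[m] = compute(m - 1) + compute(m - 2)
        return M[m]

    return compute(n)
```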

Fibonacci sequence

A bottom-up method remembers only the two previous answers, \(\mathrm{Fib}(m-2)\) and \(\mathrm{Fib}(m-1)\), to compute the next answer \(\mathrm{Fib}(m)\).

Bottom-Up-Fib(n):
    If n == 0
        Return 0
    a <- 0
    b <- 1
    For j = 2 to n
        c <- a + b
        a <- b
        b <- c
    Return b

This method iteratively builds up to the final solution and does not use recursion.
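A Python sketch of the bottom-up method, with an explicit base case for \(n = 0\) (function name is mine):

```python
def fib_bottom_up(n):
    """Iterative Fibonacci keeping only the two previous values."""
    if n == 0:
        return 0
    a, b = 0, 1  # Fib(0), Fib(1)
    for _ in range(2, n + 1):
        a, b = b, a + b  # slide forward: (Fib(j-1), Fib(j))
    return b
```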

Fibonacci sequence

Both methods (top-down and bottom-up) compute the final solution, but they do so in \(O(n)\) time instead of \(O(2^n)\).

Dynamic Programming

weighted interval scheduling

weighted interval scheduling

Weighted interval scheduling problem

           a            :   :   :   :   :   :   
:        b      :   :   :   :   :   :   :   :   
:   :   :      c    :   :   :   :   :   :   :   
:   :   :            d          :   :   :   :   
:   :   :   :        e      :   :   :   :   :   
:   :   :   :   :          f        :   :   :   
:   :   :   :   :   :          g        :   :   
:   :   :   :   :   :   :   :        h      :   
0   1   2   3   4   5   6   7   8   9   10  11  

earliest-finish-time first algorithm

Earliest finish-time first:

Recall: Greedy algorithm is correct if all weights are 1

Observation: Greedy algorithm fails spectacularly for weighted version

                          b (999)                           :     :     
      a (1)       :     :     :     :     :     :     :     :     :     
:     :     :     :     :     :     :     :           h (1)       :     
0     1     2     3     4     5     6     7     8     9     10    11    

weighted interval scheduling

Notation: Label jobs by finishing time: \(f_1 \leq f_2 \leq \ldots \leq f_n\)

Def: \(p(j)\) is largest index \(i<j\) such that job \(i\) is compatible with \(j\)

Ex: \(p(8) = 5, p(7) = 3, p(2) = 0\)

:        1      :   :   :   :   :   :   :   :   
:   :   :      2    :   :   :   :   :   :   :   
           3            :   :   :   :   :   :   
:   :   :   :        4      :   :   :   :   :   
:   :   :          5        :   :   :   :   :   
:   :   :   :   :          6        :   :   :   
:   :   :   :   :   :          7        :   :   
:   :   :   :   :   :   :   :        8      :   
0   1   2   3   4   5   6   7   8   9   10  11  

dynamic programming: binary choice

Notation: \(\mathrm{OPT}(j)\) is value of optimal solution to the problem consisting of job requests \(1, 2, \ldots, j\)

Case 1\(^*\): \(\mathrm{OPT}\) selects job \(j\)

Case 2\(^*\): \(\mathrm{OPT}\) does not select job \(j\)

\[ \mathrm{OPT}(j) = \begin{cases} 0 & \text{if } j = 0 \\ \max \{ v_j + \mathrm{OPT}(p(j)), \mathrm{OPT}(j-1) \} & \text{otherwise} \end{cases} \]

\(^*\)optimal substructure property (proof via exchange argument)

weighted interval scheduling: brute force

\[ \mathrm{OPT}(j) = \begin{cases} 0 & \text{if } j = 0 \\ \max \left\{\begin{array}{l} v_j + \mathrm{OPT}(p(j)), \\ \mathrm{OPT}(j-1) \end{array}\right\} & \text{otherwise} \end{cases} \]


Brute-Force(n, s1, ..., sn, f1, ..., fn, v1, ..., vn):
    Sort jobs by finish time so that f[1] <= f[2] <= ... <= f[n]
    Compute p[1], p[2], ..., p[n]
    Return Compute-Opt(n)

Compute-Opt(j):
    If j == 0
        Return 0
    Else
        Return max{
            v[j] + Compute-Opt(p[j]),
            Compute-Opt(j-1)
        }

weighted interval scheduling: brute force

Observation: Recursive algorithm fails spectacularly because of redundant subproblems ⇒ exponential algorithm

Ex: Number of recursive calls for family of "layered" instances grows like Fibonacci sequence

     1      :   :   :   :   :   :   :   :   :   
:   :        2      :   :   :   :   :   :   :   
:   :   :   :        3      :   :   :   :   :   
:   :   :   :   :   :        4      :   :   :   
:   :   :   :   :   :   :   :        5      :   
0   1   2   3   4   5   6   7   8   9   10  11  

\[p(1)=0, p(j)=j-2\] \[\mathrm{OPT}(j) = \max \left\{ \begin{array}{l} v_j + \mathrm{OPT}(p(j)), \\ \mathrm{OPT}(j-1) \end{array} \right\} \]

weighted interval scheduling: memoization

Memoization: Cache results of each subproblem; lookup as needed

Top-Down(n, s1, ..., sn, f1, ..., fn, v1, ..., vn):
    Sort jobs by finish time so that f[1] <= f[2] <= ... <= f[n]
    Compute p[1], p[2], ..., p[n]
    For j = 1 to n
        M[j] <- empty
    M[0] <- 0
    Return M-Compute-Opt(n)

M-Compute-Opt(j):
    If M[j] is empty
        M[j] <- max{
            v[j] + M-Compute-Opt(p[j]),
            M-Compute-Opt(j-1)
        }
    Return M[j]

weighted interval scheduling: running time

Claim: Memoized version of algorithm takes \(O(n \log n)\) time

Remark: \(O(n)\) if jobs are presorted by finish times

weighted interval scheduling: finding a solution

Q: The Dynamic Programming (DP) algorithm computes the optimal value. How do we find the solution itself?

A: Make a second pass

Find-Solution(j)
    If j = 0
        Return {}
    Else If v[j] + M[p[j]] > M[j-1]
        Return Union({ j }, Find-Solution(p[j]))
    Else
        Return Find-Solution(j-1)

Analysis: number of recursive calls \(\leq n\) ⇒ \(O(n)\)

weighted interval scheduling: bottom-up

Bottom-up dynamic programming: Unwind recursion

Bottom-Up(n, s1, ..., sn, f1, ..., fn, v1, ..., vn):
    Sort jobs by finish time so that f[1]<=f[2]<=...<=f[n]
    Compute p[1], p[2], ..., p[n]
    M[0] <- 0
    For j = 1 to n
        M[j] <- max { v[j] + M[p[j]], M[j-1] }
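Putting the pieces together, one possible Python sketch of the bottom-up algorithm, with \(p(j)\) computed by binary search and the second-pass traceback included (function and variable names are mine):

```python
from bisect import bisect_right

def weighted_interval_scheduling(jobs):
    """jobs: list of (start, finish, value) tuples.
    Returns (optimal total value, list of chosen jobs)."""
    jobs = sorted(jobs, key=lambda job: job[1])  # sort by finish time
    n = len(jobs)
    finish = [f for _, f, _ in jobs]
    # p[j] (0-indexed): largest 1-indexed i with finish[i] <= start of job j+1,
    # i.e. the number of jobs compatible with it.
    p = [bisect_right(finish, jobs[j][0]) for j in range(n)]

    M = [0] * (n + 1)
    for j in range(1, n + 1):
        _, _, v = jobs[j - 1]
        M[j] = max(v + M[p[j - 1]], M[j - 1])

    # Second pass (Find-Solution), written iteratively.
    chosen, j = [], n
    while j > 0:
        _, _, v = jobs[j - 1]
        if v + M[p[j - 1]] > M[j - 1]:
            chosen.append(jobs[j - 1])
            j = p[j - 1]
        else:
            j -= 1
    return M[n], chosen[::-1]
```

On the greedy counterexample from earlier (a short job of value 1, a long job of value 999, a late job of value 1), it correctly picks only the long job.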

group: find the weighted interval schedule

Use the dynamic programming solution to solve the weighted interval scheduling instance below.

Bottom-Up(n, s1, ..., sn, f1, ..., fn, v1, ..., vn):
    Sort jobs by finish time so that f[1]<=f[2]<=...<=f[n]
    Compute p[1], p[2], ..., p[n]
    M[0] <- 0
    For j = 1 to n
        M[j] <- max { v[j] + M[p[j]], M[j-1] }
:       :       :        a (29) :       :       :       :       :       :       
                     b (29)                     :       :       :       :       
:                c (28)         :       :       :       :       :       :       
:       :        d (68) :       :       :       :       :       :       :       
         e (76)         :       :       :       :       :       :       :       
:       :       :       :       :       :            f (33)     :       :       
:       :       :       :       :                    g (48)             :       
:       :       :       :            h (26)     :       :       :       :       
:       :                    i (3)              :       :       :       :       
:       :       :                        j (71)                 :       :       
1       2       3       4       5       6       7       8       9       10      

Dynamic Programming

segmented least squares

least squares

Least squares: Foundational problem in statistics

\[ \mathrm{SSE} = \sum_{i=1}^n (y_i - a x_i - b)^2 \]

Solution: Calculus ⇒ min error is achieved when

\[ a = \frac{n \sum_i x_i y_i - (\sum_i x_i) (\sum_i y_i)}{n \sum_i x_i^2 - (\sum_i x_i)^2}, b = \frac{\sum_i y_i - a \sum_i x_i}{n} \]

segmented least squares

Segmented least squares:

Q: What is a reasonable choice for \(f(x)\) to balance accuracy (goodness of fit) and parsimony (number of lines)?

segmented least squares

Given \(n\) points in the plane: \((x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\) with \(x_1 < x_2 < \ldots < x_n\) and a constant \(c > 0\), find a sequence of lines that minimizes \(f(x) = E + c L\):

dynamic programming: multiway choice

Notation

To compute \(\mathrm{OPT}(j)\):

segmented least squares algorithm

Segmented-Least-Squares(n, p1, ..., pn, c):
    For j = 1 to n
        For i = 1 to j
            Compute the least-squares error e(i,j) for the segment pi--pj

    M[0] <- 0
    For j = 1 to n
        Find i in [1,j] that minimizes e(i,j) + c + M[i-1]
        M[j] <- e(i,j) + c + M[i-1]

    Return M[n]
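A Python sketch of the algorithm, using the closed-form least-squares formulas above to compute each \(e(i,j)\) on the fly (names are mine; each error is computed in \(O(n)\) time, so the total is \(O(n^3)\) as in the pseudocode):

```python
def segmented_least_squares(points, c):
    """Minimum of E + cL over all segmentations.
    points: list of (x, y) pairs sorted by x; c > 0: per-segment penalty."""
    n = len(points)

    def sse(i, j):
        """Squared error of the best-fit line through points[i..j] (inclusive)."""
        pts = points[i:j + 1]
        m = len(pts)
        sx = sum(x for x, _ in pts)
        sy = sum(y for _, y in pts)
        sxx = sum(x * x for x, _ in pts)
        sxy = sum(x * y for x, y in pts)
        denom = m * sxx - sx * sx
        a = (m * sxy - sx * sy) / denom if denom else 0.0  # single point: flat line
        b = (sy - a * sx) / m
        return sum((y - a * x - b) ** 2 for x, y in pts)

    M = [0.0] * (n + 1)
    for j in range(1, n + 1):
        M[j] = min(sse(i - 1, j - 1) + c + M[i - 1] for i in range(1, j + 1))
    return M[n]
```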

segmented least squares analysis

Theorem: The dynamic programming algorithm solves the segmented least squares problem in \(O(n^3)\) time and \(O(n^2)\) space

\[ a = \frac{n \sum_i x_i y_i - (\sum_i x_i) (\sum_i y_i)}{n \sum_i x_i^2 - (\sum_i x_i)^2}, b = \frac{\sum_i y_i - a \sum_i x_i}{n} \]

Pf:

segmented least squares analysis

Theorem: The dynamic programming algorithm solves the segmented least squares problem in \(O(n^3)\) time and \(O(n^2)\) space

\[ a = \frac{n \sum_i x_i y_i - (\sum_i x_i) (\sum_i y_i)}{n \sum_i x_i^2 - (\sum_i x_i)^2}, b = \frac{\sum_i y_i - a \sum_i x_i}{n} \]

Remark: Can be improved to \(O(n^2)\) time and \(O(n)\) space

Dynamic Programming

knapsack problem

knapsack problem

Knapsack problem

Example: Suppose \(W = 11\) and the weights and values are given in table at right.

Ex: \(\{1,2,5\}\) has value \(v_1 + v_2 + v_5 = 35\).

Ex: \(\{3,4\}\) has value \(v_3 + v_4 = 40\).

Ex: \(\{3,5\}\) has value 46, but exceeds weight limit with \(w_3 + w_5 = 12 > 11\)

\(i\) \(v_i\) \(w_i\)
1 1 1
2 6 2
3 18 5
4 22 6
5 28 7

knapsack problem

Greedy by value: Repeatedly add item with maximum \(v_i\)

Greedy by weight: Repeatedly add item with minimum \(w_i\)

Greedy by ratio: Repeatedly add item with maximum ratio \(v_i / w_i\)


Observation: None of greedy algorithms is optimal

dynamic programming: false start

Def: \(\mathrm{OPT}(i)\) is max profit subset of items \(1, \ldots, i\)

Case 1: \(\mathrm{OPT}(i)\) does not select item \(i\)

Case 2: \(\mathrm{OPT}(i)\) selects item \(i\)

Conclusion: Need more subproblems!

dynamic programming: adding a new variable

Def: \(\mathrm{OPT}(i, w)\) is max profit subset of items \(1, \ldots, i\) with weight limit \(w\)

Case 1: \(\mathrm{OPT}(i,w)\) does not select item \(i\) (possibly because \(w_i > w\))

Case 2: \(\mathrm{OPT}(i,w)\) selects item \(i\)

\[ \mathrm{OPT}(i,w) = \begin{cases} 0 & \text{if } i=0 \\ \mathrm{OPT}(i-1, w) & \text{if } w_i > w \\ \max \left\{\begin{array}{l} \mathrm{OPT}(i-1,w), \\ v_i + \mathrm{OPT}(i-1,w-w_i) \end{array}\right\} & \text{otherwise} \end{cases} \]

knapsack problem: bottom-up

Knapsack(n, W, w1, ..., wn, v1, ..., vn):
    For w = 0 to W
        M[0, w] <- 0

    For i = 1 to n
        For w = 1 to W
            If wi > w
                M[i, w] <- M[i-1, w]
            Else
                M[i, w] <- Max { M[i-1, w], vi+M[i-1, w-wi] }

    Return M[n, W]

\[ \mathrm{OPT}(i,w) = \begin{cases} 0 & \text{if } i=0 \\ \mathrm{OPT}(i-1, w) & \text{if } w_i > w \\ \max \left\{\begin{array}{l} \mathrm{OPT}(i-1,w), \\ v_i + \mathrm{OPT}(i-1,w-w_i) \end{array}\right\} & \text{otherwise} \end{cases} \]
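The recurrence translates line-for-line into Python (function name is mine):

```python
def knapsack(weights, values, W):
    """0/1 knapsack: max total value of a subset with total weight <= W.
    weights must be integers between 1 and W."""
    n = len(weights)
    # M[i][w] = OPT(i, w)
    M = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        wi, vi = weights[i - 1], values[i - 1]
        for w in range(W + 1):
            if wi > w:
                M[i][w] = M[i - 1][w]                          # item i cannot fit
            else:
                M[i][w] = max(M[i - 1][w],                     # skip item i
                              vi + M[i - 1][w - wi])           # take item i
    return M[n][W]
```

On the example instance above (\(W = 11\)), it returns 40, matching the optimal subset \(\{3,4\}\).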

group: knapsack algorithm

\(i\) \(v_i\) \(w_i\)
1 1 1
2 6 2
3 18 5
4 22 6
5 28 7

\[ \mathrm{OPT}(i,w) = \begin{cases} 0 & \text{if } i=0 \\ \mathrm{OPT}(i-1, w) & \text{if } w_i > w \\ \max \left\{ \begin{array}{l} \mathrm{OPT}(i-1,w), \\ v_i + \mathrm{OPT}(i-1,w-w_i) \end{array} \right\} & \text{otherwise} \end{cases} \]

\(\mathrm{OPT}(i,w) =\) max profit subset of
items \(1, \ldots, i\) with weight limit \(w\)

\(i\) subset 0 1 2 3 4 5 6 7 8 9 10 11
0 \(\{ \}\) 0 0 0 0 0 0 0 0 0 0 0 0
1 \(\{ 1 \}\) 0
2 \(\{ 1,2 \}\) 0
3 \(\{ 1,2,3 \}\) 0
4 \(\{ 1,2,3,4 \}\) 0
5 \(\{ 1,2,3,4,5 \}\) 0

knapsack problem: running time

Running time: There exists an algorithm to solve the knapsack problem with \(n\) items and maximum weight \(W\) in \(\Theta(nW)\) time (weights are integers between \(1\) and \(W\))

Dynamic Programming

RNA secondary structure

RNA secondary structure

RNA: String \(B = b_1 b_2 \ldots b_n\) over alphabet \(\{ A, C, G, U \}\)

Secondary structure: RNA is single-stranded so it tends to loop back and form base pairs with itself. This structure is essential for understanding the behavior of the molecule.

RNA secondary structure

Secondary Structure: A set of pairs \(S = \{(b_i, b_j)\}\) that satisfy:

  1. Watson-Crick: \(S\) is a matching and each pair in \(S\) is a Watson-Crick complement: \(A{-}U\), \(U{-}A\), \(C{-}G\), \(G{-}C\)
  2. ...




\(S\) is not a secondary structure, because \(C{-}A\) is not a valid Watson-Crick Pair

RNA secondary structure

Secondary Structure: A set of pairs \(S = \{(b_i, b_j)\}\) that satisfy:

  1. Watson-Crick: \(S\) is a matching and each pair in \(S\) is a Watson-Crick complement: \(A{-}U\), \(U{-}A\), \(C{-}G\), \(G{-}C\)
  2. No sharp turns: The ends of each pair are separated by at least 4 intervening bases. If \((b_i,b_j) \in S\), then \(i < j-4\)
  3. ...

\(S\) is not a secondary structure, because \(b_3\) and \(b_7\) are separated by fewer than 4 intervening bases

RNA secondary structure

Secondary Structure: A set of pairs \(S = \{(b_i, b_j)\}\) that satisfy:

  1. Watson-Crick: \(S\) is a matching and each pair in \(S\) is a Watson-Crick complement: \(A{-}U\), \(U{-}A\), \(C{-}G\), \(G{-}C\)
  2. No sharp turns: The ends of each pair are separated by at least 4 intervening bases. If \((b_i,b_j) \in S\), then \(i < j-4\)
  3. Non-crossing: If \((b_i,b_j)\) and \((b_k,b_l)\) are two pairs in \(S\), then we cannot have \(i<k<j<l\)

\(S\) is not a secondary structure, because \(G{-}C\) and \(U{-}A\) cross

RNA secondary structure

Secondary Structure: A set of pairs \(S = \{(b_i, b_j)\}\) that satisfy:

  1. Watson-Crick: \(S\) is a matching and each pair in \(S\) is a Watson-Crick complement: \(A{-}U\), \(U{-}A\), \(C{-}G\), \(G{-}C\)
  2. No sharp turns: The ends of each pair are separated by at least 4 intervening bases. If \((b_i,b_j) \in S\), then \(i < j-4\)
  3. Non-crossing: If \((b_i,b_j)\) and \((b_k,b_l)\) are two pairs in \(S\), then we cannot have \(i<k<j<l\)

\(S\) is a secondary structure
(with 3 base pairs)

RNA secondary structure

Secondary Structure: A set of pairs \(S = \{(b_i, b_j)\}\) that satisfy:

  1. Watson-Crick: \(S\) is a matching and each pair in \(S\) is a Watson-Crick complement: \(A{-}U\), \(U{-}A\), \(C{-}G\), \(G{-}C\)
  2. No sharp turns: The ends of each pair are separated by at least 4 intervening bases. If \((b_i,b_j) \in S\), then \(i < j-4\)
  3. Non-crossing: If \((b_i,b_j)\) and \((b_k,b_l)\) are two pairs in \(S\), then we cannot have \(i<k<j<l\)

Free-energy hypothesis: RNA molecule will form the secondary structure with the minimum total free energy (approximate by number of base pairs; more base pairs → lower free energy)

Goal: Given an RNA molecule \(B=b_1b_2 \ldots b_n\), find a secondary structure \(S\) that maximizes the number of base pairs.

quiz: dynamic programming

Is the following a secondary structure?


  1. Yes
  2. No, because it violates
    Watson-Crick condition
  3. No, because it violates
    no-sharp-turns condition
  4. No, because it violates
    no-crossing condition

quiz: dynamic programming

Which subproblems?


  1. \(OPT(j)\) = max number of base pairs in secondary structure of the substring \(b_1 b_2 \ldots b_j\)
  2. \(OPT(j)\) = max number of base pairs in secondary structure of the substring \(b_j b_{j+1} \ldots b_n\)
  3. Either A or B
  4. Neither A nor B

RNA secondary structure: subproblems

First attempt: \(OPT(j)\) = maximum number of base pairs in a secondary structure of the substring \(b_1 b_2 \ldots b_j\)

Goal: \(OPT(n)\)

Choice: Match bases \(b_t\) and \(b_j\)

Difficulty: Results in two subproblems (but one is of the wrong form)

Dynamic programming over intervals

Def: \(OPT(i,j)\) = maximum number of base pairs in a secondary structure of the substring \(b_i b_{i+1} \ldots b_j\)

quiz: dynamic programming

In which order to compute \(OPT(i,j)\)?


  1. Increasing \(i\), then increasing \(j\)
  2. Increasing \(j\), then increasing \(i\)
  3. Either A or B
  4. Neither A nor B

bottom-up dynamic programming over intervals

Q: In which order to solve the subproblems?

A: Do shortest intervals first—increasing order of \(|j-i|\)

RNA-Secondary-Structure(n, b1, ..., bn):
  For k = 5 to n - 1
    For i = 1 to n - k
      j <- i + k
      Compute M[i, j] using formula
      // M[i,j] = 0 if i >= j-4; otherwise the max of:
      //   M[i,j-1]                           (bj not paired)
      //   max over t of 1 + M[i,t-1] + M[t+1,j-1]
      //     (bt-bj a Watson-Crick pair, i <= t < j-4)
      // (all needed vals are already computed)
  Return M[1, n]
(Table of \(M[i,j]\): entries are filled in order of increasing interval length \(k = j - i\); all entries with \(i \geq j - 4\) are 0, so everything needed for \(M[i,j]\) is already available.)
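A Python sketch of the interval DP (function name is mine; strings are 0-indexed in the code, 1-indexed in the comments):

```python
def max_base_pairs(b):
    """Max number of base pairs in a secondary structure of RNA string b."""
    n = len(b)
    wc = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    # M[i][j] = OPT(i, j); entries with i >= j - 4 stay 0.
    M = [[0] * (n + 1) for _ in range(n + 2)]
    for k in range(5, n):                  # interval length j - i
        for i in range(1, n - k + 1):
            j = i + k
            best = M[i][j - 1]             # case: b_j not paired
            for t in range(i, j - 4):      # case: b_t pairs with b_j, i <= t < j-4
                if (b[t - 1], b[j - 1]) in wc:
                    best = max(best, 1 + M[i][t - 1] + M[t + 1][j - 1])
            M[i][j] = best
    return M[1][n] if n else 0
```

For example, "AAAAAAUUUU" admits the nested pairs \((1,10), (2,9), (3,8)\), and the no-sharp-turns condition rules out a fourth pair.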

Theorem: The DP algorithm solves the RNA secondary structure problem in \(O(n^3)\) time and \(O(n^2)\) space.

dynamic programming summary

Outline

Techniques

Top-down vs Bottom-up dynamic programming: Opinions differ

Dynamic Programming

Dynamic Programming Example Problems

group: maximum subarray problem

Goal: Given an array of \(n\) integers (positive or negative), find a contiguous subarray whose sum is maximum. \[12 \quad 5 \quad -1 \quad 31 \quad -61 \quad 59 \quad 26 \quad -53 \quad 58 \quad 97 \quad -93 \quad -23 \quad 84 \quad -15 \quad 6\]









Applications: Computer vision, data mining, genomic sequence analysis, technical job interviews, ...
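One standard dynamic-programming solution (Kadane's algorithm) tracks, for each position, the best subarray ending there; a Python sketch (function name is mine):

```python
def max_subarray(a):
    """Kadane's algorithm: max sum of a (nonempty) contiguous subarray."""
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)  # best subarray ending here: extend or start fresh
        best = max(best, cur)
    return best
```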

group: maximum rectangle problem

Goal: Given an \(n\)-by-\(n\) matrix, find a rectangle whose sum is maximum.

\[ A = \left[\begin{matrix} -2 & 5 & 0 & -5 & -2 & 2 & -3 \\ 4 & -3 & -1 & 3 & 2 & 1 & -1 \\ -5 & 6 & 3 & -5 & -1 & -4 & -2 \\ -1 & -1 & 3 & -1 & 4 & 1 & 1 \\ 3 & -3 & 2 & 0 & 3 & -3 & -2 \\ -2 & 1 & -2 & 1 & 1 & 3 & -1 \\ 2 & -4 & 0 & 1 & 0 & -3 & -1 \end{matrix}\right] \]

Applications: Computer vision, data mining, genomic sequence analysis, technical job interviews, ...
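A common way to reduce this to the 1-D problem: fix a pair of rows, collapse each column between them to a single sum, and run Kadane's algorithm on the collapsed row. A Python sketch (function name is mine; \(O(n^3)\) for an \(n\)-by-\(n\) matrix):

```python
def max_rectangle(A):
    """Max-sum rectangle in a matrix via row-pair compression + Kadane."""
    rows, cols = len(A), len(A[0])
    best = A[0][0]
    for top in range(rows):
        col = [0] * cols
        for bottom in range(top, rows):
            for c in range(cols):
                col[c] += A[bottom][c]   # column sums for rows top..bottom
            run = cur = col[0]           # 1-D Kadane on the collapsed row
            for x in col[1:]:
                run = max(x, run + x)
                cur = max(cur, run)
            best = max(best, cur)
    return best
```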

group: coin changing

Problem: Given \(n\) coin denominations \(\{c_1, c_2, \ldots, c_n\}\) and a target value \(v\), find the fewest coins needed to make change for \(v\) (or report impossible).

Recall: Greedy cashier's algorithm is optimal for U.S. coin denominations, but not for arbitrary coin denominations.

Ex: \(\{1,10,21,34,70,100,350,1295,1500\}\)

Optimal: \(140 = 70 + 70\)
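A bottom-up DP sketch in Python, building the fewest-coins table from 0 up to \(v\) (function name is mine):

```python
def min_coins(coins, v):
    """Fewest coins summing to v, or None if v cannot be made."""
    INF = float("inf")
    dp = [0] + [INF] * v   # dp[a] = fewest coins making amount a
    for amount in range(1, v + 1):
        for c in coins:
            if c <= amount and dp[amount - c] + 1 < dp[amount]:
                dp[amount] = dp[amount - c] + 1
    return dp[v] if dp[v] != INF else None
```

On the example denominations it finds \(140 = 70 + 70\), where the greedy cashier's algorithm would start with 100 and use far more coins.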

Dynamic Programming

Sequence Alignment

String similarity

Q: How similar are two strings?

Ex: ocurrance and occurrence

o c u r r a n c e

o c c u r r e n c e

Both words have letters in common, and letters that are different.

But there are different ways to "match" letters together.

6 mismatches, 1 gap
1 mismatch, 1 gap
0 mismatches, 3 gaps

Which is the "best"?

edit distance

Edit distance [Levenshtein 1966, Needleman-Wunsch 1970]

1 gap, 2 mismatches

\[ \text{cost} = \delta + \alpha_{\text{CG}} + \alpha_{\text{TA}} \]

(assuming: \(\alpha_\text{AA} = \alpha_\text{CC} = \alpha_\text{GG} = \alpha_\text{TT} = 0\))

Applications: Bioinformatics, spell correction, machine translation, speech recognition, information extraction, ...

Spokesperson confirms     senior government adviser was found
Spokesperson said     the senior            adviser was found

blosum matrix for proteins

The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.

[Wikipedia]

quiz: dynamic programming

What is the edit distance between these two strings?

P A L E T T E     P A L A T E

Assume gap penalty \(\delta = 2\) and mismatch penalty \(\alpha = 1\)

  1. 1
  2. 2
  3. 3
  4. 4

sequence alignment

Goal: given two strings \(x_1 x_2 \ldots x_m\) and \(y_1 y_2 \ldots y_n\), find a min-cost alignment

Def: An alignment \(M\) is a set of ordered pairs \((x_i,y_j)\) such that each character appears in at most one pair and no crossings. The pairs \((x_i,y_j)\) and \((x_{i'}, y_{j'})\) cross if \(i < i'\) but \(j > j'\).

Def: The cost of an alignment \(M\) is:

\[ \mathrm{cost}(M) = \underbrace{\sum_{(x_i,y_j) \in M} \alpha_{x_iy_j}}_{\text{mismatch}} + \underbrace{ \sum_{i:x_i \text{unmatched}} \delta + \sum_{j:y_j \text{unmatched}} \delta}_{\text{gap}} \]

sequence alignment

Alignment of CTACCG and TACATG

\[ M = \{ (x_2,y_1), (x_3,y_2), (x_4,y_3), (x_5,y_4), (x_6,y_6) \} \]

\(x_1\) \(x_2\) \(x_3\) \(x_4\) \(x_5\) \(x_6\)
C T A C C G
T A C A T G
\(y_1\) \(y_2\) \(y_3\) \(y_4\) \(y_5\) \(y_6\)

sequence alignment: problem structure

Def: \(\mathrm{OPT}(i,j)\) is min cost of aligning prefix strings \(x_1 x_2 \ldots x_i\) and \(y_1 y_2 \ldots y_j\)

Goal: \(\mathrm{OPT}(m,n)\)

(optimal substructure property for each case; proof via exchange argument)

Sequence alignment: Bellman equation

\[ \mathrm{OPT}(i,j) = \begin{cases} j\delta & \text{if } i = 0 \\ i\delta & \text{if } j = 0 \\ \min \left\{\begin{array}{lll} \alpha_{x_iy_j} & + & \mathrm{OPT}(i-1,j-1) \\ \delta & + & \mathrm{OPT}(i-1, j) \\ \delta & + & \mathrm{OPT}(i, j-1) \end{array}\right. & \text{otherwise} \end{cases} \]

sequence alignment: bottom-up algorithm

\[ \mathrm{OPT}(i,j) = \begin{cases} j\delta & \text{if } i = 0 \\ i\delta & \text{if } j = 0 \\ \min \left\{\begin{array}{lll} \alpha_{x_iy_j} & + & \mathrm{OPT}(i-1,j-1) \\ \delta & + & \mathrm{OPT}(i-1, j) \\ \delta & + & \mathrm{OPT}(i, j-1) \end{array}\right. & \text{otherwise} \end{cases} \]

Sequence-Alignment(m, n, x1, ..., xm, y1, ..., yn, delta, alpha)
    For i = 0 to m
        M[i,0] <- i * delta
    For j = 0 to n
        M[0,j] <- j * delta
    For i = 1 to m
        For j = 1 to n
            M[i,j] <- min {
                alpha(xi, yj) + M[i-1,j-1], // already computed M[i-1,j-1]
                delta + M[i-1,j],           // already computed M[i-1,j]
                delta + M[i,j-1]            // already computed M[i,j-1]
            }
    Return M[m,n]
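The pseudocode translates directly into Python (function name is mine; by default, unit gap and mismatch penalties, i.e. standard edit distance):

```python
def edit_distance(x, y, delta=1, alpha=lambda a, b: 0 if a == b else 1):
    """Bottom-up sequence alignment.
    delta: gap penalty; alpha(a, b): mismatch penalty for aligning a with b."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta
    for j in range(n + 1):
        M[0][j] = j * delta
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(alpha(x[i - 1], y[j - 1]) + M[i - 1][j - 1],  # align x_i with y_j
                          delta + M[i - 1][j],                          # x_i unmatched
                          delta + M[i][j - 1])                          # y_j unmatched
    return M[m][n]
```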

sequence alignment: traceback

\(\delta = 1, \alpha_{pp} = 0, \alpha_{pq} = 1\)

    S I M I L A R I T Y
  0 1 2 3 4 5 6 7 8 9 10
I 1 1 1 2 3 4 5 6 7 8 9
D 2 2 2 2 3 4 5 6 7 8 9
E 3 3 3 3 3 4 5 6 7 8 9
N 4 4 4 4 4 4 5 6 7 8 9
T 5 5 5 5 5 5 5 6 7 7 8
I 6 6 5 6 5 6 6 6 6 7 8
T 7 7 6 6 6 6 7 7 7 6 7
Y 8 8 7 7 7 7 7 8 8 7 6

4 mismatches, 2 gaps

sequence alignment: analysis

Theorem: The DP algorithm computes the edit distance (and an optimal alignment) of two strings of length \(m\) and \(n\) in \(\Theta(mn)\) time and space.

Pf:


Theorem [Backurs-Indyk 2015]: If one can compute the edit distance of two strings of length \(n\) in \(O(n^{2-\epsilon})\) time for some constant \(\epsilon > 0\), then one can solve SAT with \(n\) variables and \(m\) clauses in \(\mathrm{poly}(m)\, 2^{(1-\delta)n}\) time for some constant \(\delta > 0\) (which would disprove SETH)

quiz: dynamic programming

It is easy to modify the DP algorithm for edit distance to...

  1. Compute edit distance in \(O(mn)\) time and \(O(m+n)\) space

  2. Compute an optimal alignment in \(O(mn)\) time and \(O(m+n)\) space

  3. Both A and B

  4. Neither A nor B

\[ \mathrm{OPT}(i,j) = \begin{cases} j\delta & \text{if } i = 0 \\ i\delta & \text{if } j = 0 \\ \min \left\{\begin{array}{lll} \alpha_{x_iy_j} & + & \mathrm{OPT}(i-1,j-1) \\ \delta & + & \mathrm{OPT}(i-1, j) \\ \delta & + & \mathrm{OPT}(i, j-1) \end{array}\right. & \text{otherwise} \end{cases} \]

Dynamic Programming

Hirschberg's algorithm

sequence alignment in linear space

Theorem [Hirschberg]: There exists an algorithm to find an optimal alignment in \(O(mn)\) time and \(O(m+n)\) space

hirschberg's algorithm

Edit distance graph:

hirschberg's algorithm

Edit distance graph:

Pf of Lemma (by strong induction on \(i+j\)):

hirschberg's algorithm

Pf of Lemma (cont'd):

hirschberg's algorithm

Edit distance graph

hirschberg's algorithm

Edit distance graph

 

hirschberg's algorithm

Edit distance graph

hirschberg's algorithm

Edit distance graph

hirschberg's algorithm

Observation 1: The length of a shortest path that uses \((i,j)\) is \(f(i,j) + g(i,j)\)

hirschberg's algorithm

Observation 2: Let \(q\) be an index that minimizes \(f(q,n/2) + g(q,n/2)\). Then, there exists a shortest path from \((0,0)\) to \((m,n)\) that uses \((q,n/2)\)

hirschberg's algorithm

Divide: Find index \(q\) that minimizes \(f(q,n/2) + g(q,n/2)\); save node \((i,j)=(q,n/2)\) as part of solution.

Conquer: Recursively compute optimal alignment in each piece.
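The key subroutine is computing a whole column of \(f\) (or, on reversed strings, of \(g\)) in linear space, keeping only two rows of the DP table. A Python sketch of that subroutine and of the divide step (names are mine):

```python
def nw_score(x, y, delta=1, alpha=lambda a, b: 0 if a == b else 1):
    """Alignment costs of x against every prefix of y, in O(len(y)) space.
    Returns the last row of the DP table: entry j = cost of aligning x with y[:j]."""
    prev = [j * delta for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * delta] + [0] * len(y)
        for j in range(1, len(y) + 1):
            cur[j] = min(alpha(x[i - 1], y[j - 1]) + prev[j - 1],
                         delta + prev[j],
                         delta + cur[j - 1])
        prev = cur  # keep only the previous row
    return prev
```

In the divide step, with `yL, yR = y[:n // 2], y[n // 2:]`, the values `f = nw_score(yL, x)` give \(f(i, n/2)\) and `g = nw_score(yR[::-1], x[::-1])` gives \(g(i, n/2)\) at index \(m - i\); the index \(q\) minimizing `f[i] + g[m - i]` is the saved split node.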

Hirschberg's algorithm: space analysis

Theorem: Hirschberg's algorithm uses \(\Theta(m+n)\) space

Pf:

quiz: dynamic programming

What is the tightest worst-case running time of Hirschberg's algorithm?

  1. \(O(mn)\)

  2. \(O(mn \log m)\)

  3. \(O(mn \log n)\)

  4. \(O(mn \log m \log n)\)

hirschberg's alg: runtime analysis warmup

Theorem: Let \(T(m,n)\) be max running time of Hirschberg's algorithm on strings of lengths at most \(m\) and \(n\). Then, \(T(m,n) = O(mn \log n)\)

Pf:

Remark: Analysis is not tight because two subproblems are of size \((q,n/2)\) and \((m-q,n/2)\). Next, we prove \(T(m,n) = O(mn)\)

hirschberg's alg: runtime analysis

Theorem: Let \(T(m,n)\) be max running time of Hirschberg's algorithm on strings of lengths at most \(m\) and \(n\). Then, \(T(m,n) = O(mn)\)

Pf (by strong induction on \(m+n\)):

hirschberg's alg: runtime analysis

Pf (cont'd):

group: longest common subsequence

Problem: Given two strings \(x_1 x_2 \ldots x_m\) and \(y_1 y_2 \ldots y_n\), find a common subsequence that is as long as possible

Alternative viewpoint: Delete some characters from \(x\) and some characters from \(y\); if the two deletions leave the same string, that string is a common subsequence

Ex:

LCS("GGCACCACG", "ACGGCGGATACG") = "GGCAACG"

- - G G C - - A C C - A C G
A C G G C G G A - - T A C G
---------------------------
    G G C     A       A C G

Applications: Unix diff, git, bioinformatics
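The classic DP sketched in Python, with a traceback to recover one longest common subsequence (function name is mine):

```python
def lcs(x, y):
    """Longest common subsequence of x and y, returned as a string."""
    m, n = len(x), len(y)
    # L[i][j] = length of an LCS of x[:i] and y[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Traceback: walk from (m, n) back to (0, 0), collecting matched characters.
    out, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```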
