Class 22

Applications: Sequence alignment

Objectives for today

  • Develop a recursive algorithm to align two DNA sequences
  • Use memoization to improve efficiency of algorithms with overlapping sub-problems
  • Apply your program to identifying a genetic mutation in real sequencing data

Sequence alignment

We can learn a lot from aligning biological sequences, i.e., DNA or protein sequences. We can infer information about an unknown sequence from the regions of similarity and difference with a known sequence(s). Our goal is to identify the alignment of two sequences, e.g., the strings “TGTCC” and “TGATC” shown below, with gaps that maximizes a biological-relevant scoring metric. That scoring metric often assigns a positive score for matching letters and a negative score for mismatches and gaps (i.e., introducing a “-” into one of the sequences).

TG-TCC
TGATC-

A recursive formulation

A naive approach of trying all possible alignments of strings of length n and m has an impossible time complexity of \(O(\frac{(n+m)!}{n!m!})\). Fortunately we can do much better! A key insight is that we can express the alignment recursively, i.e., the optimal alignment of two sequences is the optimal alignment of two shorter sequences plus the alignment of the last letters. Specifically, we can express the score of the optimal alignments of strings of length \(i\) and \(j\) as:

\[ S_{i, j} = \max \begin{cases} S_{i-1, j-1} + \text{match}(i, j) \\ S_{i-1, j} + \text{gap} \\ S_{i, j-1} + \text{gap} \end{cases} \]

where \(\text{match}(i, j)\) is the score of aligning the \(i\)th letter of the first sequence with the \(j\)th letter of the second sequence, and \(\text{gap}\) is the score of introducing a gap. The base case is \(S_{0,0} = 0\).

Achieving quadratic time performance

The recursive formulation alone is not efficient. It will recompute the score of the optimal alignment of the same subsequences many times. What observe about this problem is that solution is expressed not just in terms of smaller versions of the same problem (termed sub-problems), but in terms of overlapping sub-problems, i.e., the same smaller version of the problem multiple times. Dynamic programming approaches improve the performance of these algorithms by only solving each sub-problem once! We can do so in our case via “memoization”, i.e., storing the score of the optimal alignment of each pair of subsequences when we compute it. If we encounter that problem again, we can simply look up the score! With this approach we can achieve a time complexity of \(O(nm)\), where \(n\) and \(m\) are the lengths of the two sequences!

Download the starter file to implement memoization. To see the impact, we added a decorator to track the recursive calls. After execution, e.g., of global_align, with the calls attribute, e.g., global_align.calls. After implementing memoization, compare the call counts between global_align and global_align.memo. You can check out the full implementation here.

Extracting the alignments

Our functions so far have computed the score and the edges we “traversed” to compute the score at each node at the lattice, but not the alignment itself. To obtain the actual alignment, we “backtrack” from the sink node to the source node, following the edges the make up the optimal alignment. Recall that each edge is associated with a specific alignment: diagonal edges align two letters, vertical or horizontal edges align letters with gaps. We use that mapping to generate the aligned strings.

Using our program for aligning sequencing data

Short-read sequencing (SRS) is one of the most commonly used technologies for sequencing the genome, i.e., inferring the genome sequence. It works by fragmenting the DNA into many pieces of approximately 500 nucleotides in length, and then sequencing 100-150 nucleotides of both ends of those fragments. Thus the reads (each fragment yields two reads) are relatively short (hence the name). The first step is to align those fragments back to the reference genome to infer where in the genome they originated. This is conceptually the same problem we solved above, with one difference. One sequence is very long, the reference genome, and one is very short, the read sequence. The short sequence will necessarily have gaps and the beginning and the end that should not be penalized. We describe this version of the alignment problem as “fitting” alignment (we are finding where the shorter string fits in the longer string).

To eliminate those penalties, we add additional “free” or “taxi” edges into the lattice, that have 0 scores. Specifically, we can jump from the source to any node in the first column and to the sink from any node in the last column with no penalty. The result is we align the shorter sequence, in its entirety, at the best location within the longer sequence.

To implement this alternate algorithm, we make the following changes:

  • Change the recursive case for the first column to compute the max of the existing vertical edge, and 0, i.e., the score will never be less than 0 in the first column
  • Choose the “sink” to be the highest scoring node in the right column instead of the bottom right corner

With those in modifications in place, we can now implement fitting_alignment function to generate the fitting alignment of two sequences:

def fitting_alignment(ref_seq, query_seq):
    scores = {}
    fit_align_back(ref_seq, query_seq, len(ref_seq), len(query_seq), scores)
    
    ref_end, query_end = find_fit_start( len(ref_seq), len(query_seq), scores)
    assert query_end == len(query_seq), "Fitting alignment should end in last column"
    
    ref_align, query_align = backtrack(ref_seq, query_seq, ref_end, query_end, scores)
    # How many gaps at end of query_seq?
    tail = len(ref_seq) - ref_end
    return (scores[(ref_end, query_end)][0], ref_align + ref_seq[ref_end:], query_align + "-" * tail)

In the following notice that the shorter string “TAGATA” aligns completely within the longer string “TTAGTAGGCTTAAGGTTA”:

>>> fitting_alignment("TTAGTAGGCTTAAGGTTA", "TAGATA")
(4, 'TTAG-TAGGCTTAAGGTTA', '-TAGATA------------')

Check out the complete program in alignment.py. It includes an additional function pileup for generating the pileup visualization (shown below for a set of reads in NA12878 for a region on chromosome 7). In the pileup, the genome coordinates are across the top. The first line is the human reference genome, the remaining rows are the individual sequencing reads which provide evidence for this individual true genome sequence in this region. Each column represents the data we have for the individual’s genome at that location. Notice that most columns are the same from top to bottom. But one, towards the middle, has a mix of Ts and Cs. This indicates that this person has a heterozygous variant at this location, that is they inherited a “T” from one parent, and “C” from the other. Recall that humans are diploid organisms, that we is we inherit one copy of our genome from our mother and one from our father. Thus, we can have two different sequences (alleles) at the same location in the genome.

The pileup we created shows a known variant in NA12878. You can check out this variant in more detail with the MySeq tool. Enter chr7:141672604 into the search box(chromosome 7, position 141672604) and click on the variant that is listed.

This is a “well-known” variant associated with tasting certain chemicals (e.g., cabbage, raw broccoli, coffee and dark beer) as bitter. If you have the variant allele you likely taste those chemicals and foods as bitter, if you don’t, those foods may taste less bitter to you.

AATCTGCCTTGTGGTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAGGGGCACAGAGATGAAGGCAGC
--TCTGCCTTGTGGTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGC---------------------------------------------------------------------------------------------------
---CTGCCTTGTGGTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCT--------------------------------------------------------------------------------------------------
---CTGCCTTGTGGTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACGGCTCTCCTCAACGTGGCATTGCCTGAGATCAGGACGGCT--------------------------------------------------------------------------------------------------
---------TGTGGTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGC--------------------------------------------------------------------------------------------
-----------TGGTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCC------------------------------------------------------------------------------------------
-----------TGGTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCC------------------------------------------------------------------------------------------
------------GGTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCA-----------------------------------------------------------------------------------------
-------------GTCGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAG----------------------------------------------------------------------------------------
---------------CGGCTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAG--------------------------------------------------------------------------------------
------------------CTCTTACCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGA-----------------------------------------------------------------------------------
------------------------CCTTCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCT-----------------------------------------------------------------------------
---------------------------TCAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCC--------------------------------------------------------------------------
----------------------------CAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCA-------------------------------------------------------------------------
----------------------------CAGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCA-------------------------------------------------------------------------
-----------------------------AGGCTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCAT------------------------------------------------------------------------
--------------------------------CTGCTCTGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCATTAT---------------------------------------------------------------------
--------------------------------------TGAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAAC---------------------------------------------------------------
---------------------------------------GAGCCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACA--------------------------------------------------------------
------------------------------------------CCCAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAATCTGCCATTATCCCAACACAA-----------------------------------------------------------
--------------------------------------------CAGAGCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAAC---------------------------------------------------------
------------------------------------------------GCAGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATC-----------------------------------------------------
--------------------------------------------------AGAATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCAC---------------------------------------------------
-----------------------------------------------------ATGGTCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCC------------------------------------------------
---------------------------------------------------------TCATCACAGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATT--------------------------------------------
----------------------------------------------------------------AGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGC-------------------------------------
----------------------------------------------------------------AGCTCTCCTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGC-------------------------------------
-------------------------------------------------------------------TCTCCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCC----------------------------------
----------------------------------------------------------------------CCTCAACTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACA-------------------------------
-----------------------------------------------------------------------CTCAACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAG------------------------------
--------------------------------------------------------------------------AACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAAT---------------------------
--------------------------------------------------------------------------AACTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAAT---------------------------
----------------------------------------------------------------------------CTTGGCATTGCCTGAGATCAGGACGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCA-------------------------
----------------------------------------------------------------------------CTTGGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCA-------------------------
-------------------------------------------------------------------------------GGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTA----------------------
-------------------------------------------------------------------------------GGCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTA----------------------
--------------------------------------------------------------------------------GCATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAG---------------------
----------------------------------------------------------------------------------ATTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAGGG-------------------
-----------------------------------------------------------------------------------TTGCCTGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAGGGG------------------
----------------------------------------------------------------------------------------TGAGATCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAGGGGCACAG-------------
---------------------------------------------------------------------------------------------TCAGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAGGGGCACAGAGATG--------
-----------------------------------------------------------------------------------------------AGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAGGGGCACAGAGATGAA------
-----------------------------------------------------------------------------------------------AGGATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAGGGGCACAGAGATGAA------
-------------------------------------------------------------------------------------------------G-TGGCTGCATGCCTAGAGGGACAAGCTGCCATTATCCCAACGCAAACCATCCCCCCTATTTTGTCGCGCCACAGAATCAGTAGGGGCACAGAGATGAAGGC---
--------------------------------------------------------------------------------------------------ATGGCTGCATGCCCAGAGGGACAAGCTGCCATTATCCCAACACAAACCATCACCCCTATTTTGTCGCGCCACAGAATCAGTAGGGGCACAGAGATGAAGGC---

Why these applications?

Why did learn about these applications this week? Other than that they are really cool…

What I hope you observed is that these algorithms and approaches leveraged the same skills and tools you have been learning about all semester:

  • We utilized (nested) loops (something that may be relevant for project 2!)
  • We utilized recursion
  • Throughout we picked relevant and appropriate data structures, from lists, tuples, dictionaries, etc. for the particular needs of the algorithm
  • We decided which style of loops - for or while - was needed/most appropriate
  • We used tools for reading files and reading data URLs

The tools in your toolbox are real tools that can be applied to real problems. There aren’t secret features I have been holding back. Instead as you proceed in CS you will use these same tools in ever more complex combinations. That is the complexity and capability arises from how we combine the tools we have already learned about.

A computer is a meta device. But it is still limited by the software others have provided. Developing algorithmic problem solving and implementation skills enables you to solve any problem, not just those someone else thought to solve already!