Class 21: Applications of modules, dictionaries, etc.

Objectives for today

Problem Solving with Dictionaries: Amino Acid translation

In our test project we implemented a function to find open reading frames (ORFs) in a genome. A next step would be to simulate the synthesis of a protein, i.e., simulate the translation of the sequence into amino acids (for more detail on transcription and translation check out this video). We can do so by applying the mapping of codons (3 DNA bases) to amino acids. Let’s write a short function, translate (in aminoacid.py), which takes a list of ORFs (like those produced by gene_finder) and produces a list of translated proteins. For example:

>>> translate(['ATG', 'ATGCCATGTGAA', 'ATGGCATT'])
['M', 'MPCE', 'MA']

What would be a natural data structure to implement that mapping?

A dictionary. The keys would be codons and the values would be the corresponding amino acid.
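As a minimal sketch of that idea, the dictionary below maps just the five codons used in the example above to their one-letter amino acid codes. The name CODONS_SUBSET is hypothetical and used only for illustration; the full CODONS dictionary is assumed to cover the remaining codons as well.

```python
# Hypothetical five-entry subset of the codon table, for illustration only
CODONS_SUBSET = {
    "ATG": "M",  # methionine (start codon)
    "CCA": "P",  # proline
    "TGT": "C",  # cysteine
    "GAA": "E",  # glutamic acid
    "GCA": "A",  # alanine
}

# Indexing with a codon (the key) returns its amino acid (the value)
print(CODONS_SUBSET["ATG"])  # prints M
print(CODONS_SUBSET["GAA"])  # prints E
```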

One subtlety is that our ORF finder can return ORFs that don’t end with a valid codon, that is, they have a partial codon at the end. What will happen if we try to use an invalid codon as a key to our CODONS dictionary? We will get a KeyError. How could we handle that case? Some potential approaches (by no means an exhaustive list):

Any of these are viable approaches in this context.
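Three common ways to guard a dictionary lookup against a missing key can be sketched as follows. This is an illustration under assumptions, not necessarily the set of approaches discussed in class: the two-entry CODONS_SUBSET table and the three function names are hypothetical.

```python
# Hypothetical two-entry codon table, for illustration only
CODONS_SUBSET = {"ATG": "M", "GCA": "A"}

def lookup_with_in(codon):
    """Check membership before indexing."""
    if codon in CODONS_SUBSET:
        return CODONS_SUBSET[codon]
    return ""

def lookup_with_try(codon):
    """Index directly and recover from the KeyError."""
    try:
        return CODONS_SUBSET[codon]
    except KeyError:
        return ""

def lookup_with_get(codon):
    """Let get supply a default value for missing keys."""
    return CODONS_SUBSET.get(codon, "")

# Each approach returns "M" for a valid codon and "" for a partial one
for lookup in (lookup_with_in, lookup_with_try, lookup_with_get):
    print(lookup("ATG"), repr(lookup("TT")))
```

All three behave the same for these inputs; get is simply the most compact when a sensible default value exists.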

What other tools will we need out of our toolbox?

Let’s focus on the inner loop first. We can use range to iterate by codon, i.e., 3 characters at a time. Here we use get with the CODONS dictionary to ignore partial codons (by turning them into the empty string).

protein = ""
for i in range(0, len(orf), 3):
    codon = orf[i:i+3]
    # Use get to ignore incomplete codons (at the end of the sequence)
    amino = CODONS.get(codon, "")
    protein += amino

How could we embed this code in the outer loop to translate multiple ORFs?

def translate(orfs):
    """
    Translate a list of ORFs into a list of amino acid sequences
    
    Args:
        orfs: List of ORFs starting with start codon
    
    Returns:
        List of potential protein translations
    """
    proteins = []    
    for orf in orfs:
        protein = ""
        for i in range(0, len(orf), 3):
            codon = orf[i:i+3]
            # Use get to ignore incomplete codons (at the end of the sequence)
            amino = CODONS.get(codon, "")
            protein += amino
        proteins.append(protein)
    return proteins

Testing

How could/should we test our function to build confidence that it works as intended? Thinking back to our earlier discussion of boundaries, what are some different regimes/boundaries we would want to test?

Some examples of these tests:

>>> print(translate(["ATG"]))
['M']
>>> print(translate(["ATG", "ATGCCATGTGAA", "ATGGCAT", "ATGGCATT"]))
['M', 'MPCE', 'MA', 'MA']

The latter tests multiple ORFs, ORFs with one codon and with more than one codon, and partial codons of different lengths at the end of the sequence, while keeping the sequences short enough that we can determine the correct answers by hand.
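Checks like these can also be written as assert statements so they can be rerun whenever the code changes. The sketch below is self-contained and rests on assumptions: its CODONS table is a hypothetical five-entry subset (just enough for these inputs), translate is a condensed copy of the function above, and the empty-list and no-complete-codon cases are additional boundaries suggested here.

```python
# Hypothetical subset of the codon table; just enough for these tests
CODONS = {"ATG": "M", "CCA": "P", "TGT": "C", "GAA": "E", "GCA": "A"}

def translate(orfs):
    """Translate a list of ORFs into a list of amino acid sequences."""
    proteins = []
    for orf in orfs:
        protein = ""
        for i in range(0, len(orf), 3):
            # get turns a partial (unknown) codon into the empty string
            protein += CODONS.get(orf[i:i+3], "")
        proteins.append(protein)
    return proteins

# Boundary: no ORFs at all
assert translate([]) == []
# Boundary: an ORF with no complete codon
assert translate(["AT"]) == [""]
# A single codon, then trailing partial codons of length 1 and 2
assert translate(["ATG"]) == ["M"]
assert translate(["ATGGCAT", "ATGGCATT"]) == ["MA", "MA"]
# Multiple complete codons
assert translate(["ATGCCATGTGAA"]) == ["MPCE"]
print("All tests passed")
```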

Integration

How could we integrate our translate function with our test project gene finder? We could copy those functions into our file. A better approach would be to import our test project, just as we import math or random. Recall that every Python file is also a module, so assuming that both Python files are in the same directory, we can import tp1_genes and call its gene_finder function.

>>> import tp1_genes
>>> translate(tp1_genes.gene_finder("AATGCCATGTGAA"))
['M', 'MPCE', 'MA']