# Class 21: Applications of modules, dictionaries, etc.

## Objectives for today

• Use modules and dictionaries in a larger program

## Problem Solving with Dictionaries: Amino Acid translation

In our test project we implemented a function to find open reading frames (ORFs) in a genome. A next step would be to simulate the synthesis of a protein, i.e., simulate the translation of the sequence into amino acids (for more detail on transcription and translation, check out this video). We can do so by applying the mapping of codons (3 DNA bases) to amino acids. Let’s write a short function, translate (aminoacid.py), which takes a list of ORFs (such as those produced by gene_finder) and produces a list of translated proteins. For example:

>>> translate(['ATG', 'ATGCCATGTGAA', 'ATGGCATT'])
['M', 'MPCE', 'MA']


What would be a natural data structure to implement that mapping?

A dictionary. The keys would be codons and the values would be the corresponding amino acid.
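
As a minimal sketch of that idea, a dictionary covering only the codons that appear in this section's examples might look like the following. The name CODONS matches the dictionary used later in these notes, but the contents here are an illustrative subset and not the full 64-entry codon table.

```python
# Illustrative subset of the codon table (assumption: the real CODONS
# dictionary used in class would contain all 64 codons)
CODONS = {
    "ATG": "M",  # Methionine (also the start codon)
    "CCA": "P",  # Proline
    "TGT": "C",  # Cysteine
    "GAA": "E",  # Glutamic acid
    "GCA": "A",  # Alanine
}

# Look up the amino acid for a single codon
print(CODONS["CCA"])
```

Indexing with a codon that is a key returns its one-letter amino acid code, which is the lookup the translation will repeat for every codon in an ORF.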

One subtlety is that our ORF finder can return ORFs that don’t end with a complete codon, that is, they have a partial codon at the end. What will happen if we try to use a partial codon as a key to our CODONS dictionary? We will get a KeyError. How could we handle that case? Some potential approaches (by no means an exhaustive list):

• Subtract the length of any partial codon from the length of the ORF before translation
• Use a conditional to only translate complete codons
• Use the get method to return an empty string when the codon is invalid

Any of these are viable approaches in this context.
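
To make the comparison concrete, here is a rough sketch of the three approaches applied to a single ORF with a partial codon at the end. The two-entry CODONS dictionary and the variable names are assumptions made for illustration only.

```python
# Illustrative subset of the codon table (assumption, not the full table)
CODONS = {"ATG": "M", "GCA": "A"}

orf = "ATGGCATT"  # ends with the partial codon "TT"

# Approach 1: shorten the range so it only covers complete codons
complete_length = len(orf) - len(orf) % 3
protein1 = ""
for i in range(0, complete_length, 3):
    protein1 += CODONS[orf[i:i+3]]

# Approach 2: use a conditional to skip any slice shorter than 3 bases
protein2 = ""
for i in range(0, len(orf), 3):
    codon = orf[i:i+3]
    if len(codon) == 3:
        protein2 += CODONS[codon]

# Approach 3: use get with a default of "" for keys not in the dictionary
protein3 = ""
for i in range(0, len(orf), 3):
    protein3 += CODONS.get(orf[i:i+3], "")

print(protein1, protein2, protein3)
```

All three produce the same protein here. One difference worth keeping in mind: get would also silently skip a complete three-letter string that is missing from the dictionary, while the first two approaches would raise a KeyError in that situation.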

What other tools will we need out of our toolbox?

• Nested for loops, an outer loop to process the multiple ORFs, the inner loop to translate a single ORF
• The list and string construction “patterns” to build up the list of proteins and each protein respectively.

Let’s focus on the inner loop first. We can use range with a step of 3 to iterate by codon, i.e. 3 characters at a time. Here we use get with the CODONS dictionary to ignore partial codons (by turning them into the empty string).

protein = ""
for i in range(0, len(orf), 3):
    codon = orf[i:i+3]
    # Use get to ignore incomplete codons (at the end of the sequence)
    amino = CODONS.get(codon, "")
    protein += amino
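
To see why a partial codon can show up at all, it may help to look at just the slices this loop generates. The short sketch below collects them in a list (the codons variable is introduced here purely for illustration).

```python
orf = "ATGGCATT"

# Collect each 3-character slice produced by stepping through the ORF
codons = []
for i in range(0, len(orf), 3):
    codons.append(orf[i:i+3])

print(codons)  # ['ATG', 'GCA', 'TT']
```

Slicing past the end of a string does not raise an error; it returns whatever characters remain, so the final slice can be shorter than 3 characters.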


How could we embed this code in the outer loop to translate multiple ORFs?

def translate(orfs):
    """
    Translate a list of ORFs into a list of amino acid sequences

    Args:
        orfs: List of ORFs starting with start codon

    Returns:
        List of potential protein translations
    """
    proteins = []
    for orf in orfs:
        protein = ""
        for i in range(0, len(orf), 3):
            codon = orf[i:i+3]
            # Use get to ignore incomplete codons (at the end of the sequence)
            amino = CODONS.get(codon, "")
            protein += amino
        proteins.append(protein)
    return proteins


### Testing

How could/should we test our function to build confidence that it works as intended? Thinking back to our earlier discussion of boundaries, what are some different regimes/boundaries we would want to test?

• One/more than one ORF
• One/more than one codon
• Partial/no partial codons

Some examples of these tests:

>>> print(translate(["ATG"]))
['M']
>>> print(translate(["ATG", "ATGCCATGTGAA", "ATGGCAT", "ATGGCATT"]))
['M', 'MPCE', 'MA', 'MA']


The latter tests multiple ORFs, both one and more than one codon, and partial codons of different lengths at the end of the sequence, while keeping the sequences short enough that we can determine the correct answers by hand.
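
One possible way to make these checks repeatable is to write them as assert statements, one per boundary. The sketch below is self-contained, so it redefines translate as above and uses an illustrative subset of the codon table; the empty-list case is an extra boundary added here as an assumption about desired behavior, not something covered in the examples above.

```python
# Illustrative subset of the codon table (assumption, not the full table)
CODONS = {"ATG": "M", "CCA": "P", "TGT": "C", "GAA": "E", "GCA": "A"}

def translate(orfs):
    """Translate a list of ORFs into a list of amino acid sequences"""
    proteins = []
    for orf in orfs:
        protein = ""
        for i in range(0, len(orf), 3):
            # Use get to ignore incomplete codons at the end of the sequence
            protein += CODONS.get(orf[i:i+3], "")
        proteins.append(protein)
    return proteins

# No ORFs at all (extra boundary, assumed to produce an empty list)
assert translate([]) == []
# One ORF, one codon
assert translate(["ATG"]) == ["M"]
# One ORF, more than one codon, no partial codon
assert translate(["ATGCCATGTGAA"]) == ["MPCE"]
# Partial codons of length 1 and 2 at the end
assert translate(["ATGGCAT"]) == ["MA"]
assert translate(["ATGGCATT"]) == ["MA"]
# More than one ORF
assert translate(["ATG", "ATGCCATGTGAA"]) == ["M", "MPCE"]

print("All tests passed")
```

If any assertion fails, Python raises an AssertionError at that line, which points directly at the boundary that is not handled correctly.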

### Integration

How could we integrate our translate function with our test project gene finder? We could copy those functions into our file. A better approach would be to import our test project like we import math or random. Recall that every Python file is also a module, so assuming that both Python files are in the same directory, we can import tp1_genes to run the gene_finder.

>>> import tp1_genes
>>> translate(tp1_genes.gene_finder("AATGCCATGTGAA"))
['M', 'MPCE', 'MA']