In our test project we implemented a function to find open reading frames (ORFs)
in a genome. A next step would be to simulate the synthesis of a protein, i.e.
simulate translation of the sequence into amino acids (for more detail on
transcription and translation check out this video). We can do so by applying
the mapping of codons (3 DNA bases) to amino acids. Let's write a short function,
translate (aminoacid.py), which takes a list of ORFs (like those produced by
gene_finder) and produces a list of translated proteins. For example:
>>> translate(['ATG', 'ATGCCATGTGAA', 'ATGGCATT'])
['M', 'MPCE', 'MA']
What would be a natural data structure to implement that mapping?
A dictionary. The keys would be codons and the values would be the corresponding amino acid.
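As a minimal sketch of that idea, a fragment of such a dictionary might look like the following. Only the handful of codons used in this section's examples are included. This is an assumption for illustration: the real CODONS dictionary would cover all 64 codons, and how it represents stop codons is not specified here.

```python
# Partial codon table (illustration only): just the codons that appear
# in the examples in these notes. A complete table has 64 entries.
CODONS = {
    "ATG": "M",  # methionine (also the start codon)
    "CCA": "P",  # proline
    "TGT": "C",  # cysteine
    "GAA": "E",  # glutamic acid
    "GCA": "A",  # alanine
}

# Look up the amino acid for a single codon
print(CODONS["ATG"])  # M
```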
One subtlety is that our ORF finder can return ORFs that don't end with a valid codon, that is, they have a partial codon at the end. What will happen if we try to use an invalid codon as a key to our CODONS dictionary? We will get a KeyError. How could we handle that case? Some potential approaches (by no means an exhaustive list):
Use the get method to return an empty string when the codon is invalid.
Any of these are viable approaches in this context.
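To make the difference concrete, here is a small sketch contrasting direct indexing with get. The two-entry CODONS dictionary is a stand-in for the full table, used only for illustration.

```python
# Tiny stand-in for the full CODONS dictionary (illustration only)
CODONS = {"ATG": "M", "GCA": "A"}

partial = "TT"  # leftover bases at the end of an ORF

# Indexing with a missing key raises KeyError
try:
    CODONS[partial]
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# get returns the supplied default instead of raising
print(repr(CODONS.get(partial, "")))  # ''
print(CODONS.get("ATG", ""))          # M
```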
What other tools will we need out of our toolbox?
for loops: an outer loop to process the multiple ORFs and an inner loop to translate a single ORF.
Let's focus on the inner loop first. We can use range to iterate by codon, i.e. 3 characters at a time. Here we use get with the CODONS dictionary to ignore partial codons (by turning them into the empty string):
protein = ""
for i in range(0, len(orf), 3):
    codon = orf[i:i+3]
    # Use get to ignore incomplete codons (at the end of the sequence)
    amino = CODONS.get(codon, "")
    protein += amino
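To see why stepping by 3 works even when the ORF length is not a multiple of 3, this short sketch prints the indices and slices the loop would visit. The example ORF is taken from the tests in these notes; the variable names starts and codons are only for illustration.

```python
orf = "ATGGCATT"  # 8 bases: two full codons plus a 2-base partial codon

# range(0, len(orf), 3) yields the start index of each codon
starts = list(range(0, len(orf), 3))
print(starts)  # [0, 3, 6]

# Slicing past the end of a string is safe; the last slice is just shorter
codons = [orf[i:i+3] for i in starts]
print(codons)  # ['ATG', 'GCA', 'TT']
```

The final slice, 'TT', is the partial codon that get turns into the empty string.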
How could we embed this code in the outer loop to translate multiple ORFs?
def translate(orfs):
    """
    Translate a list of ORFs into a list of amino acid sequences

    Args:
        orfs: List of ORFs starting with start codon

    Returns:
        List of potential protein translations
    """
    proteins = []
    for orf in orfs:
        protein = ""
        for i in range(0, len(orf), 3):
            codon = orf[i:i+3]
            # Use get to ignore incomplete codons (at the end of the sequence)
            amino = CODONS.get(codon, "")
            protein += amino
        proteins.append(protein)
    return proteins
How could/should we test our function to build confidence that it works as intended? Thinking back to our earlier discussion of boundaries, what are some different regimes/boundaries we would want to test?
Some examples of these tests:
>>> print(translate(["ATG"]))
['M']
>>> print(translate(["ATG", "ATGCCATGTGAA", "ATGGCAT", "ATGGCATT"]))
['M', 'MPCE', 'MA', 'MA']
The latter tests multiple ORFs, ORFs with one codon and with more than one codon, and the different lengths of partial codons at the end of the sequence (zero, one, or two leftover bases), while keeping the sequences short enough that we can manually determine the correct answers.
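These checks could also be written as assert statements so they can be rerun automatically. The following is a sketch under stated assumptions: CODONS here is a partial stand-in covering only the codons in the test sequences, translate is a condensed copy of the function above, and the empty-list case is an extra boundary added for illustration.

```python
# Partial stand-in for the full codon table (illustration only)
CODONS = {"ATG": "M", "CCA": "P", "TGT": "C", "GAA": "E", "GCA": "A"}

def translate(orfs):
    """Condensed copy of translate so this example runs on its own."""
    proteins = []
    for orf in orfs:
        protein = ""
        for i in range(0, len(orf), 3):
            # get ignores partial codons by mapping them to ""
            protein += CODONS.get(orf[i:i+3], "")
        proteins.append(protein)
    return proteins

# Boundary: no ORFs at all (added for illustration)
assert translate([]) == []
# Boundary: a single ORF with a single codon
assert translate(["ATG"]) == ["M"]
# Partial codons of length 1 and 2 at the end are ignored
assert translate(["ATGGCAT"]) == ["MA"]
assert translate(["ATGGCATT"]) == ["MA"]
# Multiple ORFs, including one with several codons
assert translate(["ATG", "ATGCCATGTGAA"]) == ["M", "MPCE"]
print("all tests passed")
```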
How could we integrate our translate function with our test project gene finder? We could copy those functions into our file. A better approach would be to import our test project like we import math or random. Recall that every Python file is also a module, so assuming that both Python files are in the same directory, we can import tp1_genes to run gene_finder.
>>> import tp1_genes
>>> translate(tp1_genes.gene_finder("AATGCCATGTGAA"))
['M', 'MPCE', 'MA']