Class 15: Lists and Files Problems

Objectives for today

  1. Develop a more complex program with lists and files

Genotype Frequencies

Genes have different versions, or variants, termed alleles. Humans are diploid organisms, that is we have two copies of each chromosome (excluding sex chromosomes), one inherited from our mother, the other from our father. Thus for the alleles ‘a’ and ‘A’, we can have three possible genotypes: ‘aa’, ‘aA’, ‘AA’, i.e. 0, 1, or 2 copies of the ‘A’ allele.

A very useful statistic is the genotype frequency, that is how common is a given genotype in the population. Assume you have a file containing a set of genotypes expressed in terms of allele counts, i.e., one of 0, 1 or 2. Write a function genotype_freq that takes a filename as a parameter and returns a list of genotype frequencies.

How can we do so? As always we want to decompose our problem into smaller pieces. Show a possible decomposition…

Here we could decompose our problem into the following functions:

  1. read_file, for reading the file into a list of integers,
  2. freq, for computing the genotype frequencies from that list, and
  3. gt_freq, that has the filename as a parameter and integrates read_file and freq.

For part one, we can read a file line by line into a list with a for loop. Recall that the allele count is an integer. Thus what else will we need to do to create a list of integers from our file? Show code…

Since the lines will be strings we will also need to convert each line to an integer with the int function.

def read_file(filename):
    """Read file of integers into list, assuming one integer per line
    
    Args:
        filename: Filename as string
    
    Returns:
        List of integers
    """
    with open(filename, "r") as file:
        counts = []
        for line in file:
            counts.append(int(line.strip()))
        return counts

The frequency computation involves counting the number of each possible genotype and dividing by the total number of individuals in the file. Since there are only 3 possible genotypes, a natural starting point is to create three counter variables, one for each possible genotype, e.g.,

def freq(counts):
    gt_0 = 0 # Count of allele count of 0
    gt_1 = 0
    gt_2 = 0
    for allele_count in counts:
        if allele_count == 0:
            gt_0 += 1
        elif allele_count == 1:
            gt_1 += 1
        elif allele_count == 2:
            gt_2 += 1
    # Then normalize the counts...

However, anytime we find ourselves creating multiple instances of a variable, that is often a signal we should be using a different way to structure our data, i.e., a different data structure. How could we use a list data structure to implement a more concise and maintainable version of this function? Show an approach…

Here we could convert the three gt_ variables into a list of length 3, i.e.:

def freq(counts):
    gts = [0, 0, 0] # Counts of allele count 0, 1, and 2
    for allele_count in counts:
        if allele_count == 0:
            gts[0] += 1
        elif allele_count == 1:
            gts[1] += 1
        elif allele_count == 2:
            gts[2] += 1
    # Then normalize the counts...

In this new version, gts[0] - the value in the gts list at position 0 - replaces the single value gt_0, gts[1] replaces gt_1, etc.

With this change we have already made the code more concise. But we also notice that in each of our if-elif branches, the value of allele_count is the same as the index into gts. Based on that observation, we could then rewrite the code as follows without changing how it executes:

def freq(counts):
    gts = [0, 0, 0] # Counts of allele count 0, 1, and 2
    for allele_count in counts:
        if allele_count == 0:
            gts[allele_count] += 1
        elif allele_count == 1:
            gts[allele_count] += 1
        elif allele_count == 2:
            gts[allele_count] += 1
        # Then normalize the counts...
    ...

We now notice that all three branches are executing the exact same code: gts[allele_count] += 1, and thus we could eliminate the if statement entirely without changing the computation:

def freq(counts):
    gts = [0, 0, 0] # Counts of allele count 0, 1, and 2
    for allele_count in counts:
        gts[allele_count] += 1
    # Then normalize the counts...

Here we take advantage of the act that the allele counts correspond to valid list indices, i.e., the first index will be the frequency of ‘aa’ or an allele count of 0, the second index will be the frequency of an allele count of 1, etc. The resulting code is more concise and easier to maintain.

Our last step is to add a second loop to normalize the genotype counts by the total number of individuals (to calculate frequencies). Putting it all together:

def freq(counts):
    """Compute genotype frequency from list of allele counts
    
    Args:
        counts: List of integer allele counts (containing 0, 1, 2)
    
    Returns:
        List of length 3 with genotype frequencies
    """
    # Initialize list of counts for each genotype: 0 (aa), 1 (aA), 2 (AA) 
    gts = [0] * 3
    for allele_count in counts:
        gts[allele_count] += 1
    # Divide by the number of individuals to get frequencies
    for i in range(len(gts)):
        gts[i] /= len(counts)
    return gts

And finally putting it all together:

def gt_freq(filename):
    """
    Compute genotype frequencies from file of genotypes
    
    Args:
        filename: Filename of genotypes files
    
    Returns:
        List of length 3 with genotype frequencies
    """
    counts = read_file(filename)
    return freq(counts)

Running this code on an a file lecture10.5-alleles.txt containing

0
1
1
1
1
1
1
1
1
2

computes the genotype frequencies of [0.1, 0.8, 0.1]:

>>> gt_freq("lecture10.5-alleles.txt")
[0.1, 0.8, 0.1]

Hardy-Weinberg Equilibrium

The Hardy-Weinberg principle states that in the absence of other selective pressures, genotype (and allele) frequencies will remain constant from generation to generation. Deviations from Hardy-Weinberg equilibrium are often used as a quality control procedure in population-scale genomic analyses (among many other applications). Let’s write another function, hardy_weinberg, that prints the observed genotypes frequencies and the expected genotype frequencies under Hardy-Weinberg equilibrium. Our function will take the genotype frequencies list as its argument.

Let’s define the frequency of the ‘A’ allele (the one we counted) as \(p\) and frequency of the ‘a’ allele as \(q=1-p\). We can compute \(p\) as \(p = \frac{2f(AA) + f(aA)}{2}\). Under Hardy-Weinberg equilibrium the expected frequencies of ‘aa’, ‘aA’, and ‘AA’ are \(q^2\), \(2pq\) and \(p^2\) (you can derive these frequencies from the Punnett square). Let’s calculate these frequencies in our hardy_weinberg function. Show a possible implementation…

def hardy_weinberg(gt_freqs):
    """Print expected and observed genotype frequencies, assuming Hardy-Weinberg
    equilibrium
    
    Args:
        gt_freqs: List of length 3 with diploid genotype frequencies
    """
    # Raise an error if the input list has the incorrect size
    assert len(gt_freqs) == 3, "Genotype frequency list should have 3 entries"
    
    p = (2*gt_freqs[2] + gt_freqs[1]) / 2 # Frequency of 'A' allele
    q = 1 -p # Frequency of 'a' allele
    
    # Expected genotype frequencies under Hardy-Weinberg equilibrium
    expected_freq = [q**2, 2*p*q, p**2]
    
    # Print expect and actual frequencies
    for i in range(3):
        print("AC",i,"Expected freq", expected_freq[i], "Actual freq", gt_freqs[i])

The Hardy-Weinberg Wikipedia page includes example data for the scarlet tiger moth, with reported genotype frequencies of [0.9113, 0.0856, 0.0031]. We can run our new function on this data:

>>> hardy_weinberg([0.9113, 0.0856, 0.0031])
AC 0 Expected freq 0.9103068099999999 Actual freq 0.9113
AC 1 Expected freq 0.08758637999999999 Actual freq 0.0856
AC 2 Expected freq 0.0021068099999999998 Actual freq 0.0031

We notice that the expected and observed frequencies are very similar, as would expect for population in Hardy-Weinberg equilibrium! The next steps would be to perform statistical tests of the hypothesis that the population was not in Hardy-Weinberg equilibrium.