Class 12

Objectives for today

  1. Utilize design patterns to develop solutions
  2. Develop a more complex program with for and while loops

Simulating population genetics (credit)

Genes have different versions, or variants, termed alleles. Different alleles can be associated with different traits, e.g. whether you taste certain chemicals as bitter. Population genetics is the study of how evolutionary processes affect the frequencies of alleles in a population. For example, if a finite population starts with a mixture of two alleles and neither allele has an advantage over the other, then chance alone will eventually cause one of the alleles to disappear and the other to be present in 100% of the individuals (described as becoming fixed in the population).

To convince ourselves of this phenomenon, we are going to create a simple simulation of a haploid organism (just one copy of each chromosome) that has two alleles ‘a’ and ‘A’. We will represent our population of size n with a string of length n containing the letters ‘a’ and ‘A’. We will then simulate each new generation by randomly sampling from the current population with replacement to create a new population of the same size. We then want to return the number of generations required for one of the alleles to become fixed.

As always we want to decompose our problem into smaller problems that are easier to solve and thus build up the solution piece-by-piece. How could we break this problem into a set of functions that solve smaller problems and what semantic tools are needed for those functions? Show a possible decomposition…

  1. Write a function named next_gen that takes the current generation as a parameter and returns the next generation.

    What semantic tools are needed here? Likely a for loop, a way to randomly sample from a string, and the string accumulation pattern.

  2. Write a function named pop_sim that takes an initial population as a string and returns the number of generations required for fixation.

    What semantic tools are needed here? A loop to iterate over the generations. And a way to detect if both alleles are present in the string.

I would suggest we start with next_gen and then implement pop_sim. next_gen has a single string parameter, pop, the current population, and returns the new population, a string of the same size. An example would be:

>>> pop = "AAAAaAaAAA"
>>> next_gen(pop)
'aAAaAAaAAA'

What “pattern” will this function likely take? We could use the string construction pattern we used in Lab 3 in which we build strings up character by character in a for loop. What would the body of that loop look like? Show a possible implementation…

Recall our pattern looks something like

def next_gen(pop):
    next_pop = ""
    for i in range(len(pop)):
        next_pop += ...
    return next_pop

Here we want to randomly sample from pop with replacement. As we did before, we can use indexing and randint, e.g. pop[randint(0, len(pop)-1)]. As you might imagine, this is a very common task, and so the random module has a choice function to do exactly this kind of sampling. The choice function randomly selects one element from a non-empty sequence. I suspect you will find choice helpful in Lab 4 (and beyond). Putting it all together (with a docstring):

from random import choice

def next_gen(pop):
    """
    Generate the next generation by randomly sampling from the
    current population
    
    Args:
        pop: Current population as a string
    
    Returns:
        Next generation as a string
    """
    next_pop = ""
    # One draw (with replacement) per individual keeps the population size fixed
    for i in range(len(pop)):
        next_pop += choice(pop)
    return next_pop
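
As a quick check that the two sampling approaches mentioned earlier (indexing with randint and calling choice) do the same job, here is a minimal sketch that draws one individual each way. The seed value and the sample population are arbitrary choices for illustration, not part of the simulation.

```python
from random import choice, randint, seed

seed(12)  # arbitrary seed, only so repeated runs give the same draws

pop = "AAAAaAaAAA"

# Sampling by index: randint includes both endpoints, hence len(pop) - 1
by_index = pop[randint(0, len(pop) - 1)]

# Sampling directly: choice picks one element of a non-empty sequence
by_choice = choice(pop)

print(by_index, by_choice)
```

Either way, each draw is a single character from pop, and because pop itself is never modified, the same individual can be drawn more than once (sampling with replacement).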

Now let’s turn to pop_sim, which has a single string parameter, pop, the initial population, and returns the number of generations until fixation. We need a loop to generate the successive generations, but do we know how many generations we will have to simulate? No. So which type of loop will we need? A while loop. Why could we use a for loop for next_gen, but not here?

In next_gen, we know the number of iterations (the size of the population or the length of the string) and thus could (and should) use a for loop.

When should the loop in pop_sim terminate? When one allele becomes fixed. Alternatively, when should the loop keep executing? As long as both alleles are present in the population, i.e. both “a” and “A” are in the population string. How can we express that as a conditional statement?

The in operator returns True if its left-hand operand is present in its right-hand operand. For strings, the test is case-sensitive, so “a” and “A” are checked separately.

"a" in pop and "A" in pop

Thus our loop looks like:

while "a" in pop and "A" in pop:
    pop = next_gen(pop)

If we want to keep track of the number of generations, we need to count how many times the loop executes. With a for loop, that count is known in advance. With a while loop, we need to introduce a counter variable that is incremented each time the loop executes. Putting it all together…

def pop_sim(pop):
    """
    Simulate allele fixation in a population
    
    Args:
        pop: Initial population as a string
    
    Returns:
        Integer number of generations needed to achieve fixation
    """
    generations = 0
    while "a" in pop and "A" in pop:
        pop = next_gen(pop)
        generations += 1
    return generations
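
To try the two functions together, here is a minimal runnable sketch. The helper make_pop, the population size of 10, and the 100 trials are all assumptions made for illustration; the linked full implementation may generate its initial population differently.

```python
from random import choice

def next_gen(pop):
    # Sample len(pop) individuals from pop with replacement
    next_pop = ""
    for i in range(len(pop)):
        next_pop += choice(pop)
    return next_pop

def pop_sim(pop):
    # Count generations until only one allele remains
    generations = 0
    while "a" in pop and "A" in pop:
        pop = next_gen(pop)
        generations += 1
    return generations

def make_pop(num_a, num_A):
    # Hypothetical helper: order does not matter because sampling is random
    return "a" * num_a + "A" * num_A

pop = make_pop(5, 5)

# A single run is random, so average over many trials
trials = 100
total = 0
for i in range(trials):
    total += pop_sim(pop)

print("Average generations to fixation:", total / trials)
```

The result varies from run to run because each generation is sampled randomly. Note that a population that is already fixed, e.g. "AAAA", never enters the loop, so pop_sim returns 0 for it.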

Check out a full implementation including a function to generate an initial population.