CSCI 150 Spring 2020

Prelab 4 Notes

Here we discuss an interesting problem that has parallels with the work you are doing this week for Prelab 4 and Lab 4. There is nothing to turn in for this write-up – it is just for extra practice.

Simulating population genetics (credit)

Genes have different versions, or variants, termed alleles. Different alleles can be associated with different traits, e.g., whether you taste certain chemicals as bitter. Population genetics is the study of how evolutionary processes affect the frequencies of alleles in a population. For example, if a population starts with a mixture of two alleles and neither allele has an advantage over the other, then by chance alone one of the alleles will eventually disappear and the other will be present in 100% of the individuals (described as becoming fixed in the population).

To convince ourselves of this phenomenon, we are going to create a simple simulation of a haploid organism (just one copy of each chromosome) that has two alleles ‘a’ and ‘A’. We will represent our population of size n with a string of length n containing the letters ‘a’ and ‘A’. We will then simulate each new generation by randomly sampling, with replacement, from the current population to create a new population of the same size. Finally, we want to return the number of generations required for one of the alleles to become fixed.

As always, we want to decompose our problem into smaller problems that are easier to solve and thus build up the solution piece by piece. One way to decompose the problem:

Write a function named next_generation that takes the current generation as a parameter and returns the next generation.

What semantic tools are needed here? Likely a for loop, a way to randomly sample from a string, and the string accumulation pattern.

Write a function named simulate_population that takes an initial population as a string and returns the number of generations required for fixation.

What semantic tools are needed here? A loop to iterate over the generations. However, we don’t know how many generations will be required, so we will need a while loop. We will also need a way to detect whether both alleles are still present in the string.

If you want to write a solution, we suggest you start with next_generation and then implement simulate_population.

The first function is a familiar application of the string construction pattern we used in Lab 3 (in which we build strings up character by character). One new feature is the choice function from the random module. This function randomly selects one element from a non-empty sequence. It effectively implements the very common operation seq[randint(0, len(seq)-1)]. You will likely find choice helpful in Lab 4 (and beyond).
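As an illustration of that pattern, here is a minimal sketch of how next_generation might be written. It assumes the population is a non-empty string of ‘a’ and ‘A’ characters, and it is only one possible version, not necessarily the course's own solution.

```python
import random


def next_generation(population):
    """Return a new population of the same size, sampled at random
    (with replacement) from the current population."""
    new_population = ""  # string accumulator, built up one character at a time
    for _ in range(len(population)):
        # choice picks one character at random; equivalent to
        # population[random.randint(0, len(population) - 1)]
        new_population += random.choice(population)
    return new_population
```

Calling next_generation("aaAA") returns a random four-character string such as "aAaa"; the result differs from run to run, but its length always matches the input.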

Let’s focus on simulate_population. We need a loop to generate the successive generations, but we don’t know how many generations, and thus how many loop iterations, will be required. That means we need a while loop. In contrast, in next_generation we know the number of iterations in advance (the size of the population, i.e., the length of the string) and thus could (and should) use a for loop.

When should the loop in simulate_population terminate? When one allele becomes fixed. Alternatively, when should the loop keep executing? As long as both alleles are present in the population, i.e., both “a” and “A” are in the population string. How can we express that as a Boolean expression?

"a" in population and "A" in population

The in operator returns True if its left-hand operand appears in its right-hand operand. For strings, the comparison is case-sensitive, so "a" and "A" are checked separately.
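To see how the expression behaves, here is a small example with two made-up populations (the variable names mixed and fixed are just for illustration):

```python
mixed = "aaAa"   # both alleles present
fixed = "aaaa"   # only 'a' remains

both_in_mixed = "a" in mixed and "A" in mixed
both_in_fixed = "a" in fixed and "A" in fixed

print(both_in_mixed)  # True
print(both_in_fixed)  # False, because "A" does not appear in "aaaa"
```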

Thus our loop looks like:

while "a" in population and "A" in population:
    population = next_generation(population)

If we want to keep track of the number of generations, we need to count how many times the loop executes. For a for loop, that number is known in advance. For a while loop, we need to introduce a counter variable that is incremented each time the loop body executes.

generations = 0
while "a" in population and "A" in population:
    population = next_generation(population)
    generations += 1

Here is an implementation.
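For reference, one way the pieces discussed in these notes might fit together is sketched below. It is assembled from the fragments above and is not necessarily identical to the course's solution; it assumes the initial population is a non-empty string of ‘a’ and ‘A’ characters.

```python
import random


def next_generation(population):
    """Return a new population of the same size, sampled at random
    (with replacement) from the current population."""
    new_population = ""
    for _ in range(len(population)):
        new_population += random.choice(population)
    return new_population


def simulate_population(population):
    """Return the number of generations until one allele is fixed."""
    generations = 0
    # Keep going while both alleles are still present.
    while "a" in population and "A" in population:
        population = next_generation(population)
        generations += 1
    return generations
```

For example, simulate_population("a" * 10 + "A" * 10) returns the number of generations one random run took to reach fixation; the value varies between runs. A population that is already fixed, such as "AAAA", returns 0 because the loop body never executes.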