Advent of Code 2021: Day 6


The Data


For Day 6 we have to model the growth of a population of fish…this is much closer to my comfort zone. See the explanation for today’s challenge here. We are given a vector of integers that represents the time since last reproduction for every fish. When a fish reaches a value of 0 it will reproduce with a clutch size of 1 (which seems a tad too few for a fish…but this is a coding challenge not a biology challenge). The inter-birth interval of an adult is 6 days and the age at first reproduction for a newly born fish is 8 days. Now we have all the information, let’s get started.

#Load data
day6_data <- as.integer(scan(file = "./data/Day6.txt", sep = ","))

day6_data
##   [1] 5 1 1 5 4 2 1 2 1 2 2 1 1 1 4 2 2 4 1 1 1 1 1 4 1 1 1 1 1 5 3 1 4 1 1 1 1
##  [38] 1 4 1 5 1 1 1 4 1 2 2 3 1 5 1 1 5 1 1 5 4 1 1 1 4 3 1 1 1 3 1 5 5 1 1 1 1
##  [75] 5 3 2 1 2 3 1 5 1 1 4 1 1 2 1 5 1 1 1 1 5 4 5 1 3 1 3 3 5 5 1 3 1 5 3 1 1
## [112] 4 2 3 3 1 2 4 1 1 1 1 1 1 1 2 1 1 4 1 3 2 5 2 1 1 1 4 2 1 1 1 4 2 4 1 1 1
## [149] 1 4 1 3 5 5 1 2 1 3 1 1 4 1 1 1 1 2 1 1 4 2 3 1 1 1 1 1 1 1 4 5 1 1 3 1 1
## [186] 2 1 1 1 5 1 1 1 1 1 3 2 1 2 4 5 1 5 4 1 1 3 1 1 5 5 1 3 1 1 1 1 4 4 2 1 2
## [223] 1 1 5 1 1 4 5 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 4 2 1 1 1 2 5 1 4 1 1 1 4 1
## [260] 1 5 4 4 3 1 1 4 5 1 1 3 5 3 1 2 5 3 4 1 3 5 4 1 3 1 5 1 4 1 1 4 2 1 1 1 3
## [297] 2 1 1 4


The Challenges


Challenge 1


First off, we need to determine how many individuals would be in the population after 80 days (assuming no mortality in adults or offspring). The first way we might think to do this is just brute force. Keep the data in its original format and count the number of 0s in every time step to identify reproducers.

final_pop80 <- day6_data

system.time({for (i in 1:80){
  
  #At the start of the time step, work out how many individuals reproduce
  reproducer_location <- final_pop80 == 0
  
  #Reduce the time of all individuals by 1
  final_pop80 <- final_pop80 - 1
  
  #Reset the age of reproducers to 6
  final_pop80[reproducer_location] <- 6
  
  #For every reproducer, add a new individual with a value of 8.
  final_pop80 <- append(final_pop80, rep(8, times = sum(reproducer_location)))
  
}})
##    user  system elapsed 
##   0.026   0.009   0.035

This ran in no time and we can quickly get our answer! It seems that the brute force approach works fine.

length(final_pop80)
## [1] 377263

Challenge 2


For challenge 2 we need to push things a bit further and predict the population after 256 days. Could our brute force approach still work for this? The worrying detail is that without any mortality and a constant rate of reproduction this population of fish is going to be growing exponentially so the size of our integer is going to get very unmanageable very quickly. This is exactly what happens if we try our brute force approach, eventually maxing out our memory and crashing R. I’ve run the code below to show you how much trouble it has but I’ve capped it at 1min run time.

final_pop <- day6_data

R.utils::withTimeout({for (i in 1:256){
    
    #At the start of the time step, work out how 
    reproducer_location <- final_pop == 0
    
    #Reduce the time of all individuals by 1
    final_pop <- final_pop - 1
    
    #Reset the age of reproducers to 6
    final_pop[reproducer_location] <- 6
    
    #For every reproducer, add a new individual with a value of 8.
    final_pop <- append(final_pop, rep(8, times = sum(reproducer_location)))
  }
    #Timeout after 60 seconds
  }, timeout = 60)
## [2022-05-05 09:24:20] TimeoutException: reached elapsed time limit [cpu=60s,
## elapsed=60s]

It seems we need another method. The key here is that (at least in this simplified fish population) every individual with the same number can be grouped together. They have no other distinguishing characteristics. So, instead of starting with a vector of 300 integers, we can group all individuals into age classes and instead just work with a vector of length 9 (values 0 - 8).

#Turn our vector into a table
table_of_ages <- table(day6_data)

#Extend this to include all possible ages.
#Make a named vector so we can easily index it
ages_vector <- tidyr::replace_na(as.vector(table_of_ages[as.character(0:8)]), 0)
names(ages_vector) <- 0:8

ages_vector
##   0   1   2   3   4   5   6   7   8 
##   0 168  31  29  37  35   0   0   0

The values in our vector might change, but the length of the vector will never grow because individuals cannot have a value outside of this range. Now we can run very similar code to above but using our new age classes rather than individuals.

pop_size <- as.numeric(NULL)
system.time({for (i in 1:256){

  reproducers <- as.numeric(ages_vector["0"])

  ages_vector[as.character(0:7)] <- ages_vector[as.character(1:8)]

  ages_vector["6"] <- as.numeric(ages_vector["6"]) + reproducers
  ages_vector["8"] <- reproducers

  pop_size <- append(pop_size, sum(ages_vector))

}})
##    user  system elapsed 
##   0.007   0.000   0.007

As you can see, our final population size is very large…no wonder we couldn’t use the brute force approach.

pop_size[length(pop_size)]
## [1] 1695929023803

BONUS ROUND


Unlike the data we used for this challenge, in reality when studying a population of animals we probably don’t know the exact age of every individual or the precise rules that might govern their reproduction. In many cases, our data might just be population counts. What could we do then?

In this example, we are given information that the population is growing exponentially and this can be modelled using the exponential growth model:

\[N_{t+1} = RN_{t}\]

Where the population at time t (\(N_{t}\)) will grow by a constant reproduction factor (\(R\)). This parameter \(R\) will be different for every animal and population, but we can estimate it by looking at the relationship between \(N_{t}\) and \(N_{t+1}\) in the data we have already. Let’s imagine we only have population size data from the first 100 days. We can use a linear model (lm()) to estimate the relationship between \(N_{t}\) and \(N_{t+1}\) in this subset of data.

Note: We fix the intercept of the model below to be 0 by including ~ 0 + ... in the model formula. This normally wouldn’t be advised, but in this case we know a priori that when \(N_{t}\) is 0, \(N_{t+1}\) must also be 0. An extinct population can’t grow!!

input_data <- data.frame(nt = pop_size[1:100]) %>% 
  mutate(nt_1 = lead(nt))

our_model <- lm(nt_1 ~ 0 + nt, data = input_data)
  
summary(our_model)
## 
## Call:
## lm(formula = nt_1 ~ 0 + nt, data = input_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57015  -1033    -39    766  59762 
## 
## Coefficients:
##    Estimate Std. Error t value            Pr(>|t|)    
## nt 1.091493   0.002916   374.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14150 on 98 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9993, Adjusted R-squared:  0.9993 
## F-statistic: 1.401e+05 on 1 and 98 DF,  p-value: < 0.00000000000000022

The parameter estimate for nt in our model summary above is an estimation of the reproduction factor (\(R\)) for this population. We can now determine the population size at any time using the below equation:

\[N(t) = N_{0}R^t\]

Where \(N_{0}\) is our starting population size and \(t\) is time. How does our population estimate from this exponential model compare to our calculation with exact individual level data?

#Extract R value
R <- our_model$coefficients[1]

#Write out exponential func to estimate N at any time point
Nt_func <- function(N0, R, t){
  
  N0*R^t
  
}

#Run this function from 1 - 256 days
predicted_N <- Nt_func(N0 = length(day6_data), R, t = 1:256)
#Percentage difference between our population estimates
(diff(c(predicted_N[256], pop_size[256]))/pop_size[256])*100
## [1] 4.268097

Although we didn’t have any information about the age of individuals or their exact reproductive behaviour in our exponential growth model we were able to estimate a population size that was only 4% different from the exact value we calculated where we had complete information about every individual. This is an example where estimating a value using a mathematical model can provide a plausible alternative in cases where we don’t know all the details about a particular system.



See previous solutions here:

Post-doctoral researcher

Ecologist and data science, interested in using data science techniques for conservation outcomes.