How to analyze student course progression and identify gaps using R?

I’m working with a dataset that tracks student course enrollments. Here’s a sample of what I’m dealing with:

pupils <- c("X", "X", "X", "X", "Y", "Y", "Y")
classes <- c("Class1", "Class2", "Class3", "Class4", "Class1", "Class5", "Class6")
enrollment_dates <- c('2019-03-15', '2019-04-20', '2019-08-10', '2019-12-05', '2019-03-15', "2019-07-01", '2019-10-20')
data <- data.frame(pupils, classes, enrollment_dates)

data$enrollment_dates <- as.Date(data$enrollment_dates)

I’m trying to figure out:

  1. Which class each student started with
  2. If there was a break in their studies (defined as > 3 months between class start dates)
  3. Whether they picked up another class after a break

I’ve made some progress using dplyr, but I’m stuck on the last part. How can I identify if a student continued after a break? Any help would be great!

I’ve worked with similar datasets in my research, and here’s an approach that might help:

Use the tidyverse for the data manipulation. Start by arranging the data by pupil and date, then group by pupil and calculate the number of days between consecutive classes for each student. You can then flag gaps longer than 90 days (roughly 3 months) as breaks and use lead() to check what happens at the student's next class.

Here’s a sketch of the code:

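library(tidyverse)

result <- data %>%
  arrange(pupils, enrollment_dates) %>%
  group_by(pupils) %>%
  mutate(
    # Days since the student's previous class (NA for their first class)
    time_diff = as.numeric(enrollment_dates - lag(enrollment_dates)),
    # TRUE when this class started more than 90 days after the previous one
    has_break = coalesce(time_diff > 90, FALSE),
    # TRUE when the student paused after this class and then enrolled again
    continued_after_break = lead(has_break, default = FALSE)
  ) %>%
  ungroup()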

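This should give you a good starting point. Keep in mind that the data only records enrollments, so every gap found this way is followed by another class by definition; a student who stopped and never came back will not show a gap at all. You might need to adjust the logic for ‘continued_after_break’ depending on your specific requirements. Hope this helps!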
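
If you would rather have one row per student, here is a minimal sketch. It assumes the sample data from the question and the same 90-day threshold; the column names first_class, n_breaks and first_class_after_break are only illustrative.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  pupils = c("X", "X", "X", "X", "Y", "Y", "Y"),
  classes = c("Class1", "Class2", "Class3", "Class4", "Class1", "Class5", "Class6"),
  enrollment_dates = as.Date(c("2019-03-15", "2019-04-20", "2019-08-10", "2019-12-05",
                               "2019-03-15", "2019-07-01", "2019-10-20"))
)

student_summary <- data %>%
  arrange(pupils, enrollment_dates) %>%
  group_by(pupils) %>%
  # Days since the previous class; NA for each student's first class
  mutate(gap_days = as.numeric(enrollment_dates - lag(enrollment_dates))) %>%
  summarise(
    # Rows are sorted by date, so the first row is the starting class
    first_class = first(classes),
    # Number of gaps longer than 90 days (the assumed break threshold)
    n_breaks = sum(gap_days > 90, na.rm = TRUE),
    # First class started after a break; NA if the student never had one
    first_class_after_break = classes[which(gap_days > 90)[1]],
    .groups = "drop"
  )

print(student_summary)
```

On the sample data, both students start with Class1 and each has two gaps longer than 90 days.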

Hey there, Iris_92Paint! :wave: That’s a really interesting dataset you’re working with. I’m curious about what you’ve tried so far with dplyr. Have you managed to identify the breaks in studies yet?

I’m thinking we could probably use lag() to compare dates between consecutive classes for each student. Maybe something like:

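library(dplyr)

data %>%
  arrange(pupils, enrollment_dates) %>%
  group_by(pupils) %>%
  mutate(
    # Days since the student's previous class (NA for their first class)
    days_since_last_class = as.numeric(enrollment_dates - lag(enrollment_dates)),
    # TRUE when more than 90 days passed before this class started
    had_break = coalesce(days_since_last_class > 90, FALSE),
    # TRUE when the student paused after this class and then enrolled again
    continued_after_break = lead(had_break, default = FALSE)
  ) %>%
  ungroup()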

This is just off the top of my head, so it might need some tweaking. What do you think? Have you tried anything similar?

Also, I’m curious about what you’re planning to do with this information once you’ve identified the breaks and continuations. Are you looking at student retention patterns or something like that? It sounds like a really cool project!

yo iris, i’ve dealt w/ similar stuff before. try this:

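library(dplyr)

data %>%
  arrange(pupils, enrollment_dates) %>%
  group_by(pupils) %>%
  mutate(
    # earliest class per student (rows are already sorted by date)
    first_class = first(classes),
    # TRUE if this class started > 90 days after the previous one (NA for the first class)
    gap = as.numeric(difftime(enrollment_dates, lag(enrollment_dates), units = "days")) > 90,
    # TRUE if a break came right after this class and the student then enrolled again
    resumed = lead(gap, default = FALSE)
  ) %>%
  ungroup()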

this should give u what ur lookin for. lmk if u need more help!
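
One caveat on the 90-day cutoff: the question defines a break as more than 3 months, and 3 calendar months is not always 90 days. Here is a rough sketch of a calendar-month version. It assumes lubridate is installed and uses the sample data from the question; prev_date and gap are just illustrative column names.

```r
library(dplyr)
library(lubridate)

# Sample data from the question
data <- data.frame(
  pupils = c("X", "X", "X", "X", "Y", "Y", "Y"),
  classes = c("Class1", "Class2", "Class3", "Class4", "Class1", "Class5", "Class6"),
  enrollment_dates = as.Date(c("2019-03-15", "2019-04-20", "2019-08-10", "2019-12-05",
                               "2019-03-15", "2019-07-01", "2019-10-20"))
)

monthly <- data %>%
  arrange(pupils, enrollment_dates) %>%
  group_by(pupils) %>%
  mutate(
    # Previous class date; a student's first class is compared with itself,
    # so it can never be flagged as following a break
    prev_date = lag(enrollment_dates, default = first(enrollment_dates)),
    # TRUE when this class started more than 3 calendar months after the previous one
    # (%m+% adds months and rolls back to the month's last day if needed)
    gap = enrollment_dates > (prev_date %m+% months(3))
  ) %>%
  ungroup()

print(monthly)
```

For this sample the flags match the 90-day version, but the two can differ when a gap is close to the threshold.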