I’m trying to scrape a university course catalog using R. My code works okay but I’m having trouble getting the right discipline info for each course. Here’s what I’ve got so far:
library(rvest)
library(dplyr)
get_course_info <- function(url) {
  page <- read_html(url)
  # grab every course table on the page
  courses <- page %>%
    html_nodes('.course-listing') %>%
    html_table()
  # grab every discipline header on the page
  disciplines <- page %>%
    html_nodes('.discipline-header') %>%
    html_text()
  # try to attach the disciplines to the first course table
  result <- bind_cols(
    courses[[1]],
    discipline = rep(disciplines, each = nrow(courses[[1]]))
  )
  return(result)
}
url <- 'https://example-university.edu/courses'
course_data <- get_course_info(url)
This grabs the course info and tries to match it with disciplines. But it’s not quite right. For example, a Polish language course is getting tagged with all language disciplines instead of just Polish.
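I think part of the problem is my rep() call — with more than one discipline header it builds a vector longer than the table, so nothing lines up. A toy example (values made up) of what I mean:

disciplines <- c("Polish", "Spanish", "German")
rep(disciplines, each = 2)
# [1] "Polish"  "Polish"  "Spanish" "Spanish" "German"  "German"
# six values against a two-row course table, so the pairing can't be right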
How can I improve this to accurately match each course to only its specific discipline? I want to avoid hard-coding assumptions about the page structure in case it changes. Any ideas to make this more robust?
yo iris, i had a similar prob scraping course data. try using xpath instead of css selectors. it’s more flexible for nested stuff. something like:
disciplines <- page %>%
  html_nodes(xpath = '//div[contains(@class, "discipline-header")]') %>%
  html_text()
courses <- page %>%
  html_nodes(xpath = '//div[contains(@class, "discipline-header")]/following-sibling::div[contains(@class, "course-listing")]') %>%
  html_table()
this might grab the right pairings. lmk if it helps!
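oh and a quick sanity check for whether the pairing worked — the two counts should match. (i'm assuming the class names really are "discipline-header" / "course-listing", tweak the xpath if not.)

length(disciplines)  # how many headers the xpath found
length(courses)      # how many course tables it found; if these differ, it grabbed extras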
Hey there, Iris_92Paint!
Your web scraping project sounds super interesting. I’ve dabbled in scraping university catalogs before, and it can definitely be tricky to get everything lined up just right.
Have you considered using the structure of the HTML to your advantage? Maybe the discipline headers are nested in a way that you could use to associate them with the correct courses. Something like this might work:
get_course_info <- function(url) {
  page <- read_html(url)
  discipline_sections <- page %>%
    html_nodes('.discipline-section')
  result <- data.frame()
  for (section in discipline_sections) {
    # the header text for this section's discipline
    discipline <- section %>%
      html_node('.discipline-header') %>%
      html_text(trim = TRUE)
    # only the course tables that live inside this same section
    courses <- section %>%
      html_nodes('.course-listing') %>%
      html_table()
    if (length(courses) > 0) {
      section_data <- courses[[1]]
      section_data$discipline <- discipline  # recycled to every row of the table
      result <- bind_rows(result, section_data)
    }
  }
  return(result)
}
This assumes each discipline has its own section in the HTML. It might need some tweaking based on the exact structure of the page you’re working with.
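If you want to sanity-check that assumption before committing to the loop, something like this works — note that '.discipline-section' is just my guess at a selector, so swap in whatever the real page uses:

page <- read_html(url)
# how many discipline sections did the selector actually find?
page %>% html_nodes('.discipline-section') %>% length()
# peek at the nesting inside the first one (html_structure is from xml2)
page %>% html_node('.discipline-section') %>% xml2::html_structure()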
What do you think? Does this approach seem like it might work for your situation? I’m curious to hear more about the specific challenges you’re running into. Have you noticed any patterns in how the disciplines are organized on the page?
I’ve encountered similar issues when scraping course catalogs. One approach that’s worked well for me is to leverage the hierarchical structure of the HTML. Try identifying a common parent element for each discipline and its associated courses.
Here’s a potential modification to your code:
get_course_info <- function(url) {
  page <- read_html(url)
  discipline_sections <- page %>%
    html_nodes('.discipline-container')  # adjust selector to match the real page
  result <- lapply(discipline_sections, function(section) {
    # one discipline label per container
    discipline <- section %>%
      html_node('.discipline-header') %>%
      html_text(trim = TRUE)
    # all course tables inside this container, stacked into one data frame
    courses <- section %>%
      html_nodes('.course-listing') %>%
      html_table() %>%
      bind_rows()
    if (nrow(courses) > 0) {
      courses$discipline <- discipline
    }
    courses
  }) %>% bind_rows()
  return(result)
}
This assumes each discipline and its courses are grouped within a common container element, so you'll need to inspect the page source to find the right selectors. As long as that grouping survives, though, the approach should keep each course matched to its own discipline even when other details of the layout change.
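Once you've settled on selectors, a quick way to check the output — hypothetical usage, with the placeholder URL from your post:

course_data <- get_course_info('https://example-university.edu/courses')
# each discipline should show a plausible number of courses
count(course_data, discipline)
# spot-check: the Polish course should now carry only the Polish discipline
filter(course_data, discipline == 'Polish')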