What's the best way to scrape course info from online learning platforms?

Hey everyone, I’m trying to gather data on online courses for a project. I want to pull course titles, the schools offering them, and brief descriptions from websites like Udacity. Here’s what I’ve tried so far:

import requests
from bs4 import BeautifulSoup

url = 'https://www.example-learning-site.com/courses'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

course_items = soup.find_all('div', class_='course-card')
print(f'Found {len(course_items)} courses')

course_data = []
for item in course_items:
    title = item.find('h3', class_='course-title').text.strip()
    school = item.find('span', class_='school-name').text.strip()
    description = item.find('p', class_='course-description').text.strip()
    course_data.append((title, school, description))

This code kinda works, but it’s not getting all the courses. The website says there are 264 courses, but I’m only grabbing about 225. Plus, I’m getting some errors when I try to extract the text.

Any ideas on how to make this more reliable? Maybe there’s a better way to find the right elements or handle missing data? Thanks for any help!

I’ve dealt with similar issues when scraping course data. Your approach is on the right track, but there are a few tweaks that could help:

  1. Dynamic content: Many sites load courses using JavaScript, so consider using Selenium to render the full page instead of relying solely on requests and BeautifulSoup.

  2. Pagination: You may only be capturing the first page. If the listing is paginated, add logic to step through every page to collect the complete dataset.

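  3. Error handling: The errors are most likely AttributeErrors. find() returns None when a card lacks an element, and calling .text on None fails. Check for None or use try/except blocks to handle missing or malformed elements gracefully.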

  4. Rate limiting: Introduce delays between requests to reduce the risk of being blocked.
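To make point 3 concrete, here’s a minimal sketch of a None-safe lookup. text_or_none is a helper name I made up, and the sample HTML is invented to mimic the class names in your code; it is not the real site’s markup.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


# Invented sample markup: the second card deliberately has no description.
SAMPLE_HTML = """
<div class="course-card">
  <h3 class="course-title">Intro to Python</h3>
  <span class="school-name">Example University</span>
  <p class="course-description">Learn the basics.</p>
</div>
<div class="course-card">
  <h3 class="course-title">Data Structures</h3>
  <span class="school-name">Sample College</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

course_data = []
for card in soup.find_all("div", class_="course-card"):
    # Each field is looked up independently, so one missing tag
    # no longer raises and aborts the whole loop.
    course_data.append((
        text_or_none(card, "h3", "course-title"),
        text_or_none(card, "span", "school-name"),
        text_or_none(card, "p", "course-description"),
    ))

print(course_data)
```

Missing fields come back as None instead of raising, so you can decide afterwards whether to drop those rows or keep them with blanks.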

Here’s a snippet to help you get started with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

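driver = webdriver.Chrome()
try:
    driver.get('https://www.example-learning-site.com/courses')

    # Wait up to 10 seconds for the JavaScript-rendered course cards to appear
    course_items = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.course-card'))
    )
    print(f'Found {len(course_items)} courses')

    # Continue with your scraping logic here
finally:
    # Always close the browser, even if the wait times out
    driver.quit()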

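If the missing courses are rendered by JavaScript, this approach should pick them up. If the listing is also paginated, you’ll still need to step through the pages. Whatever you add, make sure the driver gets closed when you’re done.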
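For the pagination and rate-limiting points, here’s a rough sketch of the loop, kept independent of how each page is fetched. collect_all_pages and fetch_page are names I made up, and I don’t know how the site actually paginates, so the stop condition (an empty page) is an assumption. The fake fetcher just stands in for a real request.

```python
import time
from typing import Callable, List, Tuple

Course = Tuple[str, str, str]  # (title, school, description)


def collect_all_pages(
    fetch_page: Callable[[int], List[Course]],
    delay_seconds: float = 1.0,
    max_pages: int = 100,
) -> List[Course]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    all_courses: List[Course] = []
    for page_number in range(1, max_pages + 1):
        page_courses = fetch_page(page_number)
        if not page_courses:
            break  # assumed end-of-listing signal
        all_courses.extend(page_courses)
        time.sleep(delay_seconds)  # polite pause between requests
    return all_courses


# Stand-in data so the sketch runs without touching a real site.
FAKE_SITE = {
    1: [("Course A", "School X", "First course")],
    2: [("Course B", "School Y", "Second course")],
}


def fake_fetch_page(page_number: int) -> List[Course]:
    return FAKE_SITE.get(page_number, [])


courses = collect_all_pages(fake_fetch_page, delay_seconds=0)
print(f"Found {len(courses)} courses")
```

In a real run, fetch_page would wrap your requests or Selenium code and whatever paging mechanism the site uses (a page query parameter, a "next" link, and so on). The browser’s network tab is the quickest way to find out which one it is.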

Hey Owen_Galaxy! :wave: Sounds like an interesting project you’ve got there. Have you considered using an API instead of web scraping? Many online learning platforms offer APIs that can make data collection a breeze.

But if you’re set on scraping, here’s a thought - what if the courses are loaded dynamically with JavaScript? That could explain why you’re not getting all 264 courses. Maybe try using Selenium to render the full page? It might help capture everything.

Also, I’m curious - what kind of project are you working on with this course data? It sounds pretty cool! Are you building some sort of course aggregator or doing research?

Oh, and don’t forget to double-check the site’s terms of service. Some places aren’t too keen on scraping. Better safe than sorry, right?
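On that note, robots.txt is not the same thing as the terms of service, but it’s a quick machine-readable hint about what a site wants crawlers to leave alone. Here’s a small sketch using Python’s built-in urllib.robotparser. The rules and the my-course-bot user agent below are made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, parsed from a string so nothing is downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /courses
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE_URL = "https://www.example-learning-site.com"
USER_AGENT = "my-course-bot"  # hypothetical name for your scraper

can_scrape_courses = parser.can_fetch(USER_AGENT, BASE_URL + "/courses")
can_scrape_account = parser.can_fetch(USER_AGENT, BASE_URL + "/account/settings")

print("courses allowed:", can_scrape_courses)
print("account allowed:", can_scrape_account)
```

Against a live site you would call parser.set_url(...) with the site’s robots.txt address and then parser.read() instead of parsing a string. You’d still want to read the actual terms of service yourself.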

Let us know how it goes! I’d love to hear if you manage to snag all those elusive courses. Good luck! :blush:

hey, try using an API. many platforms offer APIs that make data collection way easier than scraping. check their docs for info and requirements, since you may need a key. trust me, it’s generally less error-prone than figuring out pagination and js-loaded content.
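if the platform does have an API, the code could look roughly like this. everything specific here is hypothetical: the endpoint URL you’d pass in, the Bearer-token header, and the JSON field names ("courses", "title", "school", "description") are placeholders, so check the real docs before relying on any of it.

```python
import json
import urllib.request


def fetch_payload(api_url, api_key):
    """Download and decode a JSON response (assumes a Bearer-token API)."""
    request = urllib.request.Request(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def parse_courses(payload):
    """Turn a decoded payload into (title, school, description) tuples."""
    courses = []
    for entry in payload.get("courses", []):
        # .get() returns None for absent fields instead of raising KeyError
        courses.append(
            (entry.get("title"), entry.get("school"), entry.get("description"))
        )
    return courses


# Invented payload so the parsing step can be tried without a network call.
SAMPLE_PAYLOAD = json.loads("""
{
  "courses": [
    {"title": "Intro to Python", "school": "Example University",
     "description": "Learn the basics."},
    {"title": "Data Structures", "school": "Sample College"}
  ]
}
""")

print(parse_courses(SAMPLE_PAYLOAD))
```

the parsing is kept separate from the download so you can test it on a saved response first and only hit the live API once the field names are confirmed.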