March 10, 2022
If you want to skip directly to the code, visit this GitHub repository.
For about half a year now, I have been a fan of writing web scrapers and automating them with GitHub’s powerful Actions feature. The free GitHub feature — originally created for engineers to build pipelines for testing, releasing and deploying software — is an efficient way for journalists to run Python web scrapers on a schedule, without having to set up a server.
Until a couple of months ago, however, I couldn’t get GitHub Actions to set up a WebDriver to scrape with Selenium.
While requests allows us to send HTTP requests to get data from websites, selenium truly shines when we need to push buttons, fill in forms, scroll and take screenshots in a browser window.
As journalism educator Jonathan Soma explains in this video, Selenium is a browser automation software to control a web browser, such as Chrome or Firefox. It uses a WebDriver to pass commands to — and control — the web browser. So, the first step in building a Selenium web scraper is to set up a WebDriver.
There are two problems, headaches rather, with this approach: you have to hunt down a WebDriver that matches your browser version, and you have to download a fresh one every time your browser updates. A driver sitting on your own machine also does nothing for a script running on GitHub's servers.
webdriver-manager solves both. It automatically manages your WebDriver to make it match your browser version. It's slick because you don't have to download anything from the internet yourself, and you can run the script with GitHub Actions.
An example: for a side project, I scrape data on journalists killed in the line of duty every day. The data is collected by the Committee to Protect Journalists, and lives in a tabular format spread across multiple pages. The URL does not change when a user switches between pages, so requests alone wouldn't do the trick.
Here's how I built the scraper using Python, Selenium, webdriver-manager and GitHub Actions:
In the Python script, I imported all the libraries I was going to need. You may have to use pip to install webdriver-manager.
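If any of these libraries are missing, a one-line install (assuming pip is available, and using the same package list as the workflow further down) looks like this:

```shell
python -m pip install pandas selenium requests bs4 webdriver-manager
```
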
import pandas as pd
from bs4 import BeautifulSoup
import requests
import selenium
import csv
import time
Here, I have used a WebDriver for Chrome.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
Headless means that the WebDriver will not fire up a GUI. If you’re testing your code, you may want to comment it out.
chrome_options = Options()
chrome_options.add_argument("--headless")
In this step, webdriver-manager will install the ChromeDriver, and options will be set to the options we configured above. If you turned off headless, you will see a browser instance fire up in this step. (One caveat: passing the driver path as the first positional argument works in Selenium 3; newer Selenium 4 releases expect it wrapped in a Service object instead.)
driver = webdriver.Chrome(
    ChromeDriverManager().install(),
    options=chrome_options
)
driver.get("https://cpj.org/data/killed")
The usual stuff …
list_of_rows = []
counter = 0

while counter < 100:
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find("table", {"class": "table js-report-builder-table"})
    for row in table.find_all('tr'):
        list_of_cells = []
        for cell in row.find_all('td'):
            if cell.find('a'):
                list_of_cells.append(cell.find('a')['href'])
            text = cell.text.strip()
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    counter = counter + 1
    next_button = driver.find_element_by_xpath('/html/body/div[1]/div/div/div[2]/div/div[1]/div/nav/ul/li[8]/a')
    next_button.click()
    time.sleep(1)
data = pd.DataFrame(list_of_rows, columns=["link","name", "organization", "date", "location","killed","type_of_death", ""]).dropna().drop_duplicates()
data.to_csv("data.csv",index=False)
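The parsing-and-cleanup pattern above can be tried without Selenium on a static snippet of HTML. The table markup and column names below are made up for illustration; the loop and the dropna()/drop_duplicates() cleanup mirror the scraper:

```python
import pandas as pd
from bs4 import BeautifulSoup

# A stand-in for driver.page_source: a tiny table with one duplicate row
# (hypothetical markup, not CPJ's actual page).
html = """
<table class="demo">
  <tr><td><a href="/people/a">Jane Doe</a></td><td>2021-05-01</td></tr>
  <tr><td><a href="/people/b">John Roe</a></td><td>2021-06-12</td></tr>
  <tr><td><a href="/people/b">John Roe</a></td><td>2021-06-12</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
list_of_rows = []
for row in soup.find("table").find_all("tr"):
    list_of_cells = []
    for cell in row.find_all("td"):
        if cell.find("a"):  # keep the link target as its own column
            list_of_cells.append(cell.find("a")["href"])
        list_of_cells.append(cell.text.strip())
    list_of_rows.append(list_of_cells)

# Same cleanup as the scraper: drop incomplete rows and repeats.
data = pd.DataFrame(list_of_rows, columns=["link", "name", "date"]).dropna().drop_duplicates()
print(data)
```

The duplicate row disappears, leaving two clean rows of link, name and date.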
I have named this script file scraper.py and added it to a GitHub repository. This is what my Actions workflow looks like. The YAML file is stored in the .github/workflows directory as main.yml, and the Action is scheduled to run every day at 8 a.m. UTC.
name: Scrape

on:
  push:
  workflow_dispatch:
  schedule:
    - cron: "0 8 * * *" # 8 a.m. UTC every day

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      # Step 1: Prepare the environment
      - name: Check out this repo
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      # Step 2: Install requirements, so the Python script can run
      - name: Install requirements
        run: python -m pip install pandas selenium requests bs4 webdriver-manager
      # Step 3: Run the Python script
      - name: Run scraper
        run: python scraper.py
      # Step 4: Commit and push
      - name: Commit and push
        run: |-
          git pull
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
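One detail worth noticing in the commit step: git commit exits with an error when there is nothing new to commit, which would fail the whole job, so the || exit 0 guard turns that case into a success. The same pattern can be seen in a plain shell, using a hypothetical throwaway repository:

```shell
# Create a throwaway repository (hypothetical temp path) to show the guard.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Nothing is staged, so `git commit` fails with "nothing to commit";
# the `||` branch turns that failure into a clean exit.
git add -A
git commit -m "Latest data" || echo "nothing to commit, skipping"
```

In the workflow, `exit 0` plays the role of the echo: the step succeeds even on days when the data hasn't changed.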
If you're looking to learn more about GitHub Actions, check out this lesson.
In conclusion, if you use webdriver-manager, you may never have to download a WebDriver by hand again, or re-download it every time your browser updates.
Happy web scraping & automating! 🤖
I hope this is helpful. If I’ve gotten something wrong — or if you think this method sucks — please do let me know.