Automating Web Scraping with GPT

Media Party 2023

James Turk

My Background

  • Founded/Led Open States https://openstates.org

    • bills, votes, legislators, committees, events for 50 states, DC, Puerto Rico
    • ~250 actively maintained scrapers
    • 13+ years of maintenance
  • Assistant Clinical Professor, University of Chicago (as of 2022)

  • Not a machine learning expert.

Audience Q’s

  • Who has written a web scraper?
  • Who has maintained a web scraper for more than a year?
    • …more than 5?
    • …more than 10?
  • Frameworks?

Today’s Talk

LLMs like GPT excel at “translation” tasks.

  • English → Spanish
  • English → Python
  • Python → JavaScript

Why not HTML → JSON?

Very Early Proof of Concept

  • Initial proof of concept in February, but GPT-3 token limits seemed too low to be useful.
  • March 14, 2023: GPT-4 released with an 8k token limit and the promise of 32k around the corner.
  • March/April 2023: Published my experiences with this approach.
  • May 2023: Working on various applications and refining the approach.
  • Summer 2023: Focusing on gathering more real-world data to base future improvements on.

Not Covered Today

Today we’ll focus on data extraction from HTML.

This is not about using GPT to help you get around captcha, rate limiting, IP blocking, etc.

Web Scraping

GPT

Using GPT for Web Scraping

What’s Next?

Demo

Discussion

What is Web Scraping?

Can be thought of as two parts:

  1. Make a request to a website
  2. Parse the response (e.g. HTML into a DOM) and extract the data you want

Making a Request

import requests

response = requests.get("https://www.example.com")

Parsing the Response & Extracting Data

import requests
import lxml.html

response = requests.get("https://www.example.com")

# process response
tree = lxml.html.fromstring(response.text)
title = tree.xpath("//title/text()")[0]
description = tree.cssselect("meta[name=description]")[0].get("content")

Aside: What isn’t covered?

Sometimes making the request is the hardest part:

  • Authentication (cookies, sessions, etc)
  • Captcha/Rate Limiting/Blocking
  • JavaScript-heavy sites might string together content from many requests.

Let’s Write a Scraper

https://scrapple.fly.dev/staff

  • Two kinds of pages: list and detail.
  • For list pages we’ll just grab the links & check if there’s a next page.
  • For detail pages we’ll grab the name, title, and bio.

Writing the Scraper: List Pages

<table id="employees" style="max-height: 100%;">
      <thead>
        <tr>
          <th>First Name</th>
          <th>Last Name</th>
          <th>Position Name</th>
          <th>&nbsp;</th>
        </tr>
      </thead>
      <tbody>
      
      <tr>
        <td>Eric</td>
        <td>Sound</td>
        <td>Manager</td>
        <td><a href="/staff/52">Details</a></td>
      </tr>

Writing the Scraper: List Pages

import requests
import lxml.html


def get_links():
    url = "https://scrapple.fly.dev/staff"
    links = []

    while True:
        # make the request and parse the response
        response = requests.get(url)
        tree = lxml.html.fromstring(response.text)
        tree.make_links_absolute(url)

        # grab the links
        links += tree.xpath("//a[contains(@href, '/staff/')]/@href")

        # check if there's a next page
        try:
            url = tree.xpath("//a[contains(text(), 'Next')]/@href")[0]
        except IndexError:
            break

    return links


links = get_links()
print(len(links), "detail links collected")
45 detail links collected

Writing the Scraper: Detail Pages

<h2 class="section">Employee Details for Eric Sound</h2>
<div class="section">
    <dl>
    <dt>Position</dt>
    <dd id="position">Manager</dd>
    <dt>Status</dt>
    <dd id="status">Current</dd>
    <dt>Hired</dt>
    <dd id="hired">3/6/1963</dd>
    </dl>
</div>

Writing the Scraper: Detail Pages

def get_details(url):
    response = requests.get(url)
    tree = lxml.html.fromstring(response.text)

    name = tree.xpath("//h2/text()")[0].replace("Employee Details for ", "")
    position = tree.xpath("//dd[@id='position']/text()")[0]
    status = tree.xpath("//dd[@id='status']/text()")[0]
    hired_date = tree.xpath("//dd[@id='hired']/text()")[0]

    return {
        "name": name,
        "position": position,
        "status": status,
        "hired_date": hired_date,
    }


print(get_details(links[0]))
print("...")
print(get_details(links[-1]))
{'name': 'Eric Sound', 'position': 'Manager', 'status': 'Current', 'hired_date': '3/6/1963'}
...
{'name': 'Oscar Ego', 'position': 'Outreach Coordinator', 'status': 'Current', 'hired_date': '10/31/1938'}

Run It Again

This is the most overlooked part of web scraping: you’re going to need it again.

  • New info is added.
  • Noticed a bug.
  • Need to compare transformed data to original data.

Breaking the Page

Next week, the page changes!

What was:

<h2 class="section">Employee Details for Eric Sound</h2>
<div class="section">
    <dl>
    <dt>Position</dt>
    <dd id="position">Manager</dd>
    <dt>Status</dt>
    <dd id="status">Current</dd>
    <dt>Hired</dt>
    <dd id="hired">3/6/1963</dd>
    </dl>
</div>

Breaking the Page

Becomes:

<table>
<thead>
    <tr>
    <th>Position</th>
    <th>Status</th>
    <th>Hired</th>
    </tr>
</thead>
<tbody>
    <tr>
    <td>Manager</td>
    <td>Current</td>
    <td>3/6/1963</td>
    </tr>
</tbody>
</table>

Let’s fix this every week forever

So we fix the code, and then there’s a third version…
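
A fix for the new layout might look like the sketch below (assuming the <h2> heading is unchanged and the table keeps its columns in Position/Status/Hired order):

import requests
import lxml.html


def get_details(url):
    response = requests.get(url)
    tree = lxml.html.fromstring(response.text)

    name = tree.xpath("//h2/text()")[0].replace("Employee Details for ", "")
    # the <dl> is gone; the values now live in table cells, in column order
    position, status, hired_date = tree.xpath("//table/tbody/tr/td/text()")[:3]

    return {
        "name": name,
        "position": position,
        "status": status,
        "hired_date": hired_date,
    }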

Let’s fix this every week forever

  • We’re stuck choosing between general selectors that can be too broad and specific selectors that will break whenever the page changes.
  • No single right answer, but for most use cases, breakage is preferable to bad data.
  • You could do this for 13 years and still not be done…

Web Scraping

GPT

Using GPT for Web Scraping

What’s Next?

Demo

Discussion

What is GPT?

  • Generative Pre-trained Transformer: the key breakthrough was the advent of the transformer model.
  • Transformer models use a mechanism called ‘attention’ to understand the context and relationships between words in a sentence, enabling them to generate coherent, human-like text.
  • Pre-training means the model has already been trained on a large corpus of text; further training (fine-tuning) is possible but not required.

Key Terms

  • Attention
  • Tokens
  • Training Parameters
  • Temperature

Attention

“Attention Is All You Need” - Vaswani et al.

Attention is a mechanism that allows the model to focus on specific parts of the input.

This makes it possible to use the model for tasks like translation, summarization, and question answering.

Attention is a key component of the transformer model, but is limited in the length of dependencies it can learn.

Tokens

https://platform.openai.com/tokenizer

Input text → tokenized text (token count):

  • "This is sample text." → ["This", " is", " sample", " text", "."] (5 tokens)
  • "Tokenization favors common English words." → ["Token", "ization", " favors", " common", " English", " words"] (6 tokens)
  • "2q09o3pdsjolikfj092qo3" → ["2", "q", "09", "o", "3", "pd", "s", "j", "ol", "ik", "f", "j", "09", "2", "q", "o", "3"] (17 tokens)
  • "<b>This is sample text.</b>" → ["<", "b", ">", "This", " is", " sample", " text", ".</", "b", ">"] (10 tokens)

Attention cost scales with the square of the number of tokens, so increasing the number of tokens increases the computational cost quadratically.
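
As an illustration (not from the original slides), OpenAI’s tiktoken library can count tokens before a request is sent, which helps estimate cost and stay under the limit:

import tiktoken

# the encoding used by gpt-3.5-turbo and gpt-4
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

html = "<b>This is sample text.</b>"
tokens = enc.encode(html)
print(len(tokens))  # number of tokens this snippet will consume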

Performance (Model Properties)

  • More parameters: more accurate modeling, but slower training and inference.
  • More training data: more accurate modeling, but slower training.
  • Larger model: more accurate modeling, but slower training and significantly increased costs.

Performance (Request Properties)

  • Higher temperature: more creative, but less accurate.
  • More tokens: slower responses, higher costs

OpenAI

This talk is focused on OpenAI because they are the clear leader in this space today.

  • GPT-4 is vastly superior to other offerings right now.
  • They have a public API (though there’s a waitlist for GPT-4, 3.5-turbo is available now).
  • Not an endorsement of their business
    • Means of acquiring data.
    • Leadership with questionable backgrounds.
    • Dropped the “open” part of their name quickly.

GPT-3.5 vs GPT-4

Model           Parameters    Token Limit                         Cost Per 1k Tokens
GPT-3.5 Turbo   175 billion   4,096                               $0.002
GPT-4           ~1 trillion   8,192 (32k version coming “soon”)   $0.03-0.12

OpenAI’s Chat Completions API

import openai

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)
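
Continuing from the snippet above, the same endpoint accepts request-level parameters that map to the “request properties” mentioned earlier; temperature and max_tokens are real parameters of this API, and the values here are just illustrative:

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,    # lower temperature: more deterministic output
    max_tokens=256,   # caps the length (and therefore the cost) of the response
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)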

Beyond OpenAI

Model              Parameters         Training Data         Token Limit
Anthropic Claude   52 billion         400 billion tokens    100k
Google PaLM 2      340 billion        3.6 trillion tokens   4,096
Facebook LLaMA     Up to 65 billion   1.4 trillion tokens   2,048

https://www.anthropic.com/index/100k-context-windows

Web Scraping

GPT

Using GPT for Web Scraping

What’s Next?

Demo

Discussion

Just ask GPT to generate the code

Models like Codex and Copilot are trained on code, and can generate code that is syntactically correct.

GPT-4 is a decent coder itself.

“Please write a Python scraper to extract data from https://scrapple.fly.dev/staff and return it as JSON”

But…

import requests
from bs4 import BeautifulSoup

response = requests.get("https://scrapple.fly.dev/staff")
soup = BeautifulSoup(response.content, "html.parser")

# Find all staff members
staff_members = soup.find_all("div", class_="staff-member")

results = []

# Iterate over each staff member and extract the desired data
for staff_member in staff_members:
    name = staff_member.find("h2", class_="staff-member-name").text.strip()
    position = staff_member.find("h3", class_="staff-member-position").text.strip()
    bio = staff_member.find("div", class_="staff-member-bio").text.strip()

    results.append({"name": name, "position": position, "bio": bio})

GPT-3.5 generates the code with no complaints.

GPT-4 will at least warn that it can’t access the URL, but then generate nearly identical code.

Asking GPT to generate the code, with context

As you may have guessed, the solution to this is to add the HTML as context.

  • Need to fit input & output within token limit.
  • The generated code is often correct for a given page, but not robust.
  • Do you trust GPT to generate code that’s secure unsupervised? You shouldn’t.

Asking GPT to generate the selectors

Why not have GPT generate the XPath or CSS selectors?

  • Robustness is still a major issue; a single page will often yield overly specific selectors.
  • XPath/CSS selectors are only enough in the simplest cases (see the sketch after the snippet below).
<h2>Employee Details for Eric Sound</h2>
<span>Phone</span><span>555-555-5555</span>
<span>Fax</span><span>555-555-5555</span>
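
For markup like this, a selector alone can’t tell the phone from the fax; you still need pairing logic. A quick sketch of that extra work:

import lxml.html

snippet = """
<h2>Employee Details for Eric Sound</h2>
<span>Phone</span><span>555-555-5555</span>
<span>Fax</span><span>555-555-5555</span>
"""
tree = lxml.html.fromstring(snippet)
spans = tree.xpath("//span/text()")
# a selector returns all four spans; pairing labels with values is extra logic
fields = dict(zip(spans[::2], spans[1::2]))
print(fields)  # {'Phone': '555-555-5555', 'Fax': '555-555-5555'}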

Asking GPT to extract the data

We’re passing the context anyway, we can ask GPT to extract the data for us.

“When provided with HTML, return the equivalent JSON in the format {'name': '', 'position': '', 'hired': 'YYYY-MM-DD'}”

Our Scraper Revisited

import requests
import openai

def get_details(url):
    response = requests.get(url)

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
                {"role": "system", "content": "When provided with HTML, return the equivalent JSON in the format {'name': '', 'position': '', 'hired': 'YYYY-MM-DD'}"},
                {"role": "user", "content": response.text}
            ]
    )
    print(resp.choices[0].message.content)

get_details("https://scrapple.fly.dev/staff/3")
get_details("https://scrapple.fly.dev/staff/4?style=new")
get_details("https://scrapple.fly.dev/staff/5?style=experimental")
{"name": "Christopher Edwards", "position": "Help Desk", "hired": "1948-10-17"}
{"name": "Ashley Taylor", "position": "Security Specialist", "hired": "1948-10-18"}
{'name': 'Michael Hernandez', 'position': 'Security Administrator', 'hired': '2019-02-15'}

What Just Happened?

  • We passed the HTML as context as well as a JSON “template”.
  • GPT extracted the information, including the name, which required both HTML parsing and NLP.
  • GPT also reformatted the date for us since we asked for YYYY-MM-DD.
  • It returned valid JSON, which we can use directly (except the last one, which came back with single quotes!).
  • Each request cost about $0.002. (2/10 of a cent)
  • Each of the three page variants worked, despite different HTML structure.

Prompt Engineering

Sometimes it returns additional content:

(Note: The 'status' field from the HTML is not included in the JSON format because it doesn't match any of the specified keys.)

This lets us know we left data on the table, which is cool. But it also breaks the JSON parsing.

A more sophisticated prompt can be used to control these kinds of cases.
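
For example, a stricter system prompt (illustrative wording, not the exact prompt from the talk) might read:

“When provided with HTML, return only the equivalent JSON in the format {"name": "", "position": "", "hired": "YYYY-MM-DD"}. Respond with valid JSON and nothing else: no explanations, no notes, no extra keys.”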

Prompt Engineering

You can also add additional context to coax GPT into returning the data you want.

  • “Ensure that JSON uses double quotes.”
  • “If data is missing from the HTML, represent it with a null.”
  • “The user’s name is in the <h2> tag.”
  • “Ignore positions noted as ‘vacant’.”

Building Guardrails & Constraints

  • LLMs are a bit unpredictable. Much of this can be controlled with the aforementioned temperature parameter. (temperature=0 is mostly deterministic, which is what we want here)
  • We can also do some validation on the returned data and kick the results back to GPT if they don’t match our schema, as sketched below. (This approach was suggested by a few people on Mastodon, and has since been incorporated into the scrapeghost library.)
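
A minimal sketch of that validate-and-retry loop (the schema and wording are illustrative, not scrapeghost’s actual implementation):

import json
import openai

EXPECTED_KEYS = {"name", "position", "hired"}
SYSTEM_PROMPT = (
    "When provided with HTML, return only valid JSON in the format "
    '{"name": "", "position": "", "hired": "YYYY-MM-DD"}.'
)


def extract_with_retry(html, max_retries=2):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": html},
    ]
    for _ in range(max_retries + 1):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            temperature=0,  # keep extraction as deterministic as possible
            messages=messages,
        )
        content = resp.choices[0].message.content
        try:
            data = json.loads(content)
            if set(data) == EXPECTED_KEYS:
                return data
            problem = f"keys {sorted(data)} did not match {sorted(EXPECTED_KEYS)}"
        except json.JSONDecodeError as e:
            problem = f"invalid JSON: {e}"
        # kick the bad response back and ask the model to correct it
        messages.append({"role": "assistant", "content": content})
        messages.append(
            {"role": "user", "content": f"That response was rejected ({problem}). "
                                        "Reply with only corrected, valid JSON."}
        )
    raise ValueError("no valid JSON after retries")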

Benefits

When using this direct approach:

  • Many page changes will be handled automatically.
  • Pages are not required to be uniform. You can use the same model to scrape many different pages.

Drawbacks

  • Each request requires a separate API call.
    • This can be slow and expensive.
  • You are limited to the data that is part of the initial response.
  • The token limit means you’ll need to tailor your approach for large pages.
    • Not great for list pages given the 4k token limit.

scrapeghost

https://jamesturk.github.io/scrapeghost/

from scrapeghost import SchemaScraper
scrape_legislators = SchemaScraper(
  schema={
      "name": "string",
      "url": "url",
      "district": "string",
      "party": "string",
      "photo_url": "url",
      "offices": [{"name": "string", "address": "string", "phone": "string"}],
  }
)

resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
resp.data

scrapeghost in Action

{"name": "Emanuel 'Chris' Welch",
 "url": "https://www.ilga.gov/house/Rep.asp?MemberID=3071",
 "district": "7th", "party": "D", 
 "photo_url": "https://www.ilga.gov/images/members/{5D419B94-66B4-4F3B-86F1-BFF37B3FA55C}.jpg",
   "offices": [
     {"name": "Springfield Office",
      "address": "300 Capitol Building, Springfield, IL 62706",
       "phone": "(217) 782-5350"},
     {"name": "District Office",
      "address": "10055 W. Roosevelt Rd., Suite E, Westchester, IL 60154",
       "phone": "(708) 450-1000"}
   ]}

Bonus CLI

$ scrapeghost https://www.ncleg.gov/Members/Biography/S/436  \
        --schema "{'first_name': 'str', 'last_name': 'str',
        'photo_url': 'url', 'offices': [] }" \
        --css div.card | python -m json.tool
{
    "first_name": "Gale",
    "last_name": "Adcock",
    "photo_url": "https://www.ncleg.gov/Members/MemberImage/S/436/Low",
    "offices": [
        {
            "type": "Mailing",
            "address": "16 West Jones Street, Rm. 1104, Raleigh, NC 27601"
        },
        {
            "type": "Office Phone",
            "phone": "(919) 715-3036"
        }
    ]
}

Features

  • Automatic token reduction
  • Support for giving selector hints to further reduce tokens
  • Built-in error handling/recovery logic
  • Optional result validation, including a hallucination check: was this content really on the page? (sketched below)
  • Experimental pagination support & chunking of large pages
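
One simple form of hallucination check is to verify that extracted string values actually appear somewhere in the page text; a sketch (not scrapeghost’s exact implementation):

import lxml.html


def possibly_hallucinated(html, data):
    """Return extracted string values that don't appear anywhere in the page text."""
    page_text = lxml.html.fromstring(html).text_content()
    return [
        value
        for value in data.values()
        if isinstance(value, str) and value not in page_text
    ]

Note that legitimately reformatted values (like the YYYY-MM-DD dates earlier) would be flagged by a check this naive, which is one reason the validation is optional.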

Potential Future Improvements

  • scrapeghost already attempts automatic token reduction, but this could be improved (a sketch of the basic idea follows).
    • In March it wasn’t clear if this would be solved by a 32k+ model, but it looks like for the foreseeable future, 4-8k is a practical limit.
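
For illustration, token reduction along these lines can be as simple as stripping markup that rarely carries data (a sketch of the idea, not scrapeghost’s exact implementation):

import lxml.html
from lxml.etree import Comment, strip_elements


def reduce_tokens(html):
    """Strip markup that rarely carries data before sending HTML to the model."""
    tree = lxml.html.fromstring(html)
    # drop scripts, styles, and comments entirely
    strip_elements(tree, Comment, "script", "style", "noscript", with_tail=False)
    # remove attributes (class, style, data-*) that inflate token counts
    for el in tree.iter():
        if not isinstance(el.tag, str):  # skip processing instructions, etc.
            continue
        for attr in list(el.attrib):
            if attr not in ("href", "src", "id"):
                del el.attrib[attr]
    return lxml.html.tostring(tree, encoding="unicode")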

To figure out a lot of this, we need a corpus of scraped data so we can see how different models, approaches to token reduction, etc. perform.

Web Scraping

GPT

Using GPT for Web Scraping

What’s Next?

Demo

Discussion

Ethical Concerns

  • Making it easier to scrape data isn’t a universally good thing.
  • How common is “hallucination” in this context? How to avoid/detect?
  • Prompt injection attacks.
  • Vendor lock-in.

Enormous Attention Window

Enormous (effectively unlimited) attention windows are likely coming.

  • Reduce need for token reduction
  • A different approach for list pages
  • Multi-page scraping in a single request?

GPT-3.5-equivalent on a laptop

  • Lower costs and latency
  • More control over the model
  • Eliminate vendor dependencies

Fine-tuning

  • The original plan was to fine-tune using scraped data, but so far that hasn’t been needed.
  • Combined with other techniques, might be able to get a lot more out of smaller models.
  • Building corpus of scraped data for future training.

Demo

Let’s scrape https://mediapartychicago2023.sched.com/

Discussion

https://mastodon.social/@jamesturk

contact@jamesturk.net

https://jamesturk.net/presentations/scrapeghost-mediaparty-2023/