Media Party 2023
Founded/Led Open States https://openstates.org
Assistant Clinical Professor, University of Chicago (as of 2022)
Not a machine learning expert.
LLMs like GPT excel at “translation” tasks.
Why not HTML → JSON?
We’ll be focused today on data extraction from HTML.
This is not about using GPT to help you get around captcha, rate limiting, IP blocking, etc.
Can be thought of as two parts:
Sometimes making the request is the hardest part:
https://scrapple.fly.dev/staff
import requests
import lxml.html


def get_links():
    url = "https://scrapple.fly.dev/staff"
    links = []
    while True:
        # make the request and parse the response
        response = requests.get(url)
        tree = lxml.html.fromstring(response.text)
        tree.make_links_absolute(url)
        # grab the links
        links += tree.xpath("//a[contains(@href, '/staff/')]/@href")
        # check if there's a next page
        try:
            url = tree.xpath("//a[contains(text(), 'Next')]/@href")[0]
        except IndexError:
            break
    return links


links = get_links()
print(len(links), "detail links collected")
45 detail links collected
def get_details(url):
    response = requests.get(url)
    tree = lxml.html.fromstring(response.text)
    name = tree.xpath("//h2/text()")[0].replace("Employee Details for ", "")
    position = tree.xpath("//dd[@id='position']/text()")[0]
    status = tree.xpath("//dd[@id='status']/text()")[0]
    hired_date = tree.xpath("//dd[@id='hired']/text()")[0]
    return {
        "name": name,
        "position": position,
        "status": status,
        "hired_date": hired_date,
    }
print(get_details(links[0]))
print("...")
print(get_details(links[-1]))
{'name': 'Eric Sound', 'position': 'Manager', 'status': 'Current', 'hired_date': '3/6/1963'}
...
{'name': 'Oscar Ego', 'position': 'Outreach Coordinator', 'status': 'Current', 'hired_date': '10/31/1938'}
This is the most overlooked part of web scraping: you are going to need it again.
Next week, the page changes!
What was:
Becomes:
So we fix the code, and then there’s a third version…
“Attention Is All You Need” - Vaswani et al.
Attention is a mechanism that allows the model to focus on specific parts of the input.
This makes it possible to use the model for tasks like translation, summarization, and question answering.
Attention is a key component of the transformer model, but the length of the dependencies it can learn is limited by the model’s context window.
https://platform.openai.com/tokenizer
Input Text | Tokenized Text | Number of Tokens
---|---|---
This is sample text. | [“This”, ” is”, ” sample”, ” text”, “.”] | 5
Tokenization favors common English words. | [“Token”, “ization”, ” favors”, ” common”, ” English”, ” words”] | 6
2q09o3pdsjolikfj092qo3 | [“2”, “q”, “09”, “o”, “3”, “pd”, “s”, “j”, “ol”, “ik”, “f”, “j”, “09”, “2”, “q”, “o”, “3”] | 17
&lt;b&gt;This is sample text.&lt;/b&gt; | [“<”, “b”, “>”, “This”, ” is”, ” sample”, ” text”, “.</”, “b”, “>”] | 10
Attention cost scales with the square of the number of tokens, so doubling the input length quadruples the computational cost.
This talk is focused on OpenAI because they are the clear leader in this space today.
Model | Parameters | Token Limit | Cost Per 1k Tokens
---|---|---|---
GPT-3.5 Turbo | 175 billion | 4,096 | $0.002
GPT-4 | ~1 trillion | 8,192 (32k version coming “soon”) | $0.03-0.12
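Back-of-the-envelope math from the table above (a sketch only; real billing prices prompt and completion tokens separately):

```python
def estimate_cost(num_tokens: int, price_per_1k: float) -> float:
    """Rough API cost estimate: tokens / 1000 * price per 1k tokens."""
    return num_tokens / 1000 * price_per_1k


# a full 4,096-token GPT-3.5 Turbo request costs well under a cent
print(estimate_cost(4096, 0.002))  # → 0.008192
# the same tokens at GPT-4's low-end rate cost roughly 15x more
print(estimate_cost(4096, 0.03))   # → 0.12288
```

Pennies per page sounds cheap until you multiply by a few hundred thousand pages.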
import openai

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
)
Model | Parameters | Training Data | Token Limit
---|---|---|---
Anthropic Claude | 52 billion | 400 billion tokens | 100k
Google PaLM 2 | 340 billion | 3.6 trillion tokens | 4,096
Facebook LLaMA | Up to 65 billion | 1.4 trillion tokens | 2,048
Models like Codex and Copilot are trained on code, and can generate code that is syntactically correct.
GPT-4 is a decent coder itself.
“Please write a Python scraper to extract data from https://scrapple.fly.dev/staff and return it as JSON”
import requests
from bs4 import BeautifulSoup

response = requests.get("https://scrapple.fly.dev/staff")
soup = BeautifulSoup(response.content, "html.parser")

# Find all staff members
staff_members = soup.find_all("div", class_="staff-member")

results = []
# Iterate over each staff member and extract the desired data
for staff_member in staff_members:
    name = staff_member.find("h2", class_="staff-member-name").text.strip()
    position = staff_member.find("h3", class_="staff-member-position").text.strip()
    bio = staff_member.find("div", class_="staff-member-bio").text.strip()
    results.append({"name": name, "position": position, "bio": bio})
GPT-3.5 generates the code with no complaints.
GPT-4 will at least warn that it can’t access the URL, but then generate nearly identical code.
As you may have guessed, the solution to this is to add the HTML as context.
Why not have GPT generate the XPath or CSS selectors?
We’re passing the context anyway, so we can ask GPT to extract the data for us directly.
“When provided with HTML, return the equivalent JSON in the format {'name': '', 'position': '', 'hired': 'YYYY-MM-DD'}”
import requests
import openai


def get_details(url):
    response = requests.get(url)
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "When provided with HTML, return the equivalent JSON in the format {'name': '', 'position': '', 'hired': 'YYYY-MM-DD'}"},
            {"role": "user", "content": response.text},
        ],
    )
    print(resp.choices[0].message.content)


get_details("https://scrapple.fly.dev/staff/3")
get_details("https://scrapple.fly.dev/staff/4?style=new")
get_details("https://scrapple.fly.dev/staff/5?style=experimental")
{"name": "Christopher Edwards", "position": "Help Desk", "hired": "1948-10-17"}
{"name": "Ashley Taylor", "position": "Security Specialist", "hired": "1948-10-18"}
{'name': 'Michael Hernandez', 'position': 'Security Administrator', 'hired': '2019-02-15'}
Sometimes it returns additional content:
(Note: The 'status' field from the HTML is not included in the JSON format because it doesn't match any of the specified keys.)
This lets us know we left data on the table, which is cool. But it also breaks the JSON parsing.
A more sophisticated prompt can be used to control these kinds of cases.
You can also add additional context to coax GPT into returning the data you want.
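One defensive option, sketched here with only the standard library (not scrapeghost's actual implementation), is to pull the first {...} span out of the response before parsing, so a trailing note no longer breaks `json.loads`:

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the outermost {...} object in a model response, ignoring extra prose."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start : end + 1])


reply = (
    '{"name": "Ashley Taylor", "position": "Security Specialist", "hired": "1948-10-18"}\n'
    "(Note: The 'status' field from the HTML is not included.)"
)
print(extract_json_object(reply)["name"])  # → Ashley Taylor
```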
scrapeghost
(This approach is packaged up in the scrapeghost library.)
https://jamesturk.github.io/scrapeghost/
from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)
resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
resp.data
{"name": "Emanuel 'Chris' Welch",
 "url": "https://www.ilga.gov/house/Rep.asp?MemberID=3071",
 "district": "7th",
 "party": "D",
 "photo_url": "https://www.ilga.gov/images/members/{5D419B94-66B4-4F3B-86F1-BFF37B3FA55C}.jpg",
 "offices": [
   {"name": "Springfield Office",
    "address": "300 Capitol Building, Springfield, IL 62706",
    "phone": "(217) 782-5350"},
   {"name": "District Office",
    "address": "10055 W. Roosevelt Rd., Suite E, Westchester, IL 60154",
    "phone": "(708) 450-1000"}
 ]}
scrapeghost already attempts automatic token reduction, but this could be improved.
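To illustrate the idea (a simplified stdlib-only sketch, not scrapeghost's actual cleaning code): dropping `<script>`/`<style>` blocks and keeping only visible text already shrinks the prompt considerably.

```python
from html.parser import HTMLParser


class ScriptStyleStripper(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = ScriptStyleStripper()
    parser.feed(html)
    return " ".join(parser.parts)


print(visible_text("<div><script>var x = 1;</script><p>Hello</p><p>World</p></div>"))  # → Hello World
```

The trade-off: stripping all markup also discards structure the model might need (ids, classes, table boundaries), so real token reduction has to be more selective.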
To figure out a lot of this, we need a corpus of scraped data so we can see how different models, approaches to token reduction, etc. perform.
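The scoring side of such a corpus could be as simple as per-field accuracy against hand-checked gold records (a hypothetical sketch; any real benchmark would need fuzzier matching for dates, whitespace, etc.):

```python
def field_accuracy(gold: dict, predicted: dict) -> float:
    """Fraction of gold fields the scraper got exactly right."""
    if not gold:
        return 0.0
    correct = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return correct / len(gold)


gold = {"name": "Ashley Taylor", "position": "Security Specialist", "hired": "1948-10-18"}
pred = {"name": "Ashley Taylor", "position": "Security Specialist", "hired": "10/18/1948"}
print(field_accuracy(gold, pred))  # 2 of 3 fields match exactly
```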
Enormous (effectively unlimited) attention windows are likely coming.
Let’s scrape https://mediapartychicago2023.sched.com/
https://mastodon.social/@jamesturk
contact@jamesturk.net
https://jamesturk.net/presentations/scrapeghost-mediaparty-2023/