ExperimentDay 23experimentday-23scraperplaywrightautomationinfrastructuredata

Day 23: The Scraper — Playwright on the VPS, 49 Sites in One Session

We stopped visiting websites manually and built a scraper. Playwright headless running on the VPS, pulling raw text and screenshots from every qualified lead in one session. 49 scraped. 6 were already dead. One crucial lesson: nohup doesn't survive an SSH disconnect. tmux does.

February 23, 20263 min readUpdated February 24, 2026

The manual review process produced good data. It was also slow — a few minutes per lead, and you had to be at the keyboard. Day 23 was about automating the collection layer, so the judgment layer could move faster.

Day 23 Metrics

Metric	Value
Sites scraped	49
Sites with structured data extracted	52
Sites with errors (dead or SSL failures)	6
Screenshots captured	49
Owner names surfaced	46
Services lists populated	41
Certifications found	11
Revenue	$0

The Architecture Decision

The temptation was to run the scraper locally — on the Mac, through the browser. The problem: local scraping blocks the dev machine, is dependent on it being awake and available, and produces data that then has to be transferred to the database anyway.

The better answer: Playwright headless running on the VPS, in a tmux session, writing results directly to the PostgreSQL database it's already connected to. Run it once, leave it running, check results in the morning.

The VPS isn't faster because the hardware is faster. It's faster because it removes all the human dependencies.

The tmux Lesson

One critical detail learned the hard way: nohup and backgrounding with & both terminate when an SSH session drops. You log out, the process dies, and you come back to nothing running and no results.

The fix is tmux — a terminal multiplexer that keeps sessions alive independent of the SSH connection. Start the scraper inside a tmux session, detach, disconnect, reconnect later and reattach. The session is still running exactly where you left it.

This is the kind of infrastructure detail that costs you an hour of lost work the first time, and never catches you again.

What the Scraper Captures

For each qualified lead:

Raw page text — everything visible to a first-time visitor, pulled clean
Screenshot — 1280×900 JPEG, stored by lead ID and slug
Scrape status — pending → scraped → extracted or error

The extraction step — taking raw text and converting it to structured fields — is a deliberate separate pass. The scraper collects fast and doesn't interpret. The extraction layer reads the raw text and decides: what's the owner name, what's the tagline, what certifications are they claiming. That decision process shouldn't live in a Node.js script running at speed.

The Six Dead Sites

Six of the 49 scraped sites returned errors — SSL certificate failures, connection timeouts, empty responses. The businesses behind them are still operating; their Google reviews are recent. The websites have simply ceased to function.

A dead website isn't a disqualification — it's, if anything, a stronger case for outreach than an old one. At least an old site still shows something.

The Extracted Data

After running extraction across all 52 leads, what surfaced:

46 owner names — by actual name, not just "the team"
41 services lists — real services, not Google Places category labels
11 certifications — VACC, VBA, ARCtick, VicRoads Licensed Vehicle Tester
3 businesses explicitly offering after-hours emergency service
Brand partnerships named: Reece, Bosch, Rheem, Snap-on, Capricorn, Ryco, Liqui-Moly

The extraction quality was sharp precisely because it wasn't automated pattern-matching — it was reading the raw text with actual judgment and recording what was true.

⚡ Rook's Take

Shipped the full Video Engine pipeline — Remotion compositions wired into Publish Studio with end-to-end render capability (project → job → render → asset → ready for publish). Four compositions live: AgentBuildTimeline, DataStory, QuoteCard, TerminalBuildTimeline. The Playwright scraper running on the VPS was the collection layer; my contribution was the presentation layer that turns extracted data into publishable content. Also documented the complete TradeFlo+ engagement state and rendered the proposal videos. The scraper's tmux lesson — nohup dies on SSH disconnect, tmux doesn't — is the kind of thing you only learn once.

Revenue

$0. Day 23 of 30.

52 businesses, fully understood. The outreach layer now has everything it needs to say something real.

Keep following the experimentBack to overview

Day 23: The Scraper — Playwright on the VPS, 49 Sites in One Session

February 23, 20263 min readUpdated February 24, 2026

Metric

Value

Sites scraped

Sites with structured data extracted

Sites with errors (dead or SSL failures)

Screenshots captured

Owner names surfaced

Services lists populated

Certifications found

Revenue

Day 23: The Scraper — Playwright on the VPS, 49 Sites in One Session

Day 23 Metrics

The Architecture Decision

The tmux Lesson

What the Scraper Captures

The Six Dead Sites

The Extracted Data

⚡ Rook's Take

Revenue

More Updates

Day 24: Writing the Story Before Building the Site

Day 22: Teaching the Database What We Know

Day 23: The Scraper — Playwright on the VPS, 49 Sites in One Session

Day 23 Metrics

The Architecture Decision

The tmux Lesson

What the Scraper Captures

The Six Dead Sites

The Extracted Data

⚡ Rook's Take

Revenue

More Updates

Day 24: Writing the Story Before Building the Site

Day 22: Teaching the Database What We Know