Read primary sources without slowing down the loop.
local-first web evidence capture
BitterCrawl is the web intake valve for Bitter: fetch a page, extract clean markdown, preserve the source, and emit a crawl receipt every future run can audit. No account, API key, credit meter, or hosted toll booth in the default path.
01 / why
Agents need to open docs, inspect competitors, read changelogs, verify pricing pages, compare examples, and preserve evidence. That should be as ordinary as reading a local file or checking Git.
Hosted scrape APIs are useful escalation paths. They should not be the default economics of Bitter's research loop. BitterCrawl starts with local fetch and extraction, then records what happened so a later run can reuse and audit the source.
for builders
Docs, READMEs, changelogs, pricing pages, and competitor copy become local artifacts agents can reuse.
for operators
Every important fetch leaves behind the URL, timestamp, hashes, and files needed to reconstruct the claim.
Normal public pages should not require an account, API key, credit balance, or hosted dependency.
The durable output is not just markdown. It is a proof path: URL, fetch, artifacts, policy, and hashes.
Static fetch first. Crawl4AI, browser rendering, and hosted providers become explicit adapters.
02 / cli
The first wedge is deliberately narrow: one URL in, markdown and a receipt out. Human commands and agent method calls should converge on the same `crawl.read` primitive.
default read
bitter crawl example.com
Fetch a page, extract main content, write local artifacts, and print useful markdown.
agent envelope
bitter crawl example.com --json
Return `bitter.crawl.result.v0` for agents, scripts, and future method graph calls.
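The exact fields of the envelope aren't spelled out on this page; as a hypothetical sketch (every field name below is an assumption, not a published schema), an agent-facing `bitter.crawl.result.v0` payload might carry the URL, adapter, and artifact paths:

```javascript
// Hypothetical shape for a bitter.crawl.result.v0 envelope.
// Field names are illustrative, not BitterCrawl's actual schema.
const exampleResult = {
  schema: "bitter.crawl.result.v0",
  url: "https://example.com/",   // final URL after redirects
  provider: "static",            // adapter that produced the content
  artifacts: {
    raw: ".bitter/crawl/runs/<id>/raw.html",
    markdown: ".bitter/crawl/runs/<id>/content.md",
    receipt: ".bitter/crawl/runs/<id>/receipt.json",
  },
};

// A consumer can sanity-check an envelope before trusting it.
function isCrawlResult(value) {
  return (
    value != null &&
    value.schema === "bitter.crawl.result.v0" &&
    typeof value.url === "string" &&
    typeof (value.artifacts && value.artifacts.receipt) === "string"
  );
}
```

Versioning the schema name (`.v0`) lets agents and scripts check what they were handed before acting on it.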
explicit custody
bitter crawl docs.example.com --out .bitter/crawl/runs
Control where run directories, raw HTML, headers, markdown, and receipts are written.
adapter choice
bitter crawl app.example.com --provider browser
Escalate only when a page needs rendering, interaction, or a configured hosted provider.
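The escalation rule implies a small dispatch step: no flag means the local static path, and anything else must name a configured adapter. A sketch, where the registry and adapter shapes are hypothetical but the adapter names mirror the ones this page describes:

```javascript
// Sketch: resolve a --provider flag to an adapter, defaulting to the
// local static path. Registry and adapter shapes are hypothetical.
const adapters = {
  static: (url) => ({ provider: "static", url }),   // native fetch, no rendering
  browser: (url) => ({ provider: "browser", url }), // pages that need a renderer
  hosted: (url) => ({ provider: "hosted", url }),   // configured hosted fallback
};

function resolveAdapter(flag) {
  const name = flag || "static"; // no flag => local static fetch
  const adapter = adapters[name];
  if (!adapter) throw new Error(`unknown provider: ${name}`);
  return adapter;
}
```

Failing loudly on an unknown provider keeps escalation explicit: a run never silently falls through to a hosted dependency.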
03 / receipt
A useful crawl should answer the questions a future agent will ask: what URL was read, when, by which adapter, what changed on redirect, what was saved, and which hashes identify the evidence.
why this matters
Summaries are cheap to produce and easy to misremember. Receipts make a web claim reconstructable without trusting the agent's prose.
04 / path
Native fetch, Readability, Turndown, markdown, raw artifacts, receipt.
Optional local adapter, such as Crawl4AI, for richer extraction when users install it.
Crawlee or Playwright for rendered pages, screenshots, and sessions.
Firecrawl, Jina, or Bitter-hosted workers as configured fallback paths.
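The first step of that path can be sketched end to end with Node's built-in `fetch`. A real pipeline would use Readability and Turndown as listed above; the naive tag stripper below is a deliberate stand-in so only the shape of the flow is shown.

```javascript
// Sketch of the static path: fetch, keep the raw HTML as evidence,
// extract text. The regex stripper is a naive stand-in for
// Readability + Turndown, not a real extraction strategy.
function htmlToText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

async function staticFetch(url) {
  const res = await fetch(url, { redirect: "follow" }); // Node 18+ global fetch
  const rawHtml = await res.text();
  return {
    finalUrl: res.url, // records where redirects landed
    status: res.status,
    rawHtml,           // preserved verbatim for the receipt's raw hash
    text: htmlToText(rawHtml),
  };
}
```

Keeping `rawHtml` alongside the extracted text is what makes the later receipt meaningful: the extraction can be redone or disputed because the source never left the machine.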
05 / faq
Is BitterCrawl a hosted crawl API?
Not at first. The first version is a local-first CLI primitive that fetches a page, extracts markdown, writes artifacts, and emits a receipt. Hosted crawling belongs behind explicit provider flags.
How is this different from Firecrawl?
It is a different default. Firecrawl is a useful compatibility adapter and benchmark. BitterCrawl is the custody layer: evidence, cache discipline, source artifacts, and receipts for Bitter's agent loop.
What about pages that need JavaScript?
The local-static adapter covers normal documents first. Browser rendering, Crawl4AI, and hosted providers become explicit escalation paths when static fetch is not enough.
06 / first wedge
BitterCrawl Cloud can wait. The immediate product promise is simple: every agent can read a public page and leave a receipt.
$ bitter crawl example.com --json