How Breakpoint aggregates tech blogs

Scraping Process

How the scraper works

CADENCE

Full scrapes run roughly every three days. When a new publisher is approved, an on-demand run kicks off so their posts show up right away.
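
As a rough illustration of that schedule, here is a minimal sketch rather than Breakpoint's actual scheduler. The `Publisher` type, the `newly_approved` flag, and the exact three-day interval are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Assumed value, taken from "roughly every three days".
FULL_SCRAPE_INTERVAL = timedelta(days=3)


@dataclass
class Publisher:
    """Hypothetical publisher record used only for this sketch."""
    name: str
    newly_approved: bool = False


def publishers_to_scrape(
    publishers: List[Publisher],
    last_full_run: Optional[datetime],
    now: datetime,
) -> List[Publisher]:
    """Return everyone when a full scrape is due, else only new approvals."""
    full_due = last_full_run is None or now - last_full_run >= FULL_SCRAPE_INTERVAL
    if full_due:
        return list(publishers)
    # Between full scrapes, only newly approved publishers get an on-demand run.
    return [p for p in publishers if p.newly_approved]
```

Calling `publishers_to_scrape` one day after the last full run would return only the newly approved publishers; four days after, it would return all of them.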

RSS FIRST, SCRAPING SECOND

If a publisher exposes an RSS or Atom feed, we use it — it's faster, cheaper, and easier on their servers. For sites without a feed, we render and parse the page directly.
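
The feed-first decision could look something like the sketch below. It relies on the common convention of advertising a feed through a `<link rel="alternate">` tag in the page head; the function names and the `"feed"` / `"render"` labels are hypothetical, and how Breakpoint actually discovers feeds is not stated here.

```python
from html.parser import HTMLParser
from typing import Optional, Tuple
from urllib.parse import urljoin

# MIME types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Records the first <link rel="alternate"> that points at a feed."""

    def __init__(self) -> None:
        super().__init__()
        self.feed_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.feed_href is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel_tokens and is_feed and attr.get("href"):
            self.feed_href = attr["href"]


def discover_feed(page_url: str, html: str) -> Optional[str]:
    """Return the absolute feed URL advertised by the page, if any."""
    finder = FeedLinkFinder()
    finder.feed(html)
    if finder.feed_href is None:
        return None
    return urljoin(page_url, finder.feed_href)


def choose_strategy(page_url: str, html: str) -> Tuple[str, str]:
    """Prefer the feed; fall back to rendering and parsing the page."""
    feed_url = discover_feed(page_url, html)
    if feed_url:
        return ("feed", feed_url)
    return ("render", page_url)
```

A site that advertises no feed in its markup falls through to the `"render"` branch, which matches the second half of the rule above.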

AI EXTRACTION

An LLM extracts the title, author, date, description, and tags from each post. It also flags content that's off-topic or out of scope so it never reaches the Feed.
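
To make the shape of that output concrete, here is a sketch of how the model's response might be validated before a post goes any further. It assumes the model returns a JSON object with `title`, `author`, `date` (ISO format), `description`, `tags`, and an `in_scope` flag. That schema and the names `ExtractedPost` and `parse_extraction` are illustrative assumptions, and the model call itself is left out.

```python
import json
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ExtractedPost:
    """Hypothetical record for the fields the LLM extracts."""
    title: str
    author: Optional[str]
    published: Optional[date]
    description: str
    tags: List[str] = field(default_factory=list)
    in_scope: bool = True


def parse_extraction(raw_json: str) -> Optional[ExtractedPost]:
    """Validate the model's JSON; return None for out-of-scope posts."""
    data = json.loads(raw_json)
    title = (data.get("title") or "").strip()
    if not title:
        raise ValueError("extraction is missing a title")
    raw_date = data.get("date")
    post = ExtractedPost(
        title=title,
        author=data.get("author"),
        published=date.fromisoformat(raw_date) if raw_date else None,
        description=(data.get("description") or "").strip(),
        # Normalize tags so "Rust" and " rust " count as the same tag.
        tags=[t.strip().lower() for t in (data.get("tags") or []) if t.strip()],
        in_scope=bool(data.get("in_scope", True)),
    )
    # Posts flagged as off-topic are dropped here and never reach the Feed.
    return post if post.in_scope else None
```

Validating the response in code, rather than trusting it as-is, means a malformed or incomplete extraction fails loudly instead of producing a half-empty Feed entry.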

DEDUPLICATION

Every post gets a content hash. If we've already ingested it — even under a slightly different URL — we skip it.
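
A content hash of this kind can be sketched as follows. The normalization steps (Unicode NFKC, case folding, whitespace collapsing), the choice of SHA-256, and the in-memory `Deduplicator` are assumptions for the example. This version only catches posts whose text matches after normalization, not loosely reworded copies.

```python
import hashlib
import re
import unicodedata
from typing import Set


def content_hash(title: str, body: str) -> str:
    """Hash the normalized text so the URL plays no part in identity."""
    text = unicodedata.normalize("NFKC", f"{title}\n{body}").casefold()
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Deduplicator:
    """Keeps the set of hashes seen so far (in memory, for illustration)."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_new(self, title: str, body: str) -> bool:
        digest = content_hash(title, body)
        if digest in self._seen:
            return False  # already ingested, possibly under another URL
        self._seen.add(digest)
        return True
```

Because the hash is computed from the post's text alone, the same article served from a URL with tracking parameters or a trailing slash maps to the same digest and is skipped.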

STORAGE & RANKING

New posts land in our database with embeddings for semantic search. The Feed then ranks them using recency, popularity, and your subscriptions.
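
One way to combine those three signals is a weighted score, sketched below. The weights, the 48-hour half-life, the use of view counts as the popularity signal, and the `Post` fields are all assumptions chosen for illustration. The real ranking formula is not described here, and the embedding-based semantic search is not shown.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Set


@dataclass
class Post:
    """Hypothetical post record; `views` stands in for popularity."""
    title: str
    publisher: str
    published_at: datetime
    views: int


def score(post: Post, now: datetime, subscriptions: Set[str],
          half_life_hours: float = 48.0) -> float:
    """Blend recency, popularity, and subscription into one number."""
    age_hours = max((now - post.published_at).total_seconds() / 3600.0, 0.0)
    recency = 0.5 ** (age_hours / half_life_hours)  # halves every half-life
    # Log scale so a viral post cannot swamp everything; capped at 1.0.
    popularity = min(math.log1p(post.views) / math.log1p(10_000), 1.0)
    subscribed = 1.0 if post.publisher in subscriptions else 0.0
    # Assumed weights: recency matters most, then popularity, then subscriptions.
    return 0.5 * recency + 0.3 * popularity + 0.2 * subscribed


def rank_feed(posts: List[Post], now: datetime,
              subscriptions: Set[str]) -> List[Post]:
    """Return posts ordered from highest to lowest score."""
    return sorted(posts, key=lambda p: score(p, now, subscriptions), reverse=True)
```

With these weights, two posts of equal age and popularity are separated by subscription status, so the one from a publisher you follow ranks first.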

Troubleshooting