How the scraper works
CADENCE
Full scrapes run roughly every three days. When a new publisher is approved, an on-demand run kicks off so their posts show up right away.
RSS FIRST, SCRAPING SECOND
If a publisher exposes an RSS or Atom feed, we use it — it's faster, cheaper, and easier on their servers. For sites without a feed, we render and parse the page directly.
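As a rough illustration of the feed-first decision, here is a minimal sketch that looks for a feed advertised in a page's HTML and falls back to scraping when none is found. The names `FeedLinkFinder`, `discover_feeds`, and `choose_strategy` are hypothetical, and relying on `<link rel="alternate">` autodiscovery is an assumption; the real scraper may locate feeds differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that commonly advertise an RSS or Atom feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            self.feeds.append(attr["href"])


def discover_feeds(page_url, html):
    """Return absolute URLs of every feed advertised in the page."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return [urljoin(page_url, href) for href in finder.feeds]


def choose_strategy(page_url, html):
    """Prefer the first advertised feed; otherwise scrape the page itself."""
    feeds = discover_feeds(page_url, html)
    return ("feed", feeds[0]) if feeds else ("scrape", page_url)
```

Calling `choose_strategy` with a page that advertises a feed returns the feed URL to fetch; without one, it returns the page URL to render and parse.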
AI EXTRACTION
An LLM extracts the title, author, date, description, and tags from each post. It also flags content that's off-topic or out of scope so it never reaches the Feed.
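To make the filtering step concrete, here is a minimal sketch of how a model's response could be validated before a post moves on. It assumes the model returns a JSON object with an `in_scope` flag; that response shape, the `ExtractedPost` type, and the `parse_extraction` helper are all hypothetical and not taken from the actual pipeline.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractedPost:
    title: str
    author: str
    date: str
    description: str
    tags: list


def parse_extraction(raw_json):
    """Return an ExtractedPost, or None if the response is malformed or out of scope."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not data.get("title"):
        return None
    # Assumed flag: anything the model marks off-topic is dropped here.
    if not data.get("in_scope", False):
        return None
    tags = data.get("tags")
    if not isinstance(tags, list):
        tags = []
    return ExtractedPost(
        title=str(data["title"]).strip(),
        author=str(data.get("author") or ""),
        date=str(data.get("date") or ""),
        description=str(data.get("description") or ""),
        tags=[str(t).strip().lower() for t in tags if str(t).strip()],
    )
```

Returning `None` for both malformed and out-of-scope responses keeps the caller simple: only posts that parse cleanly and pass the scope check continue toward the Feed.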
DEDUPLICATION
Every post gets a content hash. If we've already ingested it — even under a slightly different URL — we skip it.
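Here is a minimal sketch of content-hash deduplication. Because the hash is computed from the post's text rather than its URL, the same post reached through a different URL still produces the same hash. The normalization steps, the choice of SHA-256, and the in-memory `Deduplicator` class are assumptions for illustration; the production system presumably checks hashes against its database.

```python
import hashlib
import re
import unicodedata


def content_hash(title, body):
    """Hash normalized post text so trivial formatting changes don't matter."""
    text = unicodedata.normalize("NFKC", f"{title}\n{body}").lower()
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers the hashes of posts that have already been ingested."""

    def __init__(self):
        self._seen = set()

    def is_new(self, title, body):
        digest = content_hash(title, body)
        if digest in self._seen:
            return False  # already ingested, possibly under another URL
        self._seen.add(digest)
        return True
```

An exact hash like this only catches posts whose normalized text is identical; catching lightly edited copies would need a fuzzier technique, which this sketch does not attempt.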
STORAGE & RANKING
New posts land in our database with embeddings for semantic search. The Feed then ranks them using recency, popularity, and your subscriptions.
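One way to picture how the three ranking signals could combine is a weighted score. The sketch below is purely illustrative: the `Post` fields, the use of a view count as the popularity signal, the exponential recency decay, and every weight and default are assumptions, not the Feed's actual formula.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Post:
    title: str
    publisher: str
    published_at: datetime
    views: int  # assumed stand-in for "popularity"


def score(post, subscriptions, now, half_life_days=3.0,
          w_recency=1.0, w_popularity=0.5, w_subscribed=1.0):
    """Combine recency, popularity, and subscription into one number."""
    age_days = max((now - post.published_at).total_seconds() / 86400, 0.0)
    recency = 0.5 ** (age_days / half_life_days)   # halves every half_life_days
    popularity = math.log1p(max(post.views, 0))    # dampen very large counts
    subscribed = 1.0 if post.publisher in subscriptions else 0.0
    return (w_recency * recency
            + w_popularity * popularity
            + w_subscribed * subscribed)


def rank(posts, subscriptions, now):
    """Return posts ordered from highest to lowest score."""
    return sorted(posts, key=lambda p: score(p, subscriptions, now), reverse=True)
```

With these assumed weights, a month-old post from a publisher you subscribe to can still outrank a day-old post from one you don't, which is the kind of trade-off the weights control.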
Troubleshooting
RECENT POST IS MISSING
Open Manage Blog and trigger a manual rescrape. You can rescrape your publisher at most once per hour. Each run appears under Recent Scrapes with the number of posts found and inserted.
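For the curious, the hourly limit can be thought of as a per-publisher cooldown. This is a minimal sketch under that assumption; the `RescrapeLimiter` class and its in-memory storage are hypothetical, and the real limit may be enforced differently.

```python
from datetime import datetime, timedelta, timezone

RESCRAPE_COOLDOWN = timedelta(hours=1)


class RescrapeLimiter:
    """Allows one manual rescrape per publisher per cooldown window."""

    def __init__(self):
        self._last_run = {}  # publisher id -> time of last accepted rescrape

    def try_acquire(self, publisher_id, now):
        last = self._last_run.get(publisher_id)
        if last is not None and now - last < RESCRAPE_COOLDOWN:
            return False  # still cooling down; rejected requests don't reset the timer
        self._last_run[publisher_id] = now
        return True
```

Each publisher has its own timer, so one publisher's rescrape never blocks another's.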
STILL MISSING AFTER A RESCRAPE
Sometimes the scraper can't pick a post up: the page layout is unusual, the post sits behind a redirect, or the feed doesn't list it. Use "Add a missing post" on Manage Blog to submit it manually with its title, URL, description, and tags.
A SCRAPE KEEPS FAILING
Check the most recent run on Manage Blog; failed runs include the underlying error. Common causes: the blog moved to a new domain, the RSS feed changed its structure, or the site started rate-limiting our requests.
STILL NEED HELP
Reach out to us directly and we'll dig in.
Have a blog you'd like us to add? Submit it from the About page or your Dashboard.