Web scraping

Cortex Click provides a web scraper that is optimized for parsing and cleaning data for LLM consumption. It automatically transforms content into markdown, cleans redundant sections like nav headers and sidebars, and resolves images and links to fully qualified paths. This enables the intelligent content engine to insert images and citations from your website automatically.

Scraping individual URLs

Upsert one or more URLs for web scraping. Upserting URLs returns immediately with a 202 accepted, and scraping and indexing happens asynchronously.

const docs: UrlDocument[] = [
  {
    url: "https://www.cortexclick.com/",
    contentType: "url",
  },
];
 
await catalog.upsertDocuments(docs);

Scraping sitemaps

Upsert one or more sitemap documents to scrape and index an entire website. Sitemaps and sitemap indexes will be recursively traversed. Upserting sitemaps returns immediately with a 202 accepted, and scraping and indexing happens asynchronously.

const docs: SitemapDocument[] = [
  {
    sitemapUrl: "https://www.cortexclick.com/sitemap.xml",
    contentType: "sitemap-url",
  },
];
 
await catalog.upsertDocuments(docs);

Roadmap

Support for running web scraping on a schedule, and triggering scraping through the app is in active development.