Cortex Click provides a web scraper that is optimized for parsing and cleaning data for LLM consumption.
It automatically transforms content into markdown, cleans redundant sections like nav headers and sidebars, and resolves images and links to fully qualified paths.
This enables the intelligent content engine to insert images and citations from your website automatically.
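To illustrate what resolving links and images to fully qualified paths means, here is a minimal sketch built on the standard `URL` constructor. It is not the scraper's actual implementation; `toAbsoluteUrl` and `absolutizeMarkdownLinks` are hypothetical helper names, and the regular expression only handles simple `[text](target)` and `![alt](target)` forms without titles.

```typescript
// Resolve a link or image reference against the URL of the page it was found on.
// Returns null when the reference (or the base) cannot be parsed as a URL.
function toAbsoluteUrl(ref: string, pageUrl: string): string | null {
  try {
    return new URL(ref, pageUrl).href;
  } catch {
    return null;
  }
}

// Rewrite relative markdown link and image targets to absolute URLs.
// Hypothetical helper: targets that cannot be resolved are left unchanged.
function absolutizeMarkdownLinks(markdown: string, pageUrl: string): string {
  // Matches the "](target)" part of [text](target) and ![alt](target).
  return markdown.replace(/\]\(([^)\s]+)\)/g, (match, target: string) => {
    const absolute = toAbsoluteUrl(target, pageUrl);
    return absolute === null ? match : `](${absolute})`;
  });
}
```

With absolute targets, an image or citation copied out of the scraped markdown still points at the original site, wherever the generated content ends up being published.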
Scraping individual URLs
Upsert one or more URLs for web scraping. Upserting URLs returns immediately with a 202 Accepted response, and scraping and indexing happen asynchronously.
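Because the upsert only queues work, it can help to tidy the URL list on the client and to treat a 202 as "queued" rather than "indexed". The sketch below assumes nothing about the Cortex Click API. `prepareUrlBatch` and `isQueued` are hypothetical helpers, and dropping URL fragments is a choice made for this example, not documented scraper behavior.

```typescript
// Keep only absolute http(s) URLs, normalized and de-duplicated,
// so that each page is submitted once per batch.
function prepareUrlBatch(urls: string[]): string[] {
  const seen = new Set<string>();
  for (const raw of urls) {
    let parsed: URL;
    try {
      parsed = new URL(raw.trim());
    } catch {
      continue; // skip anything that is not an absolute URL
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      continue; // skip non-web schemes such as ftp: or mailto:
    }
    parsed.hash = ""; // assumption: a fragment points into the same page
    seen.add(parsed.href);
  }
  return Array.from(seen);
}

// HTTP 202 means the request was accepted for processing;
// the pages have not necessarily been scraped or indexed yet.
function isQueued(status: number): boolean {
  return status === 202;
}
```

A caller would pass the prepared list to whatever client performs the upsert, then check for the scraped documents later instead of expecting them to be available as soon as the call returns.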
Scraping sitemaps
Upsert one or more sitemap documents to scrape and index an entire website. Sitemaps and sitemap indexes are traversed recursively. Upserting sitemaps returns immediately with a 202 Accepted response, and scraping and indexing happen asynchronously.
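Recursive traversal can be pictured as follows: a sitemap index lists other sitemaps, each of which is followed in turn until plain page URLs are reached. This is a rough sketch of the idea and not the scraper's own code. `LoadXml`, `extractLocs`, and `collectPageUrls` are hypothetical names, the loader is injected and synchronous so the example runs without network access, and the regex-based parsing does not decode XML entities.

```typescript
// Returns the XML text of a sitemap, or undefined if it cannot be loaded.
// Injected so the sketch can run against in-memory documents.
type LoadXml = (url: string) => string | undefined;

// Pull the text of every <loc> element out of a sitemap document.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Walk a sitemap or sitemap index and return the page URLs it leads to.
function collectPageUrls(
  sitemapUrl: string,
  loadXml: LoadXml,
  visited: Set<string> = new Set<string>()
): string[] {
  if (visited.has(sitemapUrl)) {
    return []; // guard against sitemaps that reference each other
  }
  visited.add(sitemapUrl);

  const xml = loadXml(sitemapUrl);
  if (xml === undefined) {
    return [];
  }

  const locs = extractLocs(xml);
  if (xml.includes("<sitemapindex")) {
    // An index: every <loc> is another sitemap to traverse.
    const pages: string[] = [];
    for (const child of locs) {
      pages.push(...collectPageUrls(child, loadXml, visited));
    }
    return pages;
  }
  // A regular sitemap: every <loc> is a page URL.
  return locs;
}
```

The `visited` set keeps the walk finite even if two sitemap files point at each other. A real loader would fetch each document over HTTP and would most likely be asynchronous.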
Automatic indexing using a web scraper indexer
From a catalog page, you can create a web scraper indexer to automatically update your content. Indexers can run on a daily, weekly, or monthly schedule. You can also run an indexer on demand at any time.
Scroll to the bottom of the page to find the Indexers section, and choose the "Create indexer to scrape urls and sitemaps" option.
You can also create, run, and manage indexers using the SDK: