Upserting documents

The upsertDocuments method adds or updates multiple documents in the catalog in a single operation. It supports five document types: inline text or markdown, inline JSON, files, URLs, and sitemaps.

Supported document types

TextDocument: For inline text or markdown content
JSONDocument: For inline JSON content
FileDocument: For file-based content (.docx, .md, .mdx, and .txt)
UrlDocument: For web page content
SitemapDocument: For scraping every page listed in a sitemap

Parameters

batch: DocumentBatch - An array of documents to be upserted. All documents in the batch must have the same content type.
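
Because a batch must contain a single content type, a mixed list of documents has to be split before it is upserted. The following is a minimal sketch of one way to do that, assuming "content type" refers to the contentType field shown in the examples on this page. AnyDocument and groupByContentType are hypothetical helpers written for illustration and are not part of the SDK.

```typescript
// Hypothetical minimal shape for illustration; the SDK's document types carry more fields.
type AnyDocument = { contentType: string };

// Split a mixed list into one batch per contentType, preserving input order within each batch.
function groupByContentType<T extends AnyDocument>(docs: T[]): Map<string, T[]> {
  const grouped = new Map<string, T[]>();
  for (const doc of docs) {
    const existing = grouped.get(doc.contentType);
    if (existing) {
      existing.push(doc);
    } else {
      grouped.set(doc.contentType, [doc]);
    }
  }
  return grouped;
}

const mixed = [
  { documentId: "1", contentType: "markdown", content: "# title" },
  { documentId: "2", contentType: "text", content: "plain text" },
  { documentId: "3", contentType: "markdown", content: "# another title" },
];

// One entry per content type: "markdown" (documents 1 and 3) and "text" (document 2).
const batches = groupByContentType(mixed);
```

Each value in the resulting map could then be passed to upsertDocuments as its own batch, one call per content type.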

Returns

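A Promise that resolves when the upsert operation is complete. For UrlDocument and SitemapDocument batches, the Promise resolves once the request has been accepted; scraping and indexing continue asynchronously.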

TextDocument

Upserting inline markdown:

// Assumes `client` is an already-configured client instance.
const catalog = await client.getCatalog("github-markdown");
 
const docs: TextDocument[] = [
  {
    documentId: "1",
    contentType: "markdown",
    content: "# some markdown",
    url: "https://foo.com",
    imageUrl: "https://foo.com/image.jpg",
  },
  {
    documentId: "2",
    contentType: "markdown",
    content: "# some more markdown",
    url: "https://foo.com/2",
    imageUrl: "https://foo.com/image2.jpg",
  },
];
 
await catalog.upsertDocuments(docs);

Upserting inline text:

const catalog = await client.getCatalog("text-catalog");
 
const docs: TextDocument[] = [
  {
    documentId: "1",
    contentType: "text",
    content: "some plain text",
    url: "https://foo.com",
    imageUrl: "https://foo.com/image.jpg",
  },
  {
    documentId: "2",
    contentType: "text",
    content: "some more plain text",
    url: "https://foo.com/2",
    imageUrl: "https://foo.com/image2.jpg",
  },
];
 
await catalog.upsertDocuments(docs);

JSONDocument

Individual JSON objects can be upserted in a batch. To ingest JSON arrays in bulk, use the JSON indexer instead.

const catalog = await client.getCatalog("json");
const docs: JSONDocument[] = [
  {
    documentId: "1",
    contentType: "json",
    content: {
      foo: "buzz",
      a: [5, 6, 7],
    },
    url: "https://foo.com",
    imageUrl: "https://foo.com/image.jpg",
  },
  {
    documentId: "2",
    contentType: "json",
    content: {
      foo: "bar",
      a: [1, 2, 3],
    },
    url: "https://foo.com/2",
    imageUrl: "https://foo.com/image2.jpg",
  },
];
 
await catalog.upsertDocuments(docs);

FileDocument

Upload .txt, .md, .mdx, or .docx files:

// Assumes `catalog` was obtained with client.getCatalog(), as in the earlier examples.
const docs: FileDocument[] = [
  {
    documentId: "1",
    contentType: "file",
    filePath: "./brand-guidelines.md",
    url: "https://foo.com",
    imageUrl: "https://foo.com/image.jpg",
  },
  {
    documentId: "2",
    contentType: "file",
    filePath: "./customer-testimonials.docx",
    url: "https://foo.com/2",
    imageUrl: "https://foo.com/image2.jpg",
  },
];
 
await catalog.upsertDocuments(docs);
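
Since only four file extensions are listed as supported, it can help to filter a list of paths before building FileDocument entries. The sketch below shows one way to do this. isSupportedFile is a hypothetical helper, and matching extensions case-insensitively is an assumption; the source does not say how the service treats upper-case extensions.

```typescript
// Extensions listed on this page as supported for file uploads.
const SUPPORTED_EXTENSIONS = [".docx", ".md", ".mdx", ".txt"];

// Hypothetical helper: true when the path ends in a supported extension.
// Case-insensitive matching is an assumption, not documented behavior.
function isSupportedFile(filePath: string): boolean {
  const lower = filePath.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

const candidatePaths = [
  "./brand-guidelines.md",
  "./logo.png",
  "./customer-testimonials.docx",
];

// Paths that can be turned into FileDocument entries.
const supportedPaths = candidatePaths.filter(isSupportedFile);
// Paths to report or handle some other way.
const skippedPaths = candidatePaths.filter((p) => !isSupportedFile(p));
```

Here supportedPaths keeps the .md and .docx files and skippedPaths holds ./logo.png, so unsupported files can be reported before any upload is attempted.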

UrlDocument

Upsert one or more URLs for web scraping. The call returns immediately with a 202 Accepted response; scraping and indexing happen asynchronously.

// Assumes `catalog` was obtained with client.getCatalog(), as in the earlier examples.
const docs: UrlDocument[] = [
  {
    url: "https://www.cortexclick.com/",
    contentType: "url",
  },
];
 
await catalog.upsertDocuments(docs);
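
Because scraping happens after the call has returned, a malformed URL may not surface as an error at upsert time (an assumption; the source does not describe failure reporting). The sketch below therefore validates and de-duplicates raw URL strings before building the batch. UrlDoc and toUrlDocuments are hypothetical names that mirror the shape used in the example above and are not part of the SDK.

```typescript
// Hypothetical local mirror of the UrlDocument shape shown above.
type UrlDoc = { url: string; contentType: "url" };

// Keep only absolute http(s) URLs and drop duplicates after normalization.
function toUrlDocuments(rawUrls: string[]): UrlDoc[] {
  const seen = new Set<string>();
  const result: UrlDoc[] = [];
  for (const raw of rawUrls) {
    let parsed: URL;
    try {
      parsed = new URL(raw); // throws on strings that are not absolute URLs
    } catch {
      continue;
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") continue;
    if (seen.has(parsed.href)) continue;
    seen.add(parsed.href);
    result.push({ url: parsed.href, contentType: "url" });
  }
  return result;
}

const urlDocs = toUrlDocuments([
  "https://www.cortexclick.com/",
  "https://www.cortexclick.com", // same page once normalized
  "not a url",
  "ftp://example.com/file.txt",
]);
```

Only the first entry survives in this sample: the second normalizes to the same href, the third is not a URL, and the fourth is not http or https.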

SitemapDocument

Upsert one or more sitemap documents to scrape and index an entire website. Sitemaps and sitemap indexes are traversed recursively. The call returns immediately with a 202 Accepted response; scraping and indexing happen asynchronously.

// Assumes `catalog` was obtained with client.getCatalog(), as in the earlier examples.
const docs: SitemapDocument[] = [
  {
    sitemapUrl: "https://www.cortexclick.com/sitemap.xml",
    contentType: "sitemap-url",
  },
];
 
await catalog.upsertDocuments(docs);