Pinecone

This connector materializes Flow collections into namespaces in a Pinecone index.

The connector uses the OpenAI Embedding API to create vector embeddings based on the documents in your collections and inserts these vector embeddings and associated metadata into Pinecone for storage and retrieval.

ghcr.io/estuary/materialize-pinecone:dev provides the latest connector image. You can also follow the link in your browser to see past image versions.

Prerequisites

To use this connector, you'll need:

  • A Pinecone account with an API Key for authentication.
  • An OpenAI account with an API Key for authentication.
  • A Pinecone Index created to store materialized vector embeddings. When using the embedding model text-embedding-ada-002 (recommended), the index must have Dimensions set to 1536.

Embedding Input

The materialization creates a vector embedding for each collection document. Its structure is based on the collection fields.

By default, fields of a single scalar type are included in the embedding: strings, integers, numbers, and booleans. You can include additional array or object type fields using projected fields.

The text generated for the embedding has this structure, with field names and their values separated by newlines:

stringField: stringValue
intField: 3
numberField: 1.2
boolField: false
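
As a rough illustration of the layout above, the following Python sketch builds the same text from a document. The function name `embedding_text` and its rendering rules (JSON-style booleans, fields in document order, non-scalar fields skipped) are assumptions made for this example, not the connector's actual implementation.

```python
import json


def embedding_text(doc: dict) -> str:
    """Render scalar fields as 'name: value' lines joined by newlines.

    Hypothetical helper: it mimics the documented text layout only.
    """
    lines = []
    for name, value in doc.items():
        # Keep only single scalar types; arrays, objects, and nulls are skipped.
        if isinstance(value, str):
            rendered = value
        elif isinstance(value, (bool, int, float)):
            # json.dumps renders booleans as true/false, as in the sample above.
            rendered = json.dumps(value)
        else:
            continue
        lines.append(f"{name}: {rendered}")
    return "\n".join(lines)


doc = {
    "stringField": "stringValue",
    "intField": 3,
    "numberField": 1.2,
    "boolField": False,
    "arrayField": [1, 2, 3],  # not a scalar, so left out by default
}
print(embedding_text(doc))
```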

Pinecone Record Metadata

Pinecone supports metadata fields associated with stored vectors that can be used when performing vector queries. This materialization will include the materialized document as a JSON string in the metadata field flow_document to enable retrieval of the document from vectors returned by Pinecone queries.
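
A minimal sketch of recovering the original document from a query result, assuming the query returned metadata with each match. The `match` dictionary below is a hypothetical stand-in shaped like one entry of a Pinecone query response, and `document_from_match` is a helper name invented for this example.

```python
import json


def document_from_match(match: dict) -> dict:
    """Decode the Flow document stored as a JSON string in flow_document."""
    return json.loads(match["metadata"]["flow_document"])


# Hypothetical match; real results come from a Pinecone query.
match = {
    "id": "example-id",
    "score": 0.87,
    "metadata": {"flow_document": '{"id": 1, "title": "hello"}'},
}

document = document_from_match(match)
print(document["title"])
```

Because `flow_document` is stored as a string, it has to be decoded before its fields can be read.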

Pinecone indexes all metadata fields by default. To manage memory usage of the index, use selective metadata indexing to exclude the flow_document metadata field.

Properties

Endpoint

| Property | Title | Description | Type | Required/Default |
|---|---|---|---|---|
| /index | Pinecone Index | Pinecone index for this materialization. Must already exist and have appropriate dimensions for the embedding model used. | string | Required |
| /environment | Pinecone Environment | Cloud region for your Pinecone project. Example: us-central1-gcp | string | Required |
| /pineconeApiKey | Pinecone API Key | Pinecone API key used for authentication. | string | Required |
| /openAiApiKey | OpenAI API Key | OpenAI API key used for authentication. | string | Required |
| /embeddingModel | Embedding Model ID | Embedding model ID for generating OpenAI embeddings. The default text-embedding-ada-002 is recommended. | string | "text-embedding-ada-002" |
| /advanced | | Options for advanced users. You should not typically need to modify these. | object | |
| /advanced/openAiOrg | OpenAI Organization | Optional organization name for OpenAI requests. Use this if you belong to multiple organizations to specify which organization is used for API requests. | string | |

Bindings

| Property | Title | Description | Type | Required/Default |
|---|---|---|---|---|
| /namespace | Pinecone Namespace | Name of the Pinecone namespace that this collection will materialize vectors into. | string | Required |

Sample

materializations:
  ${PREFIX}/${mat_name}:
    endpoint:
      connector:
        image: "ghcr.io/estuary/materialize-pinecone:dev"
        config:
          index: your-index
          environment: us-central1-gcp
          pineconeApiKey: <YOUR_PINECONE_API_KEY>
          openAiApiKey: <YOUR_OPENAI_API_KEY>
    bindings:
      - resource:
          namespace: your-namespace
        source: ${PREFIX}/${COLLECTION_NAME}

Delta Updates

This connector operates only in delta updates mode.

Pinecone upserts vectors based on their id. The id for materialized vectors is based on the Flow Collection key.

For collections with a top-level reduction strategy of merge and a strategy of lastWriteWins for all nested values (the default), documents are materialized "effectively once": an updated Flow document replaces the vector in the Pinecone index that has the same key.
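
The replace-by-key behavior can be pictured with a small in-memory model. This is only a sketch: the `index` dictionary stands in for a Pinecone namespace, `upsert` is a hypothetical helper rather than a client call, and the ids such as `user-1` are made up, since the exact way the connector derives an id from the collection key is not described here.

```python
def upsert(index: dict, vector_id: str, values: list, metadata: dict) -> None:
    """Insert a vector, or overwrite the existing one with the same id."""
    index[vector_id] = {"values": values, "metadata": metadata}


# Stand-in for one Pinecone namespace, keyed by vector id.
index = {}

upsert(index, "user-1", [0.1, 0.2], {"flow_document": '{"id": 1, "name": "old"}'})
upsert(index, "user-2", [0.3, 0.4], {"flow_document": '{"id": 2, "name": "other"}'})

# A later document with the same key replaces the earlier vector
# instead of adding a second one.
upsert(index, "user-1", [0.5, 0.6], {"flow_document": '{"id": 1, "name": "new"}'})

print(len(index))
print(index["user-1"]["metadata"]["flow_document"])
```

After the three upserts the model holds two vectors, and the entry for `user-1` carries the most recent document.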