Pinecone
This connector materializes Flow collections into namespaces in a Pinecone index.
The connector uses the OpenAI Embedding API to create vector embeddings based on the documents in your collections and inserts these vector embeddings and associated metadata into Pinecone for storage and retrieval.
ghcr.io/estuary/materialize-pinecone:dev provides the latest connector image. You can also follow the link in your browser to see past image versions.
Prerequisites
To use this connector, you'll need:
- A Pinecone account with an API Key for authentication.
- An OpenAI account with an API Key for authentication.
- A Pinecone Index created to store materialized vector embeddings. When using the embedding model text-embedding-ada-002 (recommended), the index must have Dimensions set to 1536.
Embedding Input
The materialization creates a vector embedding for each collection document. Its structure is based on the collection fields.
By default, fields of a single scalar type are included in the embedding: strings, integers, numbers, and booleans. You can include additional array or object type fields using projected fields.
The text generated for the embedding has this structure, with field names and their values separated by newlines:
stringField: stringValue
intField: 3
numberField: 1.2
boolField: false
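As a rough illustration of the structure above, the following Python sketch renders a document's scalar fields as newline-separated `name: value` pairs. It is not the connector's implementation: the function name `embedding_input`, the field ordering, and the exact value formatting are assumptions made for this example.

```python
from typing import Any, Mapping


def embedding_input(doc: Mapping[str, Any]) -> str:
    """Render scalar fields as 'name: value' lines (hypothetical helper)."""
    lines = []
    for name, value in doc.items():
        # Check bool before int: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            lines.append(f"{name}: {'true' if value else 'false'}")
        elif isinstance(value, (str, int, float)):
            lines.append(f"{name}: {value}")
        # Arrays, objects, and nulls are skipped, mirroring the default
        # behavior described above (assumption: no projected fields).
    return "\n".join(lines)


doc = {
    "stringField": "stringValue",
    "intField": 3,
    "numberField": 1.2,
    "boolField": False,
    "tags": ["a", "b"],  # array field, excluded by default
}
text = embedding_input(doc)
print(text)
```

The `tags` array is left out of the result, which matches the default of embedding only scalar fields.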
Pinecone Record Metadata
Pinecone supports metadata fields associated with stored vectors that can be used when performing vector queries. This materialization will include the materialized document as a JSON string in the metadata field flow_document to enable retrieval of the document from vectors returned by Pinecone queries.
Pinecone indexes all metadata fields by default. To manage memory usage of the index, use selective metadata indexing to exclude the flow_document metadata field.
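Because flow_document is stored as a JSON string, a consumer has to decode it after querying. The sketch below shows one way to do that in Python. It deliberately works on plain dictionaries rather than a Pinecone client: the match shape (`id`, `score`, `metadata`) is an assumption about what a query with metadata included returns, and `documents_from_matches` and the sample data are hypothetical.

```python
import json
from typing import Any, Dict, List


def documents_from_matches(matches: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Decode the flow_document JSON string from each match's metadata."""
    docs = []
    for match in matches:
        raw = match.get("metadata", {}).get("flow_document")
        if raw is not None:
            docs.append(json.loads(raw))
    return docs


# Hand-written sample standing in for query results (assumed shape).
sample_matches = [
    {"id": "a", "score": 0.91,
     "metadata": {"flow_document": '{"id": 1, "title": "hello"}'}},
    {"id": "b", "score": 0.80, "metadata": {}},  # no stored document
]
docs = documents_from_matches(sample_matches)
print(docs)
```

Matches without a flow_document entry are skipped rather than raising an error, which is one reasonable choice when an index also holds vectors written by other tools.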
Properties
Endpoint
Property | Title | Description | Type | Required/Default |
---|---|---|---|---|
/index | Pinecone Index | Pinecone index for this materialization. Must already exist and have appropriate dimensions for the embedding model used. | string | Required |
/environment | Pinecone Environment | Cloud region for your Pinecone project. Example: us-central1-gcp | string | Required |
/pineconeApiKey | Pinecone API Key | Pinecone API key used for authentication. | string | Required |
/openAiApiKey | OpenAI API Key | OpenAI API key used for authentication. | string | Required |
/embeddingModel | Embedding Model ID | Embedding model ID for generating OpenAI embeddings. The default text-embedding-ada-002 is recommended. | string | "text-embedding-ada-002" |
/advanced | | Options for advanced users. You should not typically need to modify these. | object | |
/advanced/openAiOrg | OpenAI Organization | Optional organization name for OpenAI requests. Use this if you belong to multiple organizations to specify which organization is used for API requests. | string | |
Bindings
Property | Title | Description | Type | Required/Default |
---|---|---|---|---|
/namespace | Pinecone Namespace | Name of the Pinecone namespace that this collection will materialize vectors into. | string | Required |
Sample
materializations:
  ${PREFIX}/${mat_name}:
    endpoint:
      connector:
        image: "ghcr.io/estuary/materialize-pinecone:dev"
        config:
          index: your-index
          environment: us-central1-gcp
          pineconeApiKey: <YOUR_PINECONE_API_KEY>
          openAiApiKey: <YOUR_OPENAI_API_KEY>
    bindings:
      - resource:
          namespace: your-namespace
        source: ${PREFIX}/${COLLECTION_NAME}
Delta Updates
This connector operates only in delta updates mode.
Pinecone upserts vectors based on their id. The id for materialized vectors is based on the Flow Collection key.
Collections with a top-level reduction strategy of merge and a strategy of lastWriteWins for all nested values (this is also the default) are materialized "effectively once": any updated Flow document replaces the vector in the Pinecone index that has the same key.
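The replace-by-key behavior can be pictured with a small in-memory model. This Python sketch uses a dictionary as a stand-in for a Pinecone namespace. It does not call Pinecone, and the ids (`user-1`, `user-2`) are hypothetical, since this page does not specify how the connector derives the id string from the collection key.

```python
from typing import Any, Dict, List, Tuple


def upsert(index: Dict[str, Dict[str, Any]],
           records: List[Tuple[str, Dict[str, Any]]]) -> None:
    """Insert each record, replacing any existing entry with the same id."""
    for vector_id, doc in records:
        index[vector_id] = doc


index: Dict[str, Dict[str, Any]] = {}

# First delta: one new document.
upsert(index, [("user-1", {"name": "Ada", "plan": "free"})])

# Second delta: an update to user-1 plus a new document.
upsert(index, [
    ("user-1", {"name": "Ada", "plan": "pro"}),
    ("user-2", {"name": "Lin", "plan": "free"}),
])

print(index)
```

After both deltas the model holds two entries, and the entry for `user-1` reflects only the latest document, which is the "effectively once" outcome described above for lastWriteWins collections.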