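LangChain Handbook (Python Edition), Module 19: Vector Databases in Indexes

Vectorstores are one of the most important components of building indexes.

For an introduction to vector stores and their generic functionality, see:

Getting Started

We also provide documentation for all supported vector store types. See the list below.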

AnalyticDB
Annoy
Atlas
Chroma
Deep Lake
DocArrayHnswSearch
DocArrayInMemorySearch
ElasticSearch
FAISS
LanceDB
Milvus
MyScale
OpenSearch
PGVector
Pinecone
Qdrant
Redis
Supabase (Postgres)
Tair
Typesense
Vectara
Weaviate
Zilliz

Getting Started

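This notebook showcases basic functionality related to vector stores. A key part of working with vector stores is creating the vectors to put in them, which is usually done with embeddings. It is therefore recommended that you familiarize yourself with the embeddings notebook before diving in.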
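To make the role of embeddings concrete, here is a minimal sketch of what a vector store does: it turns each text into a vector, then ranks the stored texts by their similarity to the query vector. The word-count toy_embed function and the ToyVectorStore class are hypothetical stand-ins invented for illustration. A real setup would use an embedding model such as OpenAIEmbeddings and a store such as Chroma.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors (Counter objects)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Illustrative in-memory store; not LangChain's implementation."""

    def __init__(self):
        self._entries = []  # list of (text, vector) pairs

    def add_texts(self, texts):
        for text in texts:
            self._entries.append((text, toy_embed(text)))

    def similarity_search(self, query, k=4):
        query_vec = toy_embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,  # most similar first
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "the cat sat on the mat",
    "stocks fell sharply today",
    "a cat chased the mouse",
])
results = store.similarity_search("where is the cat", k=2)
print(results[0])
```

Word counts only capture exact word overlap, which is why real applications rely on learned embeddings that place texts with similar meaning close together.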

This covers generic high-level functionality common to all vector stores.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

with open('../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

embeddings = OpenAIEmbeddings()

docsearch = Chroma.from_texts(texts, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.

print(docs[0].page_content)

In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Add Texts

You can easily add texts to the vector store with the add_texts method. It returns a list of document IDs (in case you need to use them downstream).

docsearch.add_texts(["Ankush went to Princeton"])

['a05e3d0c-ab40-11ed-a853-e65801318981']

query = "Where did Ankush go to college?"
docs = docsearch.similarity_search(query)

docs[0]

Document(page_content='Ankush went to Princeton', lookup_str='', metadata={}, lookup_index=0)
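The returned IDs are what let downstream code refer back to specific entries. As a rough sketch of the idea, the hypothetical IdTrackingStore below assigns a UUID to each added text and returns the list of IDs. The class name and the get_by_id helper are invented for illustration and are not part of LangChain.

```python
import uuid


class IdTrackingStore:
    """Hypothetical store that hands back one ID per added text."""

    def __init__(self):
        self._texts_by_id = {}

    def add_texts(self, texts):
        ids = []
        for text in texts:
            doc_id = str(uuid.uuid4())  # random, practically unique identifier
            self._texts_by_id[doc_id] = text
            ids.append(doc_id)
        return ids

    def get_by_id(self, doc_id):
        return self._texts_by_id[doc_id]


store = IdTrackingStore()
new_ids = store.add_texts(["Ankush went to Princeton"])
print(store.get_by_id(new_ids[0]))
```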

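From Documents

We can also initialize a vector store directly from documents. This is useful when we use the method on the text splitter to get documents directly (handy when the original documents have associated metadata).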
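Conceptually, creating documents from raw text means splitting the text into chunks and attaching the source metadata to every chunk. The sketch below shows that idea with a hypothetical SimpleDocument dataclass and create_simple_documents helper. It splits at fixed character offsets purely for simplicity, which is an assumption of this example and not how CharacterTextSplitter chooses its split points.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_simple_documents(texts, metadatas, chunk_size):
    """Split each text into fixed-size chunks, copying its metadata onto every chunk."""
    documents = []
    for text, metadata in zip(texts, metadatas):
        for start in range(0, len(text), chunk_size):
            chunk = text[start:start + chunk_size]
            # Copy the dict so chunks do not share one mutable metadata object.
            documents.append(SimpleDocument(chunk, dict(metadata)))
    return documents


speech = "We cannot let this happen. Tonight, I call on the Senate to act."
chunks = create_simple_documents(
    [speech], [{"source": "State of the Union"}], chunk_size=30
)
for chunk in chunks:
    print(repr(chunk.page_content), chunk.metadata)
```

Because each chunk carries its own copy of the metadata, a search hit can always be traced back to the document it came from.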

documents = text_splitter.create_documents([state_of_the_union], metadatas=[{"source": "State of the Union"}])

docsearch = Chroma.from_documents(documents, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.

print(docs[0].page_content)

In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen.

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

AnalyticDB

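AnalyticDB for PostgreSQL is a massively parallel processing (MPP) data warehousing service designed to analyze large volumes of data online.

AnalyticDB for PostgreSQL is developed based on the open source Greenplum Database project and is enhanced with extensions by Alibaba Cloud. It is compatible with the ANSI SQL 2003 syntax and with the PostgreSQL and Oracle database ecosystems. It also supports row store and column store. AnalyticDB for PostgreSQL processes petabytes of data offline at a high performance level and supports highly concurrent online queries.

This notebook shows how to use functionality related to the AnalyticDB vector database. To run it, you should have an AnalyticDB instance up and running:

Use the AnalyticDB Cloud Vector Database, which offers a quick-deployment option.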

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import AnalyticDB

Split the documents and get embeddings by calling the OpenAI API

from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

Connect to AnalyticDB by setting the relevant environment variables.

export PG_HOST={your_analyticdb_hostname}
export PG_PORT={your_analyticdb_port} # Optional, default is 5432
export PG_DATABASE={your_database} # Optional, default is postgres
export PG_USER={database_username}
export PG_PASSWORD={database_password}

Then store your embeddings and documents in AnalyticDB

import os

connection_string = AnalyticDB.connection_string_from_db_params(
    driver=os.environ.get("PG_DRIVER", "psycopg2cffi"),
    host=os.environ.get("PG_HOST", "localhost"),
    port=int(os.environ.get("PG_PORT", "5432")),
    database=os.environ.get("PG_DATABASE", "postgres"),
    user=os.environ.get("PG_USER", "postgres"),
    password=os.environ.get("PG_PASSWORD", "postgres"),
)

vector_db = AnalyticDB.from_documents(
    docs,
    embeddings,
    connection_string=connection_string,
)
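For orientation, the connection string is a database URL built from the settings above. The sketch below shows how such a URL could be assembled from the same PG_* variables. The build_connection_string helper is hypothetical, and the URL layout shown is an assumption of this example; the exact string produced by AnalyticDB.connection_string_from_db_params is not confirmed by this page.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=None):
    """Assemble a PostgreSQL URL from PG_* settings, falling back to defaults."""
    env = os.environ if env is None else env
    driver = env.get("PG_DRIVER", "psycopg2cffi")
    host = env.get("PG_HOST", "localhost")
    port = int(env.get("PG_PORT", "5432"))
    database = env.get("PG_DATABASE", "postgres")
    user = env.get("PG_USER", "postgres")
    # Percent-encode the password so characters such as "@" do not break the URL.
    password = quote_plus(env.get("PG_PASSWORD", "postgres"))
    return f"postgresql+{driver}://{user}:{password}@{host}:{port}/{database}"


# Settings are passed as a dict here so the example does not depend on the
# shell environment; the host name and password are made up.
example = build_connection_string({"PG_HOST": "db.example.com", "PG_PASSWORD": "p@ss"})
print(example)
```

Converting the port with int() also surfaces a mistyped PG_PORT early, before any connection attempt is made.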

Query and retrieve data

query = "What did the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query)

print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Chroma

Chroma is a database for building AI applications with embeddings.

This notebook shows how to use functionality related to the Chroma vector database.

!pip install chromadb

# get a token: https://platform.openai.com/account/api-keys

from getpass import getpass

OPENAI_API_KEY = getpass()

import os

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

db = Chroma.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

Using embedded DuckDB without persistence: data will be transient

print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Similarity search with score

docs = db.similarity_search_with_score(query)

docs[0]

(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}),
 0.3949805498123169)
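Each result is a (document, score) pair. The score shown above is assumed here to be a distance, so that smaller values mean closer matches. Whether a given store returns a distance or a similarity depends on the store and its configuration, so check before comparing scores. A minimal sketch of distance-based scoring, using made-up two-dimensional vectors and a hypothetical search_with_score helper:

```python
def squared_l2_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_score(query_vec, entries, k=4):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, squared_l2_distance(query_vec, vec)) for text, vec in entries]
    scored.sort(key=lambda pair: pair[1])  # ascending: lower distance is better
    return scored[:k]


entries = [
    ("exact match", [1.0, 0.0]),
    ("unrelated", [0.0, 1.0]),
    ("partial match", [0.5, 0.5]),
]
for text, score in search_with_score([1.0, 0.0], entries, k=3):
    print(text, score)
```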

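Persistence

The following steps cover how to persist a ChromaDB instance.

Initialize the persisted ChromaDB

Create embeddings for each chunk and insert them into the Chroma vector database. The persist_directory argument tells ChromaDB where to store the database when it is persisted.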

# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persist_directory)

Running Chroma using direct local API.
No existing DB found in db, skipping load
No existing DB found in db, skipping load

Persist the database

We should call persist() to ensure that the embeddings are written to disk.

vectordb.persist()
vectordb = None

Persisting DB to disk, putting it in the save folder db
PersistentDuckDB del, about to run persist
Persisting DB to disk, putting it in the save folder db

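Load the database from disk and create the chain

Be sure to pass the same persist_directory and embedding_function as you did when you instantiated the database. Initialize the chain we will use for question answering.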

# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

Running Chroma using direct local API.
loaded in 4 embeddings
loaded in 1 collections
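The persist-then-reload cycle can be pictured with a much simpler store. The sketch below writes texts and vectors to a JSON file inside a directory and reads them back. The save_store and load_store helpers and the store.json file name are hypothetical, and Chroma's actual on-disk format is different.

```python
import json
import os
import tempfile


def save_store(entries, persist_directory):
    """Write (text, vector) entries to a JSON file inside persist_directory."""
    os.makedirs(persist_directory, exist_ok=True)
    path = os.path.join(persist_directory, "store.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"text": t, "vector": v} for t, v in entries], f)


def load_store(persist_directory):
    """Read entries back; return an empty list if nothing was persisted yet."""
    path = os.path.join(persist_directory, "store.json")
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [(item["text"], item["vector"]) for item in json.load(f)]


with tempfile.TemporaryDirectory() as tmp:
    persist_directory = os.path.join(tmp, "db")
    print(load_store(persist_directory))  # nothing persisted yet
    save_store([("hello", [0.1, 0.2])], persist_directory)
    reloaded = load_store(persist_directory)
    print(reloaded)
```

The sketch stores only texts and vectors, not the embedding function. That is one way to see why the same embedding function has to be supplied again on reload: queries must be embedded the same way as the stored texts.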

Pinecone

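Pinecone is a vector database with broad functionality.

This notebook shows how to use functionality related to the Pinecone vector database.

To use Pinecone, you must have an API key. See the Pinecone installation instructions.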

!pip install pinecone-client

import os
import getpass

PINECONE_API_KEY = getpass.getpass('Pinecone API Key:')

PINECONE_ENV = getpass.getpass('Pinecone Environment:')

We want to use OpenAIEmbeddings, so we have to get an OpenAI API key.

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
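Prompting with getpass works in an interactive notebook but blocks in scripts and scheduled jobs. One possible pattern, sketched below, is to prefer a value already present in the environment and prompt only as a fallback. The resolve_secret helper is hypothetical, and its env and prompt parameters exist only so that the example can run without user input.

```python
import getpass
import os


def resolve_secret(name, env=None, prompt=getpass.getpass):
    """Return env[name] if it is set and non-empty, otherwise ask via prompt."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return value
    return prompt(f"{name}: ")


# Demonstration with a fake environment and a fake prompt (no real keys involved).
from_env = resolve_secret("OPENAI_API_KEY", env={"OPENAI_API_KEY": "sk-from-env"})
from_prompt = resolve_secret("OPENAI_API_KEY", env={}, prompt=lambda _: "sk-typed")
print(from_env, from_prompt)
```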

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.document_loaders import TextLoader

loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

import pinecone 

# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_ENV  # next to api key in console
)

index_name = "langchain-demo"

docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)

# if you already have an index, you can load it like this
# docsearch = Pinecone.from_existing_index(index_name, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)