This walkthrough shows how to use Redis as the vector store, cache, and docstore in an Ingestion Pipeline.
Dependencies
Install and start Redis, and set up your OpenAI API key:
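One way to satisfy these dependencies is sketched below. The package names follow the LlamaIndex integration naming scheme, and the Docker command is just one of several ways to run a local Redis with the search module enabled; adjust both to your environment.

```shell
# Install LlamaIndex plus the Redis integrations used in this walkthrough
pip install llama-index \
  llama-index-vector-stores-redis \
  llama-index-storage-docstore-redis \
  llama-index-embeddings-huggingface

# Start Redis Stack locally (includes the vector search module)
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
```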
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["TOKENIZERS_PARALLELISM"] = "false"
This walkthrough uses the UPSERTS docstore strategy, which re-processes a document only when its content has changed. However, if you only want to handle duplicates, you can change the strategy to DUPLICATES_ONLY.
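Under the hood, docstore strategies compare document hashes to decide whether a document is new, unchanged, or modified. A minimal stdlib-only sketch of that idea (not the actual LlamaIndex implementation):

```python
import hashlib

def doc_hash(text: str) -> str:
    """Content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

store: dict[str, str] = {}  # doc_id -> last-seen content hash

def upsert(doc_id: str, text: str) -> str:
    """Return what an UPSERTS-style strategy would do with this document."""
    h = doc_hash(text)
    if doc_id not in store:
        store[doc_id] = h
        return "insert"      # never seen: ingest it
    if store[doc_id] == h:
        return "skip"        # unchanged duplicate: do nothing
    store[doc_id] = h
    return "update"          # content changed: re-process and upsert

print(upsert("a.txt", "hello"))   # insert
print(upsert("a.txt", "hello"))   # skip
print(upsert("a.txt", "hello!"))  # update
```

A DUPLICATES_ONLY-style strategy would keep the "skip" branch but treat changed content as a plain insert rather than an update.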
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.ingestion import (
DocstoreStrategy,
IngestionPipeline,
IngestionCache,
)
from llama_index.storage.kvstore.redis import RedisKVStore as RedisCache
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.redis import RedisVectorStore
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from redisvl.schema import IndexSchema
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
custom_schema = IndexSchema.from_dict(
{
"index": {"name": "redis_vector_store", "prefix": "doc"},
# customize fields that are indexed
"fields": [
# required fields for llamaindex
{"type": "tag", "name": "id"},
{"type": "tag", "name": "doc_id"},
{"type": "text", "name": "text"},
# custom vector field for bge-small-en-v1.5 embeddings
{
"type": "vector",
"name": "vector",
"attrs": {
"dims": 384,
"algorithm": "hnsw",
"distance_metric": "cosine",
},
},
],
}
)
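The vector field above is pinned to 384 dimensions (the output size of bge-small-en-v1.5) with cosine as the distance metric. As a quick stdlib-only illustration of what cosine distance measures (Redis computes this inside its HNSW index; this is not its implementation):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0.0 for identical directions, up to 2.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Because cosine distance depends only on direction, not magnitude, it is a common choice for text embeddings.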
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(),
embed_model,
],
docstore=RedisDocumentStore.from_host_and_port(
"localhost", 6379, namespace="document_store"
),
vector_store=RedisVectorStore(
schema=custom_schema,
redis_url="redis://localhost:6379",
),
cache=IngestionCache(
cache=RedisCache.from_host_and_port("localhost", 6379),
collection="redis_cache",
),
docstore_strategy=DocstoreStrategy.UPSERTS,
)
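The run below expects a documents list, which this walkthrough loads from ./test_redis_data with SimpleDirectoryReader (the same call shown later when re-running the pipeline). A minimal sketch that seeds that directory with two small files; the file names and contents here are illustrative assumptions, not the originals:

```python
from pathlib import Path

# Seed the directory that SimpleDirectoryReader("./test_redis_data", ...) reads.
# File names and contents below are placeholder assumptions.
data_dir = Path("./test_redis_data")
data_dir.mkdir(exist_ok=True)
(data_dir / "test1.txt").write_text("This is a test file: one!")
(data_dir / "test2.txt").write_text("This is a test file: two!")
```

With the files in place, documents = SimpleDirectoryReader("./test_redis_data", filename_as_id=True).load_data() produces the two documents ingested below.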
nodes = pipeline.run(documents=documents)
print(f"Ingested {len(nodes)} Nodes")
Ingested 2 Nodes
index = VectorStoreIndex.from_vector_store(
pipeline.vector_store, embed_model=embed_model
)
print(
index.as_query_engine(similarity_top_k=10).query(
"What documents do you see?"
)
)
I see two documents.
Next, reload the documents and run the pipeline a second time to exercise deduplication and upserting:
documents = SimpleDirectoryReader(
"./test_redis_data", filename_as_id=True
).load_data()
nodes = pipeline.run(documents=documents)
print(f"Ingested {len(nodes)} Nodes")
13:32:07 redisvl.index.index INFO Index already exists, not overwriting.
Ingested 2 Nodes
index = VectorStoreIndex.from_vector_store(
pipeline.vector_store, embed_model=embed_model
)
response = index.as_query_engine(similarity_top_k=10).query(
"What documents do you see?"
)
print(response)
As we can see, the data was deduplicated and upserted correctly! Only three nodes are in the index, even though we
ran the full pipeline twice.