Weaviate is a robust, open-source vector database that allows you to store and query data based on its meaning. It supports various modules for text, image, and multimodal vectorization, enabling semantic search, advanced filtering, and question-answering. Weaviate offers flexible deployment options and integrates seamlessly with popular machine learning models and frameworks, providing GraphQL, REST, and gRPC APIs for easy integration with your applications.
Key Features
Semantic Search: Store and query data based on semantic meaning, going beyond keyword matching.
Modular Architecture: Extend Weaviate's functionality with various modules for different data types and tasks.
High Performance: Optimized for speed and scalability to handle large datasets and complex queries.
Multiple APIs: GraphQL, REST, and gRPC APIs provide flexible integration options for your applications.
Horizontal Scalability: Easily scale Weaviate to handle growing data and query loads.
Installation and Deployment on Apolo
You can deploy Weaviate on Apolo, which handles the Helm chart deployment and integrates Weaviate with other applications running on the platform. This simplifies deployment and management while leaving room for customization and integration with your existing infrastructure. The Apolo deployment provides the following:
Persistent Storage: Automatically provisions persistent storage for your Weaviate data.
Ingress Configuration: Configure ingress for external access to Weaviate's APIs.
Cluster API Authentication: Set up authentication for Weaviate's cluster API.
Backups: Configure backups to an Apolo bucket.
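Once ingress is enabled, one way to confirm that the HTTP API is reachable from outside the cluster is to call Weaviate's readiness endpoint, /v1/.well-known/ready, which answers with HTTP 200 when the instance can serve requests. The sketch below uses only the Python standard library. The helper names readiness_url and is_ready and the example hostname are illustrative assumptions, not part of Apolo or Weaviate, and if authentication is enforced in front of the endpoint the request may also need credentials, which this sketch omits.

```python
import urllib.request


def readiness_url(base_url: str) -> str:
    """Build the readiness probe URL from an ingress endpoint."""
    return base_url.rstrip("/") + "/v1/.well-known/ready"


def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(readiness_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers unreachable hosts, HTTP error statuses, and timeouts;
        # ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    # Hypothetical endpoint; replace it with your own ingress hostname.
    print(readiness_url("https://weaviate.example.com/"))
```

A check like this is handy as a first step in scripts that load data, so they fail early when the endpoint is wrong or the instance is still starting.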
Parameter Descriptions
The following parameters can be set when deploying Weaviate using the Apolo CLI:
| Parameter | Type | Description |
| --- | --- | --- |
| app_name | String | Required. The name of your Weaviate application (used to name Kubernetes resources). Must adhere to Kubernetes naming conventions. Example: weaviate. |
| preset_name | String | Required. The name of the Apolo preset to use for resource allocation (e.g., cpu-small, gpu-medium). Determines CPU, memory, and GPU resources. Example: cpu-large. |
| persistence.size | String | Optional (default: 32Gi). The size of the persistent volume claim for Weaviate's data. Example: 64Gi. |
| ingress.enabled | Boolean | Optional (default: false). Enables ingress for external access to Weaviate's HTTP and gRPC APIs. Example: true. |
| ingress.clusterName | String | Optional (default: weaviate). The cluster name for ingress (used in the generated hostname). Only relevant if ingress is enabled. Example: cl1. |
| ingress.grpc.enabled | Boolean | Optional (default: false). Enables ingress for external access to Weaviate's gRPC API specifically. Example: true. |
| clusterApi.username | String | Optional. Username for Weaviate's cluster API. If not specified, it is automatically generated and stored as a secret. Example: taddeus |
| clusterApi.password | String | Optional. Password for Weaviate's cluster API. If not specified, it is automatically generated and stored as a secret. Example: 31n81tSIc$7il4Js |
| authentication.enabled | Boolean | Optional (default: false). Enable or disable client authentication. If not set or false, API key authentication must be configured. Example: true. |
| backups.enabled | Boolean | Optional (default: false). Enable or disable data backups. If enabled, the bucket is created with the name weaviate-backup by default. Example: true. |
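To make the defaults concrete, the sketch below collects the optional parameters and their documented defaults in a Python dictionary and overlays user-supplied values on top. The helper build_params and the flat dotted-key representation are assumptions made for illustration; they do not reflect how the Apolo CLI actually accepts or validates input.

```python
# Defaults copied from the parameter descriptions above.
DEFAULTS = {
    "persistence.size": "32Gi",
    "ingress.enabled": False,
    "ingress.clusterName": "weaviate",
    "ingress.grpc.enabled": False,
    "authentication.enabled": False,
    "backups.enabled": False,
}
REQUIRED = ("app_name", "preset_name")
# Optional parameters with no fixed default (generated when omitted).
GENERATED_IF_MISSING = ("clusterApi.username", "clusterApi.password")


def build_params(overrides: dict) -> dict:
    """Overlay user values on the documented defaults and check required keys."""
    known = set(DEFAULTS) | set(REQUIRED) | set(GENERATED_IF_MISSING)
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise KeyError(f"Unknown parameter(s): {', '.join(unknown)}")
    missing = [key for key in REQUIRED if key not in overrides]
    if missing:
        raise ValueError(f"Missing required parameter(s): {', '.join(missing)}")
    return {**DEFAULTS, **overrides}


params = build_params({
    "app_name": "weaviate",
    "preset_name": "cpu-large",
    "persistence.size": "64Gi",
    "ingress.enabled": True,
})
print(params["persistence.size"])  # 64Gi
print(params["backups.enabled"])   # False
```

Writing the configuration down this way makes it easy to see which values differ from the defaults before you deploy.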
Embedding modules are not available out of the box. For now, embeddings must be generated externally with an embedding model of your choice and then stored in Weaviate.
This example demonstrates connecting to Weaviate, defining a schema, embedding documents using the NV-Embed-v2 model, storing them in Weaviate, and performing a similarity search.
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
import weaviate

# Step 1: Connect to Weaviate
client = weaviate.Client(
    url="<your-ingress-endpoint>",
    auth_client_secret=weaviate.AuthApiKey(api_key="<your-cluster-api-password>"),
)

if client.is_ready():
    print("Connected to Weaviate!")
else:
    print("Weaviate is not ready.")
    exit(1)

# Step 2: Define a schema class in Weaviate
schema_class = {
    "class": "Document",
    "description": "A collection of documents for testing embeddings",
    "vectorizer": "none",  # We will provide our own vectors
    "properties": [
        {
            "name": "title",
            "description": "Title of the document",
            "dataType": ["text"],
        },
        {
            "name": "content",
            "description": "Content of the document",
            "dataType": ["text"],
        },
    ],
}

# Check if the class already exists
existing_classes = client.schema.get()['classes']
class_names = [c['class'] for c in existing_classes]

if "Document" not in class_names:
    client.schema.create_class(schema_class)
    print("Schema 'Document' created.")
else:
    print("Schema 'Document' already exists.")

# Step 3: Load NV-Embed-v2 model
model_name = "nvidia/NV-Embed-v2"
print("Loading NV-Embed-v2 model...")
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
print("Model loaded.")

# Step 4: Prepare a dataset
documents = [
    {"title": "The Impact of Climate Change", "content": "Climate change affects weather patterns, sea levels, and ecosystems."},
    {"title": "Artificial Intelligence in Healthcare", "content": "AI is revolutionizing diagnostics and treatment plans in healthcare."},
    {"title": "Advancements in Quantum Computing", "content": "Quantum computers use quantum bits to perform complex calculations."},
    {"title": "Renewable Energy Sources", "content": "Solar and wind energy are key to reducing carbon emissions."},
    {"title": "The Basics of Machine Learning", "content": "Machine learning enables computers to learn from data."},
    {"title": "Ocean Conservation Efforts", "content": "Protecting marine life is essential for ecological balance."},
    {"title": "Blockchain Technology Explained", "content": "Blockchain provides a decentralized ledger for transactions."},
    {"title": "The Human Immune System", "content": "The immune system defends the body against infections."},
    {"title": "Exploring the Solar System", "content": "Mars rovers are providing new insights about the red planet."},
    {"title": "History of the Internet", "content": "The internet has transformed communication and information sharing."},
]

# Step 5: Generate embeddings for the documents
def generate_embeddings(texts, prefix=""):
    max_length = 32768  # Adjust as needed
    inputs = [prefix + text + tokenizer.eos_token for text in texts]
    with torch.no_grad():
        embeddings = model.encode(inputs, instruction=prefix, max_length=max_length)
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Since we are encoding documents, no instruction prefix is needed
print("Generating embeddings for documents...")
doc_texts = [doc["content"] for doc in documents]
doc_embeddings = generate_embeddings(doc_texts)
print("Document embeddings generated.")

# Step 6: Store documents and embeddings in Weaviate
print("Adding documents to Weaviate...")
for doc, embedding in zip(documents, doc_embeddings):
    # Convert the embedding tensor to a list
    embedding_list = embedding.tolist()
    # Add the document to Weaviate with the vector
    client.data_object.create(
        data_object=doc,
        class_name="Document",
        vector=embedding_list,
    )
print("Documents added to Weaviate.")

# Step 7: Perform a similarity search
query_text = "How does renewable energy help combat climate change?"
query_prefix = "Instruct: Given a question, retrieve passages that answer the question\nQuery: "
print("Generating embedding for the query...")
query_embedding = generate_embeddings([query_text], prefix=query_prefix)[0]
query_embedding_list = query_embedding.tolist()
print("Query embedding generated.")

print("Performing similarity search in Weaviate...")
# Use the 'nearVector' filter to find similar documents
result = (
    client.query
    .get("Document", ["title", "content"])
    .with_near_vector({"vector": query_embedding_list})
    .with_limit(3)
    .do()
)

print("Search results:")
for idx, res in enumerate(result["data"]["Get"]["Document"], start=1):
    title = res.get("title", "No Title")
    content = res.get("content", "No Content")
    print(f"\nResult {idx}:")
    print(f"Title: {title}")
    print(f"Content: {content}")
```
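The NV-Embed-v2 example L2-normalizes each embedding before storing it. For unit-length vectors, cosine similarity reduces to a plain dot product, which makes the ranking produced by a near-vector search easier to reason about. The sketch below illustrates this with small hand-made vectors in pure Python. The function names and the toy vectors are assumptions for illustration only, and the code does not describe how Weaviate's index computes distances internally.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query, candidates, limit=3):
    """Return (name, score) pairs sorted by descending cosine similarity."""
    q = l2_normalize(query)
    scored = [(name, dot(q, l2_normalize(vec))) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy 3-dimensional "embeddings"; real models produce far larger vectors.
toy_vectors = {
    "renewable energy": [0.9, 0.1, 0.0],
    "climate change": [0.7, 0.6, 0.1],
    "blockchain": [0.0, 0.2, 0.9],
}

for name, score in rank_by_similarity([1.0, 0.3, 0.0], toy_vectors, limit=2):
    print(f"{name}: {score:.3f}")
```

With these toy values, the query points mostly along the first axis, so the two energy and climate vectors rank ahead of the unrelated one.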
This script demonstrates connecting to Weaviate, defining a schema, embedding documents using OpenAI embeddings, storing them in Weaviate via LlamaIndex, and performing a similarity search.
```python
import os

from llama_index.core import (
    VectorStoreIndex,
    Document,
    StorageContext,
    ServiceContext,
)
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
import weaviate
import openai

os.environ["OPENAI_API_KEY"] = "<your-api-key>"

# Step 1: Set OpenAI API key
openai.api_key = os.environ.get("OPENAI_API_KEY")
if not openai.api_key:
    print("Please set the OPENAI_API_KEY environment variable.")
    exit(1)

# Step 2: Connect to Weaviate using the v3 client
weaviate_url = "<your-ingress-endpoint>"
client = weaviate.Client(
    url=weaviate_url,
    auth_client_secret=weaviate.AuthApiKey(api_key="<your-cluster-api-password>"),
)

if client.is_ready():
    print("Connected to Weaviate!")
else:
    print("Failed to connect to Weaviate!")
    exit(1)

# Step 3: Define schema class in Weaviate
COLLECTION_NAME = "LlamaDocument"  # Using capital letter as required
schema = {
    "class": COLLECTION_NAME,
    "description": "A collection of documents for testing embeddings",
    "vectorizer": "none",  # We will provide our own vectors via OpenAI
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
            "description": "Title of the document",
        },
        {
            "name": "content",
            "dataType": ["text"],
            "description": "Content of the document",
        },
    ],
}

# Check if the class already exists
existing_classes = client.schema.get()['classes']
class_names = [c['class'] for c in existing_classes]

if COLLECTION_NAME not in class_names:
    client.schema.create_class(schema)
    print(f"Schema '{COLLECTION_NAME}' created.")
else:
    print(f"Schema '{COLLECTION_NAME}' already exists.")

# Step 3.1: Monkey-Patch the class_schema_exists Function
# This is necessary because llama_index's WeaviateVectorStore uses a v3 method that doesn't exist in v4
from llama_index.vector_stores.weaviate.utils import class_schema_exists
import llama_index.vector_stores.weaviate.utils as weav_utils

def new_class_schema_exists(client, class_name):
    return client.schema.contains(class_name)

weav_utils.class_schema_exists = new_class_schema_exists

# Step 4: Prepare the dataset
documents = [
    {"title": "The Impact of Climate Change", "content": "Climate change affects weather patterns, sea levels, and ecosystems."},
    {"title": "Artificial Intelligence in Healthcare", "content": "AI is revolutionizing diagnostics and treatment plans in healthcare."},
    {"title": "Advancements in Quantum Computing", "content": "Quantum computers use quantum bits to perform complex calculations."},
    {"title": "Renewable Energy Sources", "content": "Solar and wind energy are key to reducing carbon emissions."},
    {"title": "The Basics of Machine Learning", "content": "Machine learning enables computers to learn from data."},
    {"title": "Ocean Conservation Efforts", "content": "Protecting marine life is essential for ecological balance."},
    {"title": "Blockchain Technology Explained", "content": "Blockchain provides a decentralized ledger for transactions."},
    {"title": "The Human Immune System", "content": "The immune system defends the body against infections."},
    {"title": "Exploring the Solar System", "content": "Mars rovers are providing new insights about the red planet."},
    {"title": "History of the Internet", "content": "The internet has transformed communication and information sharing."},
    {"title": "The Evolution of Space Exploration", "content": "Human missions to Mars and beyond are shaping the future of space exploration."},
    {"title": "The Role of Genetics in Medicine", "content": "Genetic research is unlocking personalized treatments and therapies."},
    {"title": "Cybersecurity in the Digital Age", "content": "Protecting sensitive data is critical as cyber threats evolve."},
    {"title": "The Importance of Mental Health Awareness", "content": "Raising awareness about mental health promotes early intervention and support."},
    {"title": "Breakthroughs in Renewable Energy Storage", "content": "Innovative batteries are making renewable energy more reliable."},
    {"title": "Exploring the Deep Ocean", "content": "Underwater exploration reveals new species and ecosystems."},
    {"title": "The Science of Sleep", "content": "Understanding sleep cycles is key to improving health and productivity."},
    {"title": "The Future of Urban Transportation", "content": "Electric vehicles and smart infrastructure are transforming city transit."},
    {"title": "The Ethics of Artificial Intelligence", "content": "AI raises important questions about privacy, fairness, and accountability."},
    {"title": "The Rise of Virtual Reality", "content": "VR is revolutionizing gaming, training, and immersive experiences."},
    {"title": "The Power of Microorganisms", "content": "Microbes play a crucial role in agriculture, medicine, and industry."},
    {"title": "The History of Renewable Energy", "content": "From windmills to solar panels, renewable energy has evolved significantly."},
    {"title": "Exploring the Arctic", "content": "The Arctic holds clues to understanding climate change and global ecosystems."},
    {"title": "The Impact of Social Media", "content": "Social media shapes communication, relationships, and public discourse."},
    {"title": "The Basics of Cryptocurrency", "content": "Cryptocurrencies use blockchain technology for secure digital transactions."},
    {"title": "The Wonders of Human Brain", "content": "Neuroscience is uncovering how the brain processes information and emotions."},
    {"title": "Innovations in Agriculture", "content": "Precision farming and biotechnology are boosting crop yields."},
    {"title": "Understanding Climate Resilience", "content": "Building climate-resilient communities is crucial in adapting to change."},
    {"title": "The Future of Artificial Intelligence", "content": "Advances in AI are shaping industries and daily life."},
    {"title": "The Mysteries of Black Holes", "content": "Black holes challenge our understanding of physics and the universe."},
    {"title": "The Secrets of Ancient Civilizations", "content": "Archaeological discoveries reveal insights into ancient cultures and traditions."},
    {"title": "The Role of Nanotechnology in Medicine", "content": "Nanotechnology is enabling precise drug delivery and advanced treatments."},
    {"title": "The Impact of 5G Technology", "content": "5G networks are transforming communication and powering IoT advancements."},
    {"title": "Renewable Energy in Urban Planning", "content": "Cities are integrating solar and wind energy to promote sustainability."},
    {"title": "The Psychology of Motivation", "content": "Understanding intrinsic and extrinsic motivation can enhance personal achievement."},
    {"title": "The Importance of Biodiversity", "content": "Diverse ecosystems are vital for maintaining balance in nature."},
    {"title": "The Science of Artificial Photosynthesis", "content": "Artificial photosynthesis holds potential for renewable energy production."},
    {"title": "Advances in Autonomous Vehicles", "content": "Self-driving technology is reshaping transportation and logistics."},
    {"title": "The History of Electric Cars", "content": "Electric vehicles have evolved from early prototypes to modern innovations."},
    {"title": "Exploring Exoplanets", "content": "Scientists are discovering Earth-like planets in distant solar systems."},
    {"title": "The Future of Food Technology", "content": "Lab-grown meat and vertical farming are addressing global food demands."},
    {"title": "The Importance of Water Conservation", "content": "Efficient water use is essential to combat scarcity and climate change."},
    {"title": "The Evolution of Artificial Intelligence", "content": "AI has progressed from simple algorithms to advanced machine learning."},
    {"title": "The Physics of Gravitational Waves", "content": "Gravitational waves provide a new way to observe cosmic events."},
    {"title": "The Role of STEM Education", "content": "STEM programs prepare students for careers in science and technology."},
    {"title": "The Ethics of Genetic Engineering", "content": "CRISPR technology raises questions about the future of genetic modification."},
    {"title": "Renewable Energy in Developing Countries", "content": "Solar and wind projects are transforming energy access in remote areas."},
    {"title": "The Search for Dark Matter", "content": "Physicists are investigating the mysterious substance that shapes the universe."},
    {"title": "The Role of Artificial Intelligence in Finance", "content": "AI is improving fraud detection, trading strategies, and financial planning."},
    {"title": "The Impact of Climate Activism", "content": "Grassroots movements are driving policy changes and awareness on climate issues."},
]

# Convert to llama_index Documents
documents_list = [
    Document(
        text=doc["content"],
        metadata={"title": doc["title"]},
    )
    for doc in documents
]

# Step 5: Set up the embedding model
embed_model = OpenAIEmbedding(
    model="text-embedding-ada-002",
    embed_batch_size=10,  # Adjust batch size as needed
)

# Step 6: Set up the vector store using WeaviateVectorStore
vector_store = WeaviateVectorStore(
    weaviate_client=client,
    index_name=COLLECTION_NAME,
)

# Step 7: Create the storage and service contexts
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# Step 8: Build the index and add documents to Weaviate
print("Adding documents to Weaviate...")
index = VectorStoreIndex.from_documents(
    documents_list,
    storage_context=storage_context,
    service_context=service_context,
)
print("Documents added to Weaviate.")

# Step 9: Perform a similarity search
query_text = "How does renewable energy help combat climate change?"
print("Performing similarity search in Weaviate...")

# Perform the query using the query engine
# similarity_top_k sets how many nodes the retriever returns
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query(query_text)

# Print the results
print("\nSearch results:")
print(f"Response: {response}")
print("\nSource documents:")
for idx, node in enumerate(response.source_nodes, start=1):
    doc = node.node
    title = doc.metadata.get("title", "No Title")
    content = doc.get_text()
    print(f"\nResult {idx}:")
    print(f"Title: {title}")
    print(f"Content: {content}")
```
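The embed_batch_size setting in the LlamaIndex example controls how many texts are sent to the embedding model per request. The sketch below shows the general idea of splitting a list into fixed-size batches. The helper batched is hypothetical, written for illustration, and is not LlamaIndex's internal implementation.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 23 placeholder texts split into batches of 10.
texts = [f"document {i}" for i in range(23)]
batches = list(batched(texts, batch_size=10))
print([len(b) for b in batches])  # [10, 10, 3]
```

Smaller batches mean more requests but smaller payloads per request, so the right value depends on the embedding provider's limits and your dataset size.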