Extracting images and text from each page of a PDF.
Generating embeddings for these components using ColPali.
Storing the embeddings in LanceDB.
```python
from pathlib import Path

from tqdm import tqdm


def ingest_data(folder_with_pdfs: str, table_name: str = "demo", db_path: str = "lancedb"):
    model, processor = get_model_colpali()

    # Collect every PDF in the input folder.
    pdfs = [x for x in Path(folder_with_pdfs).iterdir() if x.name.endswith('.pdf')]
    print(f"Input PDFs {pdfs}")

    for pdf_path in tqdm(pdfs):
        print(f"Getting images and text from {pdf_path}")
        page_images, page_texts = get_pdf_images(pdf_path=pdf_path)

        print(f"Getting embeddings from {pdf_path}")
        page_embeddings = get_images_embedding(images=page_images, model=model, processor=processor)

        print(f"Adding to db {pdf_path}")
        table = add_to_db(pdf_path=pdf_path, page_images=page_images, page_texts=page_texts,
                          page_embeddings=page_embeddings, table_name=table_name, db_path=db_path)
        print(f"Done! {pdf_path} should be in {table} table.")

    print("All files are processed")
```
The processed data, including embeddings and metadata, is stored in LanceDB, a vector database optimized for high-speed search and retrieval.
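The add_to_db helper called from ingest_data is what actually writes to LanceDB. It is not reproduced in this section, but a minimal sketch of what it might look like follows; the column names, the base64 image field, and the create-or-append logic are assumptions rather than the pipeline’s exact implementation:

```python
# Hypothetical sketch of add_to_db: store each PDF page as one LanceDB row.
# The schema and image serialization below are illustrative assumptions.
import base64
import io

import lancedb


def add_to_db(pdf_path, page_images, page_texts, page_embeddings,
              table_name: str = "demo", db_path: str = "lancedb"):
    db = lancedb.connect(db_path)

    records = []
    for page_idx, (image, text, embedding) in enumerate(
            zip(page_images, page_texts, page_embeddings)):
        # Serialize the PIL page image so it can be stored next to its vectors.
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        records.append({
            "pdf_path": str(pdf_path),
            "page_number": page_idx,
            "page_text": text,
            "image_base64": base64.b64encode(buffer.getvalue()).decode("utf-8"),
            # ColPali produces one vector per image patch; keep them as a nested list.
            "page_embedding": embedding.tolist(),
        })

    # Append to the table if it already exists, otherwise create it.
    if table_name in db.table_names():
        db.open_table(table_name).add(records)
    else:
        db.create_table(table_name, data=records)
    return table_name
```

Storing the serialized page image alongside its embeddings lets the query path hand the exact page back to the vision model without re-reading the PDF.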
4. Deploy the Generative LLM
Once the data is ingested and stored in LanceDB, deploy the generative LLM server for processing multimodal queries. This server runs the Llama 3.2 Vision-Instruct model, enabling responses based on both text and visual data.
Deploying the Server: The deployment command sets up the generative LLM server within Apolo’s infrastructure, running the meta-llama/Llama-3.2-11B-Vision-Instruct model.
Secure Storage Integration: The model weights are accessed securely via the mounted storage:visual_rag directory.
Multimodal Inference: The server is configured to handle multimodal queries, i.e., requests that combine text and images in a single prompt.
With this setup, your generative LLM is ready to serve multimodal queries, providing the backbone for the Visual RAG pipeline. The system can now take the pages retrieved from LanceDB, pair them with the user’s query, and use the model to generate comprehensive, accurate responses.
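Because the server exposes an OpenAI-compatible API (note the /v1 suffix on the base URL used by ask_data below), a quick sanity check is to list the models it serves. The snippet below is a hedged example; the endpoint URL is the one used later in this article and the dummy API key is a placeholder:

```python
# Quick sanity check against the deployed endpoint. Assumes an OpenAI-compatible
# API; the base_url and dummy API key are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://generation-inference--9771360698.jobs.scottdc.org.neu.ro/v1",
    api_key="not-needed",  # many self-hosted servers ignore the key
)

# Lists the models served by the endpoint; expect Llama-3.2-11B-Vision-Instruct here.
for model in client.models.list():
    print(model.id)
```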
5. Querying the System
With the ingestion pipeline and LLM server running, you can query the system using the ask_data function:
```python
def ask_data(user_query="What is market share by region?", table_name: str = "demo",
             db_path: str = "lancedb",
             base_url: str = "http://generation-inference--9771360698.jobs.scottdc.org.neu.ro/v1",
             top_k: int = 5):
    model, processor = get_model_colpali()
    print(f"Asking {user_query} query.")

    print("1. Search relevant images")
    query_embeddings = get_query_embedding(query=user_query, model=model, processor=processor)
    results = search_db(query_embeddings=query_embeddings, processor=processor,
                        db_path=db_path, table_name=table_name, top_k=top_k)
    print(f"result most relevant {results}")

    print("2. Build prompt")
    # https://cookbook.openai.com/examples/custom_image_embedding_search#user-querying-the-most-similar-image
    prompt = f"""
    Below is a user query, I want you to answer the query using images provided.
    user query: {user_query}
    """
    print(f"Prompt = {prompt}")

    print("3. Query LLM with prompt and relevant images")
    input_images = [results[idx]['pil_image'] for idx in range(top_k)]
    llm_response = run_vision_inference(input_images=input_images, prompt=prompt, base_url=base_url)
    print(f"llm_response {llm_response}")
```
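ask_data depends on get_model_colpali and get_query_embedding, which were set up during ingestion. If you are reimplementing them, a rough sketch built on the colpali-engine package could look like this; the vidore/colpali-v1.2 checkpoint, dtype, and device placement are assumptions:

```python
# Possible shape of the ColPali helpers used above, based on the colpali-engine
# package. The checkpoint name, bfloat16 dtype, and device_map are assumptions.
import torch
from colpali_engine.models import ColPali, ColPaliProcessor


def get_model_colpali(model_name: str = "vidore/colpali-v1.2"):
    model = ColPali.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    processor = ColPaliProcessor.from_pretrained(model_name)
    return model, processor


def get_query_embedding(query: str, model, processor):
    # ColPali embeds a query into multiple vectors (one per token), which are
    # later matched against the per-patch page embeddings stored in LanceDB.
    batch = processor.process_queries([query]).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch)
    return embeddings[0]
```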
Here’s how it works:
Query Embedding: The user query is embedded using ColPali in get_query_embedding.
Database Search: search_db retrieves the most relevant images based on embeddings.
Response Generation: A vision-enabled LLM (e.g., Llama 3.2) processes the prompt and images via run_vision_inference.
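The run_vision_inference step is, in essence, a multimodal chat-completion request against the OpenAI-compatible endpoint. A sketch of such a helper is shown below; the model name, base64 data-URL encoding, and message layout are assumptions about the deployment rather than the article’s verbatim code:

```python
# Illustrative version of run_vision_inference: send the prompt plus the retrieved
# page images to the OpenAI-compatible endpoint as base64 data URLs.
import base64
import io

from openai import OpenAI


def run_vision_inference(input_images, prompt: str, base_url: str,
                         model_name: str = "meta-llama/Llama-3.2-11B-Vision-Instruct"):
    client = OpenAI(base_url=base_url, api_key="not-needed")

    content = [{"type": "text", "text": prompt}]
    for image in input_images:
        # Encode each retrieved PIL page image as a base64 data URL.
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })

    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    )
    return response.choices[0].message.content
```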
6. Visualizing the Results
To enhance usability, integrate a Streamlit-based dashboard for querying and visualizing responses. The dashboard includes:
PDF Viewer: Displays available documents for context.
Search Input: Allows users to submit natural language queries.
Results Panel: Shows the retrieved images and the LLM-generated responses.
For example, querying “What is the market share by region?” retrieves visuals related to market share and generates a concise, context-aware response.
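As a rough illustration, a bare-bones version of that dashboard can be wired up in a few lines of Streamlit; it assumes ask_data is adapted to return the retrieved results and the LLM response instead of only printing them, and the module name in the import is hypothetical:

```python
# Minimal Streamlit front end for the pipeline (a sketch, not the full dashboard).
import streamlit as st

from rag_pipeline import ask_data  # hypothetical module holding the helpers above

st.title("Visual RAG demo")

user_query = st.text_input("Ask a question about your PDFs",
                           value="What is the market share by region?")

if st.button("Search") and user_query:
    with st.spinner("Retrieving pages and querying the LLM..."):
        # Assumes ask_data returns (results, llm_response) rather than printing.
        results, llm_response = ask_data(user_query=user_query)

    st.subheader("Answer")
    st.write(llm_response)

    st.subheader("Retrieved pages")
    for result in results:
        st.image(result["pil_image"], caption=result.get("pdf_path", ""))
```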