Architecture Overview

The Visual RAG pipeline consists of the following key components:

Data Ingestion: PDFs are uploaded to Apolo’s object storage and processed by a job that uses ColPali to generate embeddings for text and images.
Storage:

LanceDB serves as the vector database for storing embeddings.
Apolo’s storage backend is used to persist raw data and intermediate outputs.

Query Handling:

User queries are embedded using ColPali.
LanceDB retrieves the most relevant PDF pages by matching the query embedding against the stored text and image embeddings.

Response Generation: A visual LLM takes the retrieved pages and the user query as input and generates a comprehensive answer.
Visualization: Results are displayed via a Streamlit dashboard, showing the top-matched images and the LLM’s response.
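
To make the query-handling step concrete, here is a minimal sketch of how pages could be ranked against a query. It assumes ColPali-style multi-vector embeddings (one vector per query token and one per page patch) scored with late interaction (MaxSim). The `Page` class, the `retrieve` helper, and the toy two-dimensional vectors are hypothetical stand-ins for illustration only; the in-memory list takes the place of LanceDB, and the real pipeline's schema and search calls may differ.

```python
from dataclasses import dataclass

Vector = list[float]


@dataclass
class Page:
    """Hypothetical record for one indexed PDF page."""
    doc: str
    page_number: int
    patch_embeddings: list[Vector]  # one vector per image patch


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: list[Vector], page_patches: list[Vector]) -> float:
    """Late-interaction score: for each query token, take its
    best-matching patch similarity, then sum over all query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def retrieve(query_tokens: list[Vector], pages: list[Page], top_k: int = 2) -> list[Page]:
    """Return the top_k pages ranked by MaxSim score, highest first."""
    ranked = sorted(
        pages,
        key=lambda pg: maxsim_score(query_tokens, pg.patch_embeddings),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data (assumption): three pages with two patch vectors each.
pages = [
    Page("report.pdf", 1, [[1.0, 0.0], [0.9, 0.1]]),
    Page("report.pdf", 2, [[0.0, 1.0], [0.1, 0.9]]),
    Page("report.pdf", 3, [[0.7, 0.7], [0.5, 0.5]]),
]

# Toy query (assumption): two token vectors pointing mostly along the first axis.
query = [[1.0, 0.0], [0.8, 0.2]]

top = retrieve(query, pages, top_k=2)
for page in top:
    print(page.doc, page.page_number)
```

In the full pipeline, the pages returned by this step would be passed, together with the original query, to the visual LLM, and then shown alongside its answer in the Streamlit dashboard.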


