Architecture Overview

The Visual RAG pipeline consists of the following key components; an illustrative code sketch for each stage follows the list:

  1. Data Ingestion: PDFs are uploaded to Apolo’s object storage and processed by a job that uses ColPali to generate embeddings for text and images.

  2. Storage:

  • LanceDB serves as the vector database for storing embeddings.

  • Apolo’s storage backend is used to persist raw data and intermediate outputs.

  3. Query Handling:

  • User queries are embedded using ColPali.

  • LanceDB retrieves the most relevant PDF pages (text and image embeddings).

  4. Response Generation: A visual LLM takes retrieved pages and the user query as input, generating a comprehensive answer.

  5. Visualization: Results are displayed via a Streamlit dashboard, showing the top-matched images and the LLM’s response.
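
The sketch below illustrates the ingestion step under some assumptions: PDFs have already been synced from Apolo object storage to a local `./pdfs` directory, the `colpali-engine` and `pdf2image` packages are installed, and the `vidore/colpali-v1.2` checkpoint stands in for whatever ColPali variant the pipeline actually uses.

```python
# Sketch: convert PDF pages to images and embed them with ColPali.
# Assumes PDFs were already pulled from Apolo object storage to ./pdfs,
# and that pdf2image (with poppler) and colpali-engine are installed.
import torch
from pdf2image import convert_from_path
from colpali_engine.models import ColPali, ColPaliProcessor

MODEL_NAME = "vidore/colpali-v1.2"  # assumed checkpoint; the doc does not name one

model = ColPali.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="cuda:0"
).eval()
processor = ColPaliProcessor.from_pretrained(MODEL_NAME)

pages = convert_from_path("pdfs/report.pdf", dpi=150)  # one PIL image per page
batch = processor.process_images(pages).to(model.device)
with torch.no_grad():
    # One multi-vector embedding per page (one vector per image patch token).
    page_embeddings = model(**batch)
```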
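
Continuing the sketch, the page embeddings can be persisted in LanceDB. ColPali produces a multi-vector embedding per page; mean-pooling it into a single vector, as done here, is a simplification so that LanceDB's standard ANN search applies. The actual pipeline may instead keep the full multi-vectors and score them with ColPali's late-interaction (MaxSim) ranking.

```python
# Sketch: persist one pooled vector per page in LanceDB.
import lancedb

db = lancedb.connect("./lancedb")  # assumed path on Apolo's storage backend
rows = [
    {
        "vector": emb.mean(dim=0).float().cpu().tolist(),  # pooled page vector
        "pdf": "report.pdf",
        "page": page_num,
    }
    for page_num, emb in enumerate(page_embeddings)
]
table = db.create_table("pdf_pages", data=rows)
```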
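
Query handling mirrors ingestion: the query is embedded with the same ColPali model, pooled the same way, and used to search the LanceDB table. The query text here is only an example.

```python
# Sketch: embed the user query with ColPali and retrieve the closest pages.
query = "What drove revenue growth last quarter?"  # example query
q_batch = processor.process_queries([query]).to(model.device)
with torch.no_grad():
    q_emb = model(**q_batch)[0]

# Pool the query multi-vector the same way the page vectors were pooled.
q_vec = q_emb.mean(dim=0).float().cpu().tolist()
hits = table.search(q_vec).limit(3).to_list()  # top-matched pages
```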
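
For response generation, the documentation does not name the visual LLM or how it is served; the sketch below assumes an OpenAI-compatible endpoint (for example, a vLLM deployment on Apolo) and a hypothetical Qwen2-VL model, passing the retrieved page images alongside the query.

```python
# Sketch: send the retrieved page images plus the query to a visual LLM
# behind an OpenAI-compatible endpoint. The endpoint URL and model name
# are hypothetical; substitute whatever the deployment actually exposes.
import base64
import io

from openai import OpenAI

def to_data_url(image) -> str:
    """Encode a PIL image as a base64 data URL."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

client = OpenAI(base_url="http://visual-llm:8000/v1", api_key="EMPTY")
content = [{"type": "text", "text": query}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(pages[hit["page"]])}}
    for hit in hits
]
answer = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # hypothetical visual LLM
    messages=[{"role": "user", "content": content}],
).choices[0].message.content
```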
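
Finally, a minimal Streamlit view of the top-matched images and the generated answer might look like this; `retrieve` and `generate` are hypothetical placeholders for the retrieval and generation steps sketched above.

```python
# Sketch: a minimal Streamlit dashboard for the pipeline.
import streamlit as st

def retrieve(question):
    """Placeholder: run the ColPali + LanceDB retrieval sketched above."""
    return [], []

def generate(question, images):
    """Placeholder: call the visual LLM as sketched above."""
    return "(answer)"

st.title("Visual RAG on Complex PDFs")
question = st.text_input("Ask a question about your documents")
if question:
    hits, images = retrieve(question)
    st.subheader("Top-matched pages")
    for hit, img in zip(hits, images):
        st.image(img, caption=f"{hit['pdf']}, page {hit['page']}")
    st.subheader("Answer")
    st.write(generate(question, images))
```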