Bridging the Visual Divide: Proxy-Pointer RAG Achieves Grounded Image Retrieval in Enterprise Chatbots

The adage "a picture is worth a thousand words" has long been a cornerstone of effective communication. Yet, for enterprise chatbots, reliably delivering visual content grounded in their source documents has remained an elusive goal. While current systems can offer links to brochures, videos, or manuals, the direct inclusion of targeted, relevant images within chatbot responses has been hampered by significant technical challenges. This limitation is particularly acute in fields like real estate, where property visuals are paramount, or in technical support, where machine parameters are best understood through accompanying diagrams and tables. Now, a novel open-source solution, the MultiModal Proxy-Pointer Retrieval-Augmented Generation (RAG) pipeline, promises to bridge this critical gap by fundamentally rethinking how information is processed and retrieved.
The core innovation lies in its departure from traditional RAG methodologies. Instead of treating documents as mere collections of text chunks, the Proxy-Pointer approach views them as hierarchical semantic structures. This allows the system to maintain the integrity of document sections, including their embedded visual elements, ensuring that the context surrounding an image is preserved and accessible to the language model. This represents a significant advancement over existing methods, which often suffer from a misalignment between retrieval units and semantic units, leading to incomplete captions or the selection of visually similar but contextually irrelevant images.
The Challenge of Multimodal RAG: Beyond Text-Only Limitations
The quest for multimodal RAG, specifically the ability for chatbots to return images, has been fraught with difficulties. Current approaches primarily fall into two categories, each with inherent drawbacks:
Image Captioning: A Fragmented Approach
One prevalent method involves employing Optical Character Recognition (OCR) and vision models to convert images into textual descriptions. These captions are then indexed alongside the document’s text. However, this technique suffers from the limitations of traditional chunking mechanisms. As documents are broken down into fixed-size segments, image captions can be split across different chunks. This fragmentation means that when a chunk is retrieved, the language model (LLM) might only receive a partial caption, making it difficult to ascertain the image’s true relevance to the query or even to its adjacent, unretrieved text. Furthermore, the LLM synthesizing the response may be presented with multiple, potentially unrelated image captions from disparate documents, increasing the likelihood of misattribution or the decision to withhold all visual information to avoid errors.
Multimodal Embeddings: Similarity Without Grounding
Another avenue explored is the use of multimodal embedding models, which map both text and images into a shared vector space. While this enables cross-modal retrieval, allowing queries to match both text and image content, it prioritizes similarity over precise grounding. This can lead to situations where visually or structurally similar elements, such as financial tables from different reports, appear nearly identical in the vector space. Without the crucial context of document structure, the system retrieves candidates based on superficial similarity but lacks the confidence to determine which specific image is contextually accurate. This forces the LLM to choose from multiple plausible but potentially incorrect visuals, often leading it to err on the side of caution and return no image at all.
The Proxy-Pointer RAG pipeline offers a solution by replacing arbitrary text-based chunking with a tree-based structure that respects sectional boundaries. This ensures that an entire semantic unit—a section containing paragraphs and images—is treated as an independent entity. This approach allows the LLM to make more informed judgments about image relevance based on the complete contextual information of the section.
The Proxy-Pointer MultiModal Architecture: A Structured Approach to Visual Retrieval
The breakthrough achieved by the Proxy-Pointer MultiModal RAG pipeline is rooted in its ability to maintain document structure throughout the retrieval process. This is accomplished by building the standard RAG pipeline around a crucial premise: visual artifacts, such as figures, tables, and even video clips, can be extracted as separate files and stored alongside the document's textual content. While this is straightforward for web-based or XML documents, formats like PDF require specialized extractors, such as the Adobe PDF Extract API utilized in this project, to capture tables and figures as distinct artifacts.

Within the processed document, typically converted to a markdown format, these extracted images are referenced via relative paths. For instance, a figure might be represented as `![Figure 3](figures/figure_3.jpg)`, directly linking the textual reference to the actual image file.
The key insight driving this architecture is that the LLM does not necessarily need to "see" the image itself to determine its relevance. Instead, it needs to understand that an image exists within a specific, semantically coherent section of the document. By retrieving entire sections rather than fragmented chunks, Proxy-Pointer RAG provides the LLM with the complete contextual information necessary to make accurate relevance judgments. This transforms image selection from an open-ended search problem based on multimodal similarity into a conditional decision guided by the meaning of the section and the user’s query. This mirrors human reading habits, where initial section context guides the decision to examine specific visuals.
The Indexing Pipeline: Building a Structured Knowledge Base
The indexing process within the Proxy-Pointer framework has been adapted to accommodate multimodal data:
- Skeleton Tree Construction: The markdown headings are parsed into a hierarchical tree. Crucially, each node in this tree now includes a "figures" array, which lists all figures found within that specific section, along with their file paths. Each node carries a title, an ID, a line number, and a list of figures with their respective IDs and filenames (see the sketches following this list).
- Breadcrumb Injection: To enhance context, the full structural path of each section (e.g., "GaLore > 3. Methodology > 3.1. Zero Convolution") is prepended to the text before embedding. This provides a richer contextual signal for retrieval.
- Structure-Guided Chunking: Text is divided into chunks strictly within section boundaries, preventing the fragmentation of semantic units.
- Noise Filtering: An LLM is employed to identify and remove irrelevant sections, such as tables of contents, glossaries, executive summaries, and references, from the index.
- Pointer-Based Context: Retrieved chunks act as pointers, enabling the synthesizer to load the complete, unbroken document section, including its embedded image paths, for final processing.
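
To make the skeleton tree concrete, a node might look like the following minimal sketch. The field names and file paths are illustrative assumptions, not the repository's exact schema.

```python
# Minimal sketch of a skeleton-tree node; field names are illustrative.
node = {
    "title": "3.1. Zero Convolution",
    "node_id": "galore-3-1",
    "line": 142,  # line number of the heading in the markdown source
    "figures": [
        {"figure_id": "fig-3", "file": "figures/figure_3.jpg"},
        {"figure_id": "table-7", "file": "figures/table_7.jpg"},
    ],
    "children": [],  # child section nodes with the same shape
}
```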
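Breadcrumb injection and structure-guided chunking can then be sketched as below, under the same assumed node shape; the helper names and the character budget are hypothetical.

```python
# Hypothetical helpers illustrating breadcrumb injection and
# structure-guided chunking; not the repository's actual code.

def breadcrumbs(path_titles: list[str]) -> str:
    # e.g. ["GaLore", "3. Methodology", "3.1. Zero Convolution"]
    return " > ".join(path_titles)

def chunk_section(section_text: str, path_titles: list[str],
                  max_chars: int = 1200) -> list[str]:
    """Split one section into chunks that never cross section boundaries,
    prefixing each chunk with its full structural path before embedding."""
    prefix = breadcrumbs(path_titles) + "\n"
    paragraphs = section_text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(prefix + current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(prefix + current.strip())
    return chunks
```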
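The noise-filtering step can be approximated with a single prompt over section titles, as in this hypothetical sketch; the project's actual prompt may differ.

```python
# Sketch of the noise-filtering step: ask the LLM to flag boilerplate
# sections for exclusion from the index. Prompt wording is an assumption.
NOISE_PROMPT = (
    "You are indexing a document for retrieval. Given the section titles "
    "below, list the IDs of sections that are boilerplate (table of "
    "contents, glossary, executive summary, references) and should be "
    "excluded from the index.\n\n{titles}"
)

def noise_filter_prompt(nodes: list[dict]) -> str:
    titles = "\n".join(f"{n['node_id']}: {n['title']}" for n in nodes)
    return NOISE_PROMPT.format(titles=titles)
```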

The Retrieval Pipeline: From Broad Recall to Context-Aware Selection
The retrieval process has been refined for multimodal outputs:
- Stage 1 (Broad Recall): A vector index, such as FAISS, returns the top 200 chunks based on embedding similarity. These are then deduplicated by document and node ID so that each unique document section is considered once, narrowing the candidates to approximately 50 nodes (see the sketches following this list).
- Stage 2 (Anchor-Aware Structural Re-Ranking): The re-ranker receives the full breadcrumb path along with a short semantic snippet (150 characters) for each candidate. This is particularly important for academic papers, where headings can be generic. The semantic snippet gives the LLM a crucial hint, enabling it to pinpoint the most relevant sections among vague headings.
- Stage 3 (Synthesis and Context-Aware Image Selection): The synthesizer LLM reviews the top k=5 sections. It constructs the textual response and simultaneously makes visual decisions by scanning the selected sections for image paths, selecting at most six images deemed most relevant to the query. A notable capability here is the LLM's ability to generate accurate image labels even when the original figure or table lacks an explicit caption. This stage achieves 95% accuracy for image retrievals on a 20-question benchmark, as independently judged by Claude.
- Stage 4 (Vision Filter, Optional): For further refinement, an optional vision filter can be enabled. In this stage, the LLM visually analyzes the selected images, considers the user query and the generated text response, and discards any images that do not align. This yields highly curated images but adds a few seconds of latency.
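
Stage 1 can be sketched as follows, assuming a FAISS index over normalized chunk embeddings and a parallel metadata list carrying each chunk's doc_id and node_id; all names are illustrative.

```python
import faiss
import numpy as np

# Sketch of Stage 1: broad recall from FAISS, then dedup by (doc, node).
# `index`, `chunk_meta`, and `embed_query` are assumed to exist; the
# metadata layout is an illustrative assumption, not the project's schema.

def broad_recall(query: str, index: faiss.Index, chunk_meta: list[dict],
                 embed_query, top_k: int = 200, max_nodes: int = 50) -> list[dict]:
    q = np.asarray([embed_query(query)], dtype="float32")
    faiss.normalize_L2(q)                # cosine similarity via inner product
    _, ids = index.search(q, top_k)      # top-200 chunks
    seen, nodes = set(), []
    for chunk_id in ids[0]:
        meta = chunk_meta[chunk_id]
        key = (meta["doc_id"], meta["node_id"])
        if key in seen:
            continue                     # keep one chunk per section
        seen.add(key)
        nodes.append(meta)
        if len(nodes) >= max_nodes:
            break
    return nodes                         # ~50 unique section candidates
```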
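Stage 3's image handling reduces to scanning the retrieved sections for markdown image references and handing the candidates to the synthesizer, roughly as below; the prompt wording is an assumption.

```python
import re

# Sketch of Stage 3: scan the full retrieved sections for markdown image
# references, then let the synthesizer LLM pick at most six of them.
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)]+)\)")

def collect_image_candidates(sections: list[str]) -> list[dict]:
    candidates = []
    for text in sections:  # complete, unbroken sections (top k=5)
        for alt_text, path in IMAGE_REF.findall(text):
            candidates.append({"label": alt_text or "(uncaptioned)", "path": path})
    return candidates

def build_synthesis_prompt(query: str, sections: list[str]) -> str:
    # Prompt wording is an illustrative assumption, not the project's prompt.
    listing = "\n".join(f"- {c['path']}: {c['label']}"
                        for c in collect_image_candidates(sections))
    return (
        f"User query: {query}\n\n"
        "Sections:\n\n" + "\n\n---\n\n".join(sections) + "\n\n"
        f"Available images:\n{listing}\n\n"
        "Answer the query using only the sections. Then list at most six "
        "image paths that directly support the answer, labeling any "
        "uncaptioned figure or table. Return no images if none are relevant."
    )
```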
Prototype Implementation and Demonstrated Results
To validate the efficacy of the MultiModal Proxy-Pointer RAG pipeline, a prototype was developed using five AI research papers (all licensed under CC-BY): CLIP, Nemobot, GaLore, VectorFusion, and VectorPainter. These papers were chosen for their dense textual content and the presence of numerous figures, tables, and formulas, totaling 270 extractable images. The Adobe PDF Extract API was employed for PDF extraction, converting the documents into a markdown format with embedded image references.
The system utilizes the gemini-embedding-001 model for text embeddings, with dimensions reduced to 1536 for faster search and reduced memory usage. Notably, this is a text-only embedding model, avoiding the complexities and limitations of multimodal embeddings. For all LLM tasks—including noise filtering, re-ranking, synthesis, and the optional vision filter—the gemini-3.1-flash-lite-preview model is used. The vector index is managed by FAISS.
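
A minimal sketch of that embedding and index setup, assuming the google-genai Python SDK; the repository's actual wiring may differ.

```python
import faiss
import numpy as np
from google import genai
from google.genai import types

# Sketch of the embedding/index setup described above, assuming the
# google-genai SDK. The client reads the API key from the environment.
client = genai.Client()

def embed_texts(texts: list[str]) -> np.ndarray:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=texts,
        config=types.EmbedContentConfig(output_dimensionality=1536),
    )
    vecs = np.asarray([e.values for e in result.embeddings], dtype="float32")
    faiss.normalize_L2(vecs)  # truncated embeddings must be re-normalized
    return vecs

# Build the FAISS index over normalized vectors (inner product = cosine).
index = faiss.IndexFlatIP(1536)
index.add(embed_texts(["Breadcrumb > Section text ..."]))
```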
The results from a 20-question benchmark demonstrate the system’s robustness. Out of 20 queries, 17 yielded perfect retrievals, one resulted in no image being retrieved (which is a valid outcome if no relevant image exists), and two were partial retrievals. A critical observation is the absence of any instances where an incorrect image from an unrelated document was presented, thereby preserving user trust. This surgical accuracy is attributed to the core principles of Proxy-Pointer.

Illustrative Results:
- Precise Data Retrieval: When queried about hyperparameters for fine-tuning RoBERTa-Base for GaLore, the system not only provided the correct textual information but also identified and presented Table 7, detailing the hyperparameters, directly within the response.
- Cross-Document Reasoning: For a query comparing the memory-efficiency and knowledge-preservation strategies of GaLore and CLIP-CITE, the system retrieved and displayed relevant tables from both papers, providing a clear visual comparison of their approaches.
- Visual Query Interpretation: A query asking to describe the VectorFusion pipeline stages led to the retrieval of Figure 3, visually outlining the process from raster sampling to SVG conversion, and Figure 5, illustrating the latent score distillation procedure.
- Categorization by Visuals: When asked about the games implemented in Nemobot and their categorization, the system correctly identified and presented Table I, which categorizes the games according to Shannon's game taxonomy.
These examples highlight the system’s capability to go beyond simple text retrieval and integrate relevant visual evidence directly into the chatbot’s response, significantly enhancing user comprehension and trust.
Edge Cases and Design Considerations
While the Proxy-Pointer MultiModal RAG pipeline demonstrates remarkable effectiveness, certain edge cases and design trade-offs warrant consideration:
- LLM Non-Determinism: Even with a temperature setting of 0.0, LLM outputs can exhibit slight variations. Repeated queries might therefore surface marginally different, though still relevant, images; which image is perceived as "more" relevant can be subjective.
- Child-Node Figures: For highly specific queries, the system excels at locating precise formulas and figures. Broad queries that span multiple documents or broad sections, however, may retrieve header-level nodes; if the associated figures reside in child nodes that fall outside the k=5 context window, they may not be surfaced. Focusing queries on individual papers typically ensures that the relevant child nodes and their figures are brought into context.
- Detached Image Paths: The current approach assumes that an image path referenced within a retrieved section physically exists within that section. If a figure is referenced in the text but stored in a separate section (e.g., an Appendix) that is not retrieved, it will not be displayed. A potential workaround is to name image files descriptively (e.g., table_1.jpg) so the synthesizer can construct paths even when direct references are absent (sketched below), though the core principle of leveraging section context without multimodal embeddings remains central.
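
That naming-convention workaround can be sketched as a hypothetical fallback resolver; the filename pattern and directory layout are assumptions.

```python
import re
from pathlib import Path

# Hypothetical fallback for detached image paths: when a section mentions
# "Table 1" but embeds no image reference, guess a descriptively named
# file such as figures/table_1.jpg and keep it only if it exists.
MENTION = re.compile(r"\b(Table|Figure)\s+(\d+)\b", re.IGNORECASE)

def guess_detached_images(section_text: str, figures_dir: Path) -> list[Path]:
    guesses = []
    for kind, number in MENTION.findall(section_text):
        candidate = figures_dir / f"{kind.lower()}_{number}.jpg"
        if candidate.exists() and candidate not in guesses:
            guesses.append(candidate)
    return guesses
```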
Open-Source Availability and Future Directions
The Proxy-Pointer RAG framework, including its multimodal extension, is fully open-source under the MIT License and accessible via the Proxy-Pointer GitHub repository. The multimodal pipeline is being integrated into the existing repository, complementing the text-only version. The project is designed for rapid deployment, with a documented structure that supports a five-minute quickstart.
The project’s repository includes components for model selection (Gemini 3.1 Flash Lite), multimodal RAG logic, markdown tree generation, vector index building, PDF extraction using the Adobe PDF Extract API, and a unified data hub for processed markdown and figures. It also houses a benchmarking hub with test logs and queries, and a Streamlit UI for visualizing outputs.
Conclusion: Towards Truly Informative Chatbots
Multimodal responses have long been envisioned as the next evolutionary step for RAG systems. However, despite significant advancements in vision models and multimodal embeddings, the reliable retrieval of relevant images alongside text has remained an unsolved challenge. The underlying issue is a fundamental misalignment: traditional RAG operates on fragmented text chunks, while visual meaning, like semantic meaning, resides within the holistic structure of a document. Without aligning retrieval to these complete semantic units, even sophisticated models struggle to establish accurate visual associations.
The Proxy-Pointer MultiModal RAG pipeline directly addresses this gap by building upon a foundation of structured context rather than flat chunks. By retrieving complete document sections and treating image paths as pointers to artifacts within them, the system enables accurate, scalable, and cost-effective multimodal responses without the need for expensive multimodal embeddings. This represents a practical leap forward, empowering chatbots to not just narrate information but to visually demonstrate precise evidence, always grounded within the correct contextual framework. The open-source nature of this solution invites broader adoption and further innovation in the field of intelligent information retrieval.