Document Inlining: Crossing the Modality Gap with Compound AI

This blog post from Fireworks.ai introduces Document Inlining, a new compound AI system designed to improve how Large Language Models (LLMs) handle non-textual data such as PDFs and images. The system aims to bridge the "modality gap": vision-language models (VLMs) often produce lower-quality outputs than text-based LLMs processing the same information.

What is Document Inlining?

Document Inlining converts visual document data (PDFs, images) into structured text, making it readily digestible by LLMs. This two-step process involves parsing the visual content and then feeding the transcribed text to the LLM for processing and reasoning.
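
The two steps can be sketched in Python as follows. This is only an illustrative outline, not Fireworks.ai's implementation: the `Page` type, the `transcribe` function, and the `inline_and_ask` helper are hypothetical names, and the "parser" here simply formats data that is already text, standing in for a real PDF or image parser.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for one page of a visual document."""
    heading: str
    table_rows: List[List[str]]


def transcribe(pages: List[Page]) -> str:
    """Step 1 (assumed shape): turn visual content into structured text."""
    parts = []
    for number, page in enumerate(pages, start=1):
        parts.append(f"## Page {number}: {page.heading}")
        for row in page.table_rows:
            # Keep table structure visible to the LLM as pipe-delimited rows.
            parts.append("| " + " | ".join(row) + " |")
    return "\n".join(parts)


def inline_and_ask(pages: List[Page], question: str,
                   llm: Callable[[str], str]) -> str:
    """Step 2: feed the transcribed text and the question to a text-only LLM."""
    prompt = f"Document:\n{transcribe(pages)}\n\nQuestion: {question}"
    return llm(prompt)
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider, which mirrors the flexibility in model selection that the post describes.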

Addressing Challenges

Document Inlining addresses several challenges: performing accurate OCR on complex document structures such as tables and charts, managing the conversion pipeline, and optimizing for speed and cost by avoiding redundant transcriptions.
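
One common way to avoid redundant transcriptions is to cache results keyed by a hash of the document bytes. The sketch below illustrates that idea; the post does not say how Fireworks.ai does it, so the `TranscriptionCache` class and the `transcriber` callable are assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict


class TranscriptionCache:
    """Hypothetical cache: transcribe each distinct document only once."""

    def __init__(self, transcriber: Callable[[bytes], str]) -> None:
        self._transcriber = transcriber
        self._store: Dict[str, str] = {}
        self.misses = 0  # number of times the transcriber actually ran

    def get(self, document: bytes) -> str:
        # Identical bytes produce the same SHA-256 digest, so a repeated
        # document is served from the cache instead of being re-parsed.
        key = hashlib.sha256(document).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._transcriber(document)
        return self._store[key]
```

With a cache of this kind, asking several questions about the same PDF pays the parsing cost once; only the LLM call is repeated.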

Evaluation and Results

Fireworks.ai's evaluation shows that using Document Inlining with a text-based LLM outperforms using a VLM directly with the same visual input, demonstrating improved reasoning and accuracy. Furthermore, using Document Inlining with a VLM significantly improves its performance compared to directly feeding the VLM image data.

Conclusion

Document Inlining offers a more efficient and higher-quality alternative to using VLMs directly for document-based tasks. By leveraging the strengths of specialized text-based LLMs, this compound AI system simplifies the process for developers, improves accuracy, and offers flexibility in model selection. The system is currently in public preview at no additional cost beyond standard LLM usage fees.

Source(s):

Fireworks.ai Blog: Document Inlining: Crossing the Modality Gap with Compound AI