
Evaluating Fireworks Inlining: Strengths and Limitations in Parsing PDFs and Images


This post evaluates the document inlining functionality offered by Fireworks, which allows users to process PDFs and images by converting their content into a queryable format. This feature enables seamless extraction and analysis of text and structured data using natural language queries powered by Large Language Models (LLMs).

Document inlining converts PDFs and images into a representation that specialized text models can consume, enabling stronger reasoning on text-heavy tasks than traditional Vision-Language Models (VLMs). By reducing the need for manual data handling, it improves efficiency and enables richer interactions with multimodal content. For this evaluation, I used the Llama model (llama-v3p3-70b-instruct) deployed on Fireworks and explored its effectiveness in three practical cases.

For more details, see my blog entry on Fireworks Document Inlining.

Evaluation Environment

To replicate this evaluation, you will need:

  1. An API key from Fireworks AI, as document inlining and inference are performed via the Fireworks LLM provider.
  2. A Python environment to execute the script.

Python Script for Document Inlining

Following the Fireworks example, I developed a Python script to evaluate the document inlining capabilities for PDFs and images via the Fireworks API. The inlining API supports two input methods: providing direct URLs to documents (PDFs or images) or uploading local files. For this evaluation, I used local documents exclusively, as this better matches the intended use case of handling private or offline data.
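For reference, the URL-based method works by appending a `#transform=inline` fragment to the document URL inside an `image_url` content part of an OpenAI-style chat request. A minimal sketch of such a payload is shown below; the document URL and query are placeholders, and the model name follows the format used elsewhere in this post:

```python
# Sketch of the URL-based input method (not used in this evaluation).
# The "#transform=inline" fragment asks Fireworks to inline the document
# before the text model runs. The URL and query are placeholders.
doc_url = "https://example.com/paper.pdf#transform=inline"

payload = {
    "model": "accounts/fireworks/models/llama-v3p3-70b-instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": doc_url}},
                {"type": "text", "text": "Summarize this document."},
            ],
        }
    ],
}

print(payload["messages"][0]["content"][0]["image_url"]["url"])
```

This payload would be POSTed to the same chat completions endpoint used by the script below; the only difference from the local-file method is that the `url` field holds a remote address instead of a Base64 data URL.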

The script processes local files by encoding them as Base64 URLs, sends them to the API, and allows users to query the inlined content using natural language prompts. It supports multiple files in a single execution and provides an option to specify a custom LLM model, with the default set to "llama-v3p3-70b-instruct".
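The essential encoding step, stripped of the script's CLI handling, can be isolated as a small helper (the function name is illustrative):

```python
import base64

def to_inline_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw file bytes as a Base64 data URL; the #transform=inline
    fragment asks Fireworks to inline the document before inference."""
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}#transform=inline"

url = to_inline_data_url(b"%PDF-1.4 minimal", "application/pdf")
print(url.split(",")[0])  # → data:application/pdf;base64
```

The full script below applies this same transformation per file, choosing the MIME type from the file extension.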

To execute the script, run it from the command line, providing the file paths, a query prompt, and optionally a model name. For example:

python script_name.py file1.pdf file2.pdf "Your query here"

The complete Python script is provided below.

"""
Script for interacting with the Fireworks API to process files and generate responses based on a user-provided prompt.

Features:
- Supports multiple file types (PDF, PNG, JPG, GIF, TIFF) by encoding them as Base64 URLs.
- Sends the files and prompt to the Fireworks API for processing.
- Allows specifying an optional model name; defaults to "llama-v3p3-70b-instruct".
- Prints the API's response in a structured and user-friendly format.

Usage:
python script_name.py <path_to_file1> [<path_to_file2> ... <path_to_fileN>] <prompt> [<model_name>]

Requirements:
- Set the FIREWORKS_API_KEY environment variable with a valid API key.
- Install the requests library.

Example:
python script_name.py sample1.pdf sample2.pdf "Extract data from these documents." "accounts/fireworks/models/qwen2p5-72b-instruct"
"""

import requests
import json
import os
import sys
import base64

# Fetch the API key from the environment variable
api_key = os.getenv("FIREWORKS_API_KEY")

if not api_key:
    print("Error: FIREWORKS_API_KEY environment variable is not set.")
    sys.exit(1)

# API endpoint
url = "https://api.fireworks.ai/inference/v1/chat/completions"

# Function to encode a file to Base64 data URL based on its type
def encode_file_to_base64_data_url(file_path):
    _, file_extension = os.path.splitext(file_path.lower())

    # Map file extensions to MIME types
    mime_types = {
        ".pdf": "application/pdf",
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".gif": "image/gif",
        ".tiff": "image/tiff",
    }
    mime_type = mime_types.get(file_extension)
    if mime_type is None:
        print(f"Unsupported file extension: {file_extension}")
        sys.exit(1)

    # Read the file and encode it to Base64
    try:
        with open(file_path, "rb") as file:
            base64_data = base64.b64encode(file.read()).decode("utf-8")
        return f"data:{mime_type};base64,{base64_data}#transform=inline"
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' does not exist.")
        sys.exit(1)
    except Exception as e:
        print(f"An error occurred while encoding the file: {e}")
        sys.exit(1)

# Ensure there are at least two arguments (files and prompt)
if len(sys.argv) < 3:
    print("Usage: python script_name.py <path_to_file1> [<path_to_file2> ... <path_to_fileN>] <prompt> [<model_name>]")
    sys.exit(1)

# Separate file paths, prompt, and optional model name
if sys.argv[-1].startswith("accounts/"):
    model_name = sys.argv[-1]
    prompt = sys.argv[-2]
    file_paths = sys.argv[1:-2]
else:
    model_name = "accounts/fireworks/models/llama-v3p3-70b-instruct"
    prompt = sys.argv[-1]
    file_paths = sys.argv[1:-1]

# Validate that at least one file is provided
if not file_paths:
    print("Error: At least one file must be specified.")
    sys.exit(1)

# Print the model being used
print(f"\nUsing model: {model_name}\n")

# Encode all files to Base64 data URLs
data_urls = [encode_file_to_base64_data_url(file_path) for file_path in file_paths]

# Create messages for each file
file_messages = [
    {
        "type": "image_url",
        "image_url": {
            "url": data_url
        }
    }
    for data_url in data_urls
]

# Payload for the API request
payload = {
    "model": model_name,
    "max_tokens": 16384,
    "top_p": 1,
    "top_k": 40,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "temperature": 0.6,
    "messages": [
        {
            "role": "user",
            "content": file_messages + [
                {
                    "type": "text",
                    "text": prompt
                }
            ]
        }
    ]
}

# Headers for the API request
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}

try:
    # Print the prompt
    print("\nPrompt:")
    print(f"- {prompt}")

    # Print the files being processed
    print("\nFiles:")
    for file_path in file_paths:
        print(f"- {file_path}")

    # Make the POST request to the API
    response = requests.post(url, headers=headers, data=json.dumps(payload))

    # Raise an HTTPError for bad responses (4xx and 5xx)
    response.raise_for_status()

    # Parse the JSON response
    response_data = response.json()

    # Print formatted output
    print("\nResponse from API:")
    print(f"ID: {response_data.get('id')}")
    print(f"Model: {response_data.get('model')}\n")

    # Print choices header
    print("Response Text:")
    for choice in response_data.get('choices', []):
        print(f"- Finish Reason: {choice.get('finish_reason')}")
        print(f"  Content: \n{choice['message']['content']}\n")

    # Print usage details
    print("Usage:")
    usage = response_data.get('usage', {})
    print(f"- Prompt Tokens: {usage.get('prompt_tokens')}")
    print(f"- Completion Tokens: {usage.get('completion_tokens')}")
    print(f"- Total Tokens: {usage.get('total_tokens')}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

Evaluation Cases

Case 1: Comparing Two PDF Documents

Objective: Analyze two research papers in PDF format and identify key commonalities and differences.

Execution: The script was executed with the two locally stored PDF documents and the provided prompt. The command used was:

python3 inline_fireworks.py data/2408.02479v1.pdf data/2411.00932v1.pdf "Identify the titles of both provided papers and compare their content. Summarize their common points and highlight key differences."

Results: The following output was generated by the execution of the script:

The two papers provided are:

1. "From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future"
2. "LLMs: A Game-Changer for Software Engineers?"

Comparing their content, both papers discuss the application of Large Language Models (LLMs) in software engineering, highlighting their potential to revolutionize the field. They explore the capabilities and limitations of LLMs, as well as their future directions and challenges.

**Common points between the two papers include:**

- The discussion of LLMs' ability to automate tasks such as code generation, testing, and debugging
- The potential of LLMs to improve software development productivity and quality
- The need for human oversight and review of LLM-generated code to ensure accuracy and security
- The importance of addressing ethical concerns, such as bias and intellectual property, in the development and use of LLMs

**Key differences between the two papers include:**

- The first paper focuses specifically on LLM-based agents, which are designed to work autonomously and make decisions, whereas the second paper provides a more general overview of LLMs in software engineering.
- The first paper provides a detailed analysis of the architecture and benchmarks used in LLM-based agents, while the second paper discusses the technical strengths and benefits of LLMs in software engineering, as well as their challenges and future directions.
- The first paper is more focused on the academic and research aspects of LLM-based agents, while the second paper is more geared towards practitioners and industry professionals, providing a more practical perspective on the adoption and use of LLMs in software engineering.

Overall, both papers contribute to the growing body of research on LLMs in software engineering, highlighting their potential to transform the field and improve software development practices. However, they approach the topic from different angles, with the first paper focusing on the technical aspects of LLM-based agents and the second paper providing a more general and practical overview of LLMs in software engineering.

Evaluation: The model correctly identified the titles of both papers and produced a detailed comparison of their content. The results demonstrate a good grasp of the topics, covering commonalities and differences comprehensively. This case shows Fireworks' ability to analyze structured text and provide insightful comparisons, although slightly tighter phrasing of the common points and differences would further improve readability.

Case 2: OCR of Document Images

Objective: Perform OCR on two local document images in a single call, extracting their content and analyzing word differences.

The evaluation was conducted using two locally stored document images, shown below:

First document image
Figure 1: First document image used for OCR analysis.
Second document image
Figure 2: Second document image used for OCR analysis.

Execution: The script was executed with the two local image files and the provided prompt. The command used was:

python3 inline_fireworks.py data/example_01_mistral_exec_01.png data/example_01_mistral_exec_02.png "Identify common and unique words in the two provided files."

Results: The following output was generated by the execution of the script:

To identify the common and unique words in the two provided files, let's analyze the lists of Large Language Models (LLMs) mentioned:

### First File:
1. Mistral
2. ChatGPT
3. Clude LLaMA
4. Gemini
5. Falcon

### Second File:
1. Mistral
2. Mistral
3. ChatGPT
4. CLaude
5. LLaMA
6. Gemini/Falcon

### Common Words:
- **Mistral**: Appears in both files.
- **ChatGPT**: Appears in both files.
- **Gemini** and **Falcon**: While presented as a single entity in the second file (Gemini/Falcon), they are individually listed in the first file. Considering the intent, these can be seen as common, albeit presented differently.
- **LLaMA**: Appears in the first file as part of "Clude LLaMA" and separately in the second file. Assuming "Clude LLaMA" refers to the same model as "LLaMA" in the context of these lists, LLaMA can be considered common.

### Unique Words:
- **Clude**: This appears to be a typo or variation in the first file ("Clude LLaMA") and is presented as "CLaude" in the second file. If "CLaude" is considered a distinct model from "Clude LLaMA," then "CLaude" is unique to the second file.

Correcting for the apparent typo and considering the intent behind the naming:
- The first file's "Clude LLaMA" likely refers to "CLaude" and "LLaMA" as separate entities, making "CLaude" a unique mention in the second file when compared to the first file's listing.
- "LLaMA" is common, given it's part of "Clude LLaMA" in the first file and listed separately in the second.

Thus, the unique word, considering corrections for typos and intent, would be "CLaude" as it's explicitly listed that way only in the second file, and there's a duplication of "Mistral" in the second file not present in the first.

Evaluation: The OCR process was largely effective, capturing the key information and distinguishing common from unique words; the model even flagged apparent typos in the source images ("Clude" vs. "CLaude").

Case 3: Querying a Bus Timetable Image

Objective: Ask questions about a locally stored image of a bus timetable.

The evaluation was conducted using a locally stored image of a bus timetable, shown below:

Bus timetable image
Figure 3: The bus timetable image used for querying.
  • Document: A locally stored image of a bus timetable
  • Prompt: A question about the departure times from a specific stop (Flametree) from Monday to Friday.

Execution: The script was executed with the local image file and the provided prompt. The command used was:

python3 inline_fireworks.py data/whitsunday_bus_timetable.jpg "Based on the attached image, give me departure times at Flametree from Monday to Friday. Simply reply I don't know if you cannot find them in the attached picture."

Results: The following output was generated by the execution of the script:

I don't know.

To further evaluate performance, the script was executed using the qwen2p5-72b-instruct model with the following command:

python3 inline_fireworks.py data/whitsunday_bus_timetable.jpg "Based on the attached image, give me departure times from Flametree from Monday to Friday. Simply reply I don't know if you cannot find them in the attached picture." accounts/fireworks/models/qwen2p5-72b-instruct

The output remained the same, suggesting that the limitation lies in the inlining/OCR stage rather than in the specific model used.

Evaluation:
This case highlights a limitation in processing tabular data within images. The consistent "I don't know" response from both models suggests that the problem stems from the inlining process's parsing of the table or from the OCR's data extraction, rather than from the models themselves. Addressing these areas could improve Fireworks' handling of structured formats such as timetables.

Conclusion

The document inlining functionality in Fireworks demonstrates strong potential for parsing and querying PDFs and images, offering notable productivity benefits for tasks involving text extraction and analysis. The evaluation showed effective performance in processing and comparing text from PDF documents (Case 1) and extracting content from document images (Case 2), successfully identifying key details and generating meaningful outputs.

However, limitations were evident in interpreting structured data formats like timetables within images (Case 3), where the model consistently failed to extract relevant information, likely due to challenges in parsing during the inlining process or OCR. These results suggest that while Fireworks is a valuable tool for unstructured and semi-structured text, its capabilities in handling structured formats like tables require further development. Enhancing these aspects would significantly broaden its utility in real-world applications.


Enjoyed this post? Found it helpful? Feel free to leave a comment below to share your thoughts or ask questions. A GitHub account is required to join the discussion.
