Creating Python Scripts: DALL-E 3 Image Generation and GPT-4o OCR with OpenAI

Introduction

Rendering text accurately in AI-generated images matters for applications such as presentations, marketing, and educational content. In the previous post, Evaluating Text Precision in AI-Generated Images: A Comparison of DALL-E 3 and Mistral, I examined how these models handle text generation. That evaluation relied on two Python scripts, which are the focus of this post:

A script that generates images containing specific text using DALL-E 3 via the OpenAI API.
A script that extracts and checks text from the generated images using OCR (Optical Character Recognition) with GPT-4o, also via the OpenAI API.

Both scripts require a valid OpenAI API key. Make sure the key is available in your environment as OPENAI_API_KEY before running them. The key authenticates the scripts and lets them call OpenAI's endpoints for image generation and text extraction.
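
As a minimal sketch of that setup check, the snippet below reads the key and builds the request headers that both scripts send. The helper name build_auth_headers and its env parameter are hypothetical and exist only for illustration; OPENAI_API_KEY is the variable name the scripts themselves read.

```python
import os


def build_auth_headers(env=None):
    """Return request headers for the OpenAI API, or raise if the key is missing.

    `build_auth_headers` is a hypothetical helper name used for illustration.
    `env` defaults to the process environment; passing a plain dict makes the
    function easy to try out without touching real environment variables.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Please set the OPENAI_API_KEY environment variable.")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```

On Linux or macOS, the key can be set for the current shell session with export OPENAI_API_KEY="your-api-key" before running either script.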

This post gives an overview of the two scripts and explains their purpose and their role in evaluating text accuracy. Let's dive in!

Image Generation Script

This script uses OpenAI's DALL-E 3 API to generate images with precise text based on predefined prompts. It accepts a command-line argument that selects the prompt and saves the generated image to a specified file path. The script handles errors from API interactions and validates its inputs.

"""
Script Name: DALL-E 3 Image Generator

Description:
This script generates an image using OpenAI's DALL-E 3 model based on a predefined prompt and saves it to the specified output path.
It allows users to select one of three predefined prompts via a command-line argument.

Usage:
    python openai_image_gen.py <output_file_path> [prompt_number]

Arguments:
    <output_file_path> : The path where the generated image will be saved.
    [prompt_number]    : Optional. Specifies which prompt to use. Must be 1, 2, or 3. Defaults to 1.

Prompts:
    1. A professional presentation slide titled "Large Language Models (LLMs)" listing specific LLM names.
    2. A professional presentation slide titled "Company Structure" listing specific department names.
    3. A professional presentation slide titled "University Departments" listing specific academic departments.

Environment Variable:
    OPENAI_API_KEY: The OpenAI API key must be set as an environment variable.

Dependencies:
    - Python 3.7 or higher
    - requests library (install via pip install requests)

Examples:
    1. Generate an image with the first prompt and save to output_image.png:
        python openai_image_gen.py output_image.png

    2. Generate an image with the second prompt and save to output_image.png:
        python openai_image_gen.py output_image.png 2

Error Handling:
    - The script validates the presence of the API key.
    - Ensures the prompt number is within the valid range (1–3).
    - Provides detailed error messages for API failures.

"""

import os
import requests
import sys

def generate_image(output_path, prompt_number):
    """
    Generate an image using OpenAI's DALL-E 3 and save it to the specified output path.

    Parameters:
        output_path (str): Path to save the generated image.
        prompt_number (int): The prompt number to use (1, 2, or 3).
    """
    # Get the API key from the environment variable
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Please set the OPENAI_API_KEY environment variable.")

    # Define prompts
    prompts = [
        "A clean and professional presentation slide design with the title 'Large Language Models (LLMs)' at the top center. Below, list exactly these and only these names of LLMs as bullet points: 'Mistral,' 'ChatGPT,' 'Claude,' 'LLaMA,' 'Gemini,' and 'Falcon.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements.",
        "A clean and professional presentation slide design with the title 'Company Structure' at the top center. Below, list exactly these and only these department names as bullet points: 'Human Resources,' 'Finance,' 'Marketing,' 'Sales,' 'Operations,' and 'Research & Development.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements.",
        "A clean and professional presentation slide design with the title 'University Departments' at the top center. Below, list exactly these and only these university departments as bullet points: 'Computer Science,' 'Mathematics,' 'Physics,' 'Biology,' 'Economics,' and 'History.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements."
    ]

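    # Select the prompt based on the prompt number.
    # An explicit range check is needed because 0 or a negative number would
    # otherwise index the list from the end instead of raising an error.
    if not 1 <= prompt_number <= len(prompts):
        raise ValueError("Invalid prompt number. Please choose 1, 2, or 3.")
    prompt = prompts[prompt_number - 1]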

    # Prepare the headers and payload for the request
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

    payload = {
        "model": "dall-e-3",
        "prompt": prompt,
        "n": 1,
        "size": "1024x1024"
    }

    # Send the request to the OpenAI API (the timeout keeps the script from hanging indefinitely)
    response = requests.post("https://api.openai.com/v1/images/generations", headers=headers, json=payload, timeout=120)

    # Check for errors in the response
    if response.status_code != 200:
        raise RuntimeError(f"OpenAI API returned an error: {response.status_code} - {response.text}")

    # Extract the image URL from the response
    image_url = response.json()['data'][0]['url']

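    # Download the image and fail early if the download does not succeed,
    # so that an error page is never written to disk as an image
    image_response = requests.get(image_url, timeout=60)
    image_response.raise_for_status()
    with open(output_path, 'wb') as output_file:
        output_file.write(image_response.content)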

    print(f"Image saved to {output_path}")

if __name__ == "__main__":
    # Check if the correct arguments are provided
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        print("Usage: python openai_image_gen.py <output_file_path> [prompt_number]")
        sys.exit(1)

    # Get the output file path from the command-line argument
    output_file_path = sys.argv[1]

    # Get the prompt number, defaulting to 1
    try:
        prompt_number = int(sys.argv[2]) if len(sys.argv) == 3 else 1
    except ValueError:
        print("Prompt number must be an integer (1, 2, or 3).")
        sys.exit(1)

    try:
        generate_image(output_file_path, prompt_number)
    except Exception as e:
        # Report the failure and exit with a non-zero status so callers can detect it
        print(f"Error: {e}")
        sys.exit(1)
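
The script above fetches the image from a URL in a second request. As an alternative sketch, the Images API can return the image inline when the payload sets "response_format" to "b64_json"; the helper below decodes such a response body and writes it to disk. The function name save_b64_image is hypothetical and not part of the original script, and the snippet makes no network call itself.

```python
import base64


def save_b64_image(response_json, output_path):
    """Decode the first image of an Images API response and write it to a file.

    Assumes the request was sent with "response_format": "b64_json", so that
    each entry in "data" carries a "b64_json" field instead of a "url".
    `save_b64_image` is a hypothetical helper used for illustration.
    Returns the number of bytes written.
    """
    try:
        encoded = response_json["data"][0]["b64_json"]
    except (KeyError, IndexError) as exc:
        raise RuntimeError("Response does not contain a b64_json image.") from exc

    image_bytes = base64.b64decode(encoded)
    with open(output_path, "wb") as output_file:
        output_file.write(image_bytes)
    return len(image_bytes)
```

With this variant, the download step disappears: the parsed JSON from the generation request is passed to the helper together with the output path.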

OCR Script

This script extracts text from images using the OCR capabilities of OpenAI's GPT-4o. It encodes the image in Base64, sends it to the API, and retrieves the extracted text. The script is designed for accuracy and simplicity, which makes it easy to integrate into evaluation workflows.

"""
Script Name: OpenAI Image OCR with GPT-4o

Description:
This script extracts text from an image using OpenAI's GPT-4o model. It encodes the image in Base64 format, embeds it into a JSON payload, and sends it to OpenAI's API for processing. The script then extracts and prints the text detected in the image.

Usage:
    python openai_image_ocr.py <image_file_path>

Arguments:
    <image_file_path> : The path to the image file from which text will be extracted.

Environment Variable:
    OPENAI_API_KEY: The OpenAI API key must be set as an environment variable for authentication.

Workflow:
    1. Encode the image into a Base64 string.
    2. Send the image as part of a JSON payload to OpenAI's API.
    3. Retrieve and display the text extracted from the image.

Dependencies:
    - Python 3.7 or higher
    - requests library (install via pip install requests)

Error Handling:
    - Validates the presence of the API key.
    - Checks for a valid image file path as an argument.
    - Handles and displays any errors from the OpenAI API.

Examples:
    1. Extract text from an image file:
        python openai_image_ocr.py example.jpg

    2. Set the API key in the environment and extract text:
        export OPENAI_API_KEY="your-api-key"
        python openai_image_ocr.py example.jpg
"""

import base64
import requests
import os
import sys

def encode_image(image_path):
    """
    Encode an image as a base64 string for embedding in JSON payloads.
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def extract_text_from_image(image_path):
    """
    Extract text from an image using OpenAI's GPT-4o model.
    """
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Please set the OPENAI_API_KEY environment variable.")

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

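    # Derive the MIME type from the file extension so that PNG files are not
    # mislabeled as JPEG in the data URL
    mime_type = "image/png" if image_path.lower().endswith(".png") else "image/jpeg"

    payload = {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What text do you see in this image? Please provide only the extracted text without any additional commentary."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{encode_image(image_path)}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 300
    }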

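    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload, timeout=120)

    # Check for errors in the response before reading the result
    if response.status_code != 200:
        raise RuntimeError(f"OpenAI API returned an error: {response.status_code} - {response.text}")

    # Extract and return text from the response
    return response.json()['choices'][0]['message']['content']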

if __name__ == "__main__":
    # Check if an image file path is provided as an argument
    if len(sys.argv) != 2:
        print("Usage: python openai_image_ocr.py <image_file_path>")
        sys.exit(1)

    # Get the image file path from the command line argument
    image_path = sys.argv[1]

    try:
        # Extract text from the image
        extracted_text = extract_text_from_image(image_path)
        print("Extracted Text:")
        print(extracted_text)
    except Exception as e:
        # Report the failure and exit with a non-zero status so callers can detect it
        print(f"Error: {e}")
        sys.exit(1)
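
Since the point of the OCR step is to check whether the generated image contains the requested text, a comparison step can follow the extraction. The sketch below shows one possible way to do that, assuming a simple case-insensitive substring match. The function name check_expected_text, the matching rule, and the sample OCR output are illustrative assumptions, not the method or data used in the original evaluation.

```python
def check_expected_text(extracted_text, expected_items):
    """Report which expected strings appear in the OCR output.

    Matching is case-insensitive and ignores differences in whitespace.
    Returns a (found, missing) tuple of lists in the original order.
    """
    # Collapse runs of whitespace so line breaks in the OCR output do not matter
    normalized = " ".join(extracted_text.lower().split())
    found, missing = [], []
    for item in expected_items:
        needle = " ".join(item.lower().split())
        if needle in normalized:
            found.append(item)
        else:
            missing.append(item)
    return found, missing


if __name__ == "__main__":
    # Title and bullet points requested by prompt 3 of the generation script
    expected = ["University Departments", "Computer Science", "Mathematics",
                "Physics", "Biology", "Economics", "History"]
    # Hypothetical OCR output with one misspelled entry
    sample_output = ("University Departments\n- Computer Science\n- Mathemetics\n"
                     "- Physics\n- Biology\n- Economics\n- History")
    found, missing = check_expected_text(sample_output, expected)
    print("Missing:", missing)  # prints: Missing: ['Mathematics']
```

A substring match is deliberately strict here: a single wrong letter counts as a miss, which suits an evaluation of exact text rendering.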

Key Takeaways

Simple workflow: These scripts provide clear functionality for image generation and text extraction. Consider adding rate limiting if you use them for batch processing, to manage API quotas effectively.
Customizable: They can be adapted to different use cases or integrated into other AI tools. Remember to monitor API costs, and consider caching generated images to reduce them.
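
The rate-limiting and caching suggestions can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the original scripts: the names Throttle, cached_image_path, and generate_with_cache are hypothetical, and generate_fn stands in for whatever function performs the actual API call and writes the image file.

```python
import hashlib
import os
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_call = None

    def wait(self):
        # Sleep only for the part of the interval that has not elapsed yet
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def cached_image_path(prompt, cache_dir):
    """Map a prompt to a stable file path based on its SHA-256 hash."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{digest}.png")


def generate_with_cache(prompt, cache_dir, generate_fn, throttle):
    """Return the image path for a prompt, generating only on a cache miss.

    generate_fn(prompt, output_path) is a placeholder for the real API call
    and is expected to write the image to output_path.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = cached_image_path(prompt, cache_dir)
    if not os.path.exists(path):
        throttle.wait()  # throttle only the calls that reach the API
        generate_fn(prompt, path)
    return path
```

Keying the cache on a hash of the prompt means a repeated prompt reuses the stored image instead of triggering a new paid request, while the throttle spaces out the requests that do go through.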
