Building Python Scripts: Image Generation with DALL-E 3 and OCR with GPT-4o via OpenAI

Introduction

Rendering text accurately in AI-generated images matters for applications such as presentations, marketing, and educational content. In the previous article, Evaluating Text Accuracy in AI-Generated Images: A Comparison of DALL-E 3 and Mistral, I explored how these models handle text generation. That evaluation relied on two Python scripts, which are the focus of this article:

A script that generates images containing specific text using DALL-E 3 through the OpenAI API.
A script that extracts and verifies the text in the generated images using OCR (Optical Character Recognition) powered by GPT-4o, also through the OpenAI API.

Both scripts require a valid OpenAI API key. Before running them, set the key in the OPENAI_API_KEY environment variable. The key authenticates the scripts and authorizes them to call OpenAI's endpoints for image generation and text extraction.
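
As a quick pre-flight check, the key lookup that both scripts perform can be sketched on its own. This is a minimal sketch: the function name `load_api_key` and its `env` parameter are hypothetical and only there to make the check easy to test, while the variable name OPENAI_API_KEY comes from the scripts themselves.

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict is a convenience for
    testing and is an assumption of this sketch, not part of the scripts.
    """
    env = os.environ if env is None else env
    key = (env.get(var_name) or "").strip()
    if not key:
        raise ValueError(f"Please set the {var_name} environment variable.")
    return key
```

Calling `load_api_key()` before a long batch run fails early with a readable message rather than partway through the first request.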

This article gives an overview of these scripts, highlighting their purpose and their role in evaluating text accuracy. Let's dive in!

Image Generation Script

This script uses OpenAI's DALL-E 3 API to generate images containing precise text from predefined prompts. It accepts a command-line argument that selects the prompt and saves the generated image to a specified file path. The script handles errors from API interactions and validates its inputs.
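
The script validates its arguments by inspecting `sys.argv` directly. As an alternative, the same validation could be sketched with the standard `argparse` module, which rejects out-of-range prompt numbers on its own. This is only a sketch of that option, not the approach the script takes; `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the script's interface: <output_file_path> [prompt_number]."""
    parser = argparse.ArgumentParser(
        description="Generate an image from one of three predefined prompts."
    )
    parser.add_argument("output_file_path", help="Where to save the generated image.")
    parser.add_argument(
        "prompt_number",
        nargs="?",          # optional positional argument
        type=int,           # non-integers are rejected by argparse
        choices=[1, 2, 3],  # values outside this list are rejected too
        default=1,
        help="Which predefined prompt to use (default: 1).",
    )
    return parser
```

With this parser, `build_parser().parse_args(["output_image.png", "2"])` yields `prompt_number == 2`, and an invalid value makes `argparse` print a usage message and exit.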

"""
Script Name: DALL-E 3 Image Generator

Description:
This script generates an image using OpenAI's DALL-E 3 model based on a predefined prompt and saves it to the specified output path.
It allows users to select one of three predefined prompts via a command-line argument.

Usage:
    python openai_image_gen.py <output_file_path> [prompt_number]

Arguments:
    <output_file_path> : The path where the generated image will be saved.
    [prompt_number]    : Optional. Specifies which prompt to use. Must be 1, 2, or 3. Defaults to 1.

Prompts:
    1. A professional presentation slide titled "Large Language Models (LLMs)" listing specific LLM names.
    2. A professional presentation slide titled "Company Structure" listing specific department names.
    3. A professional presentation slide titled "University Departments" listing specific academic departments.

Environment Variable:
    OPENAI_API_KEY: The OpenAI API key must be set as an environment variable.

Dependencies:
    - Python 3.7 or higher
    - requests library (install via pip install requests)

Examples:
    1. Generate an image with the first prompt and save to output_image.png:
        python openai_image_gen.py output_image.png

    2. Generate an image with the second prompt and save to output_image.png:
        python openai_image_gen.py output_image.png 2

Error Handling:
    - The script validates the presence of the API key.
    - Ensures the prompt number is within the valid range (1–3).
    - Provides detailed error messages for API failures.

"""

import os
import requests
import sys

def generate_image(output_path, prompt_number):
    """
    Generate an image using OpenAI's DALL-E 3 and save it to the specified output path.

    Parameters:
        output_path (str): Path to save the generated image.
        prompt_number (int): The prompt number to use (1, 2, or 3).
    """
    # Get the API key from the environment variable
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Please set the OPENAI_API_KEY environment variable.")

    # Define prompts
    prompts = [
        "A clean and professional presentation slide design with the title 'Large Language Models (LLMs)' at the top center. Below, list exactly these and only these names of LLMs as bullet points: 'Mistral,' 'ChatGPT,' 'Claude,' 'LLaMA,' 'Gemini,' and 'Falcon.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements.",
        "A clean and professional presentation slide design with the title 'Company Structure' at the top center. Below, list exactly these and only these department names as bullet points: 'Human Resources,' 'Finance,' 'Marketing,' 'Sales,' 'Operations,' and 'Research & Development.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements.",
        "A clean and professional presentation slide design with the title 'University Departments' at the top center. Below, list exactly these and only these university departments as bullet points: 'Computer Science,' 'Mathematics,' 'Physics,' 'Biology,' 'Economics,' and 'History.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements."
    ]

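    # Select the prompt. Check the range explicitly: 0 or a negative number
    # would otherwise index from the end of the list instead of failing.
    if not 1 <= prompt_number <= len(prompts):
        raise ValueError("Invalid prompt number. Please choose 1, 2, or 3.")
    prompt = prompts[prompt_number - 1]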

    # Prepare the headers and payload for the request
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

    payload = {
        "model": "dall-e-3",
        "prompt": prompt,
        "n": 1,
        "size": "1024x1024"
    }

    # Send the request to OpenAI API
    response = requests.post("https://api.openai.com/v1/images/generations", headers=headers, json=payload)

    # Check for errors in the response
    if response.status_code != 200:
        raise RuntimeError(f"OpenAI API returned an error: {response.status_code} - {response.text}")

    # Extract the image URL from the response
    image_url = response.json()['data'][0]['url']

    # Download the image, fail on an HTTP error, and save it
    image_response = requests.get(image_url, timeout=60)
    image_response.raise_for_status()
    with open(output_path, 'wb') as output_file:
        output_file.write(image_response.content)

    print(f"Image saved to {output_path}")

if __name__ == "__main__":
    # Check if the correct arguments are provided
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        print("Usage: python openai_image_gen.py <output_file_path> [prompt_number]")
        sys.exit(1)

    # Get the output file path from the command-line argument
    output_file_path = sys.argv[1]

    # Get the prompt number, defaulting to 1
    try:
        prompt_number = int(sys.argv[2]) if len(sys.argv) == 3 else 1
    except ValueError:
        print("Prompt number must be an integer (1, 2, or 3).")
        sys.exit(1)

    try:
        generate_image(output_file_path, prompt_number)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)  # Signal failure to the calling shell

OCR Script

This script extracts text from images using the OCR capabilities of OpenAI's GPT-4o. It encodes the image in Base64, sends it to the API, and retrieves the extracted text. The script is designed for accuracy and simplicity, which makes it easy to integrate into evaluation workflows.
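
The Base64 step can be sketched in isolation. The script labels every image as `image/jpeg` in its data URL, although the generation examples save files with a `.png` extension. The helper below guesses the MIME type from the file extension instead. It is a sketch under that assumption; `to_data_url` is a hypothetical name and is not part of the script.

```python
import base64
import mimetypes


def to_data_url(image_path):
    """Return the file at image_path as a Base64 data URL.

    The MIME type is guessed from the file extension only; the file
    contents are not inspected.
    """
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could be placed in the `url` field of the `image_url` part of the payload in place of the hard-coded JPEG prefix.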

"""
Script Name: OpenAI Image OCR with GPT-4o

Description:
This script extracts text from an image using OpenAI's GPT-4o model. It encodes the image in Base64 format, embeds it into a JSON payload, and sends it to OpenAI's API for processing. The script then extracts and prints the text detected in the image.

Usage:
    python openai_image_ocr.py <image_file_path>

Arguments:
    <image_file_path> : The path to the image file from which text will be extracted.

Environment Variable:
    OPENAI_API_KEY: The OpenAI API key must be set as an environment variable for authentication.

Workflow:
    1. Encode the image into a Base64 string.
    2. Send the image as part of a JSON payload to OpenAI's API.
    3. Retrieve and display the text extracted from the image.

Dependencies:
    - Python 3.7 or higher
    - requests library (install via pip install requests)

Error Handling:
    - Validates the presence of the API key.
    - Checks for a valid image file path as an argument.
    - Handles and displays any errors from the OpenAI API.

Examples:
    1. Extract text from an image file:
        python openai_image_ocr.py example.jpg

    2. Set the API key in the environment and extract text:
        export OPENAI_API_KEY="your-api-key"
        python openai_image_ocr.py example.jpg
"""

import base64
import requests
import os
import sys

def encode_image(image_path):
    """
    Encode an image as a base64 string for embedding in JSON payloads.
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def extract_text_from_image(image_path):
    """
    Extract text from an image using OpenAI's GPT-4o model.
    """
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Please set the OPENAI_API_KEY environment variable.")

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What text do you see in this image? Please provide only the extracted text without any additional commentary."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{encode_image(image_path)}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 300
    }

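    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

    # Surface API errors instead of failing later with a KeyError on 'choices'
    if response.status_code != 200:
        raise RuntimeError(f"OpenAI API returned an error: {response.status_code} - {response.text}")

    # Extract and return text from the response
    return response.json()['choices'][0]['message']['content']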

if __name__ == "__main__":
    # Check if an image file path is provided as an argument
    if len(sys.argv) != 2:
        print("Usage: python openai_image_ocr.py <image_file_path>")
        sys.exit(1)

    # Get the image file path from the command line argument
    image_path = sys.argv[1]

    try:
        # Extract text from the image
        extracted_text = extract_text_from_image(image_path)
        print("Extracted Text:")
        print(extracted_text)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)  # Signal failure to the calling shell

Key Takeaways

Simple workflow: These scripts provide clear, focused functionality for image generation and text extraction. Consider adding rate limiting if you use them for batch processing, so that you stay within your API quotas.
Customizable: They can be adapted to a variety of use cases or integrated with other AI tools. Remember to monitor API costs, and consider caching generated images to reduce spending.
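
The two suggestions above, rate limiting and caching, can be sketched together. The wrapper below is a minimal sketch, not part of the scripts: the class name, the five-second default interval, and the choice to key the cache on a SHA-256 hash of the prompt are all assumptions made for illustration. The `generate` argument stands for any callable that takes a prompt string and returns image bytes.

```python
import hashlib
import time
from pathlib import Path


class CachedThrottledGenerator:
    """Wrap an image-generating callable with a disk cache and a minimum delay between calls."""

    def __init__(self, generate, cache_dir, min_interval=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.generate = generate          # callable: prompt (str) -> image bytes
        self.cache_dir = Path(cache_dir)
        self.min_interval = min_interval  # seconds between calls (assumed value)
        self.clock = clock                # injectable for testing
        self.sleep = sleep                # injectable for testing
        self._last_call = None

    def _cache_path(self, prompt):
        # One file per distinct prompt, named after the prompt's hash
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.png"

    def get(self, prompt):
        path = self._cache_path(prompt)
        if path.exists():
            return path.read_bytes()      # cache hit: no API call is made

        # Cache miss: wait until min_interval has passed since the last call
        if self._last_call is not None:
            wait = self.min_interval - (self.clock() - self._last_call)
            if wait > 0:
                self.sleep(wait)
        image_bytes = self.generate(prompt)
        self._last_call = self.clock()

        self.cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(image_bytes)
        return image_bytes
```

Because image generation is not deterministic, a cache like this always returns the first image produced for a prompt. That suits cost control, but an evaluation that needs several fresh samples per prompt would have to bypass it.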
