Building Python Scripts: Image Generation with DALL-E 3 and OCR with GPT-4o Using OpenAI

Introduction

Accurate text rendering in AI-generated images matters for applications such as presentations, marketing, and educational content. In the previous post, Evaluating Text Accuracy in AI-Generated Images: A Comparison of DALL-E 3 and Mistral, I explored how these models handle text generation. That evaluation relied on two Python scripts, which are the focus of this post:

A script that generates images containing specific text using DALL-E 3 through the OpenAI API.
A script that extracts and verifies text from the generated images using OCR (Optical Character Recognition) powered by GPT-4o, also through the OpenAI API.

Both scripts require a valid OpenAI API key. Before running them, set the key in the OPENAI_API_KEY environment variable. The scripts use it to authenticate against the OpenAI endpoints for image generation and text extraction.
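
A quick way to confirm this prerequisite before running either script is a small check like the one below. This is a minimal sketch and not part of the original scripts: the helper name require_api_key is hypothetical, and it only verifies that the variable is set, not that the key is valid.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it first, for example: "
            'export OPENAI_API_KEY="your-api-key"'
        )
    return api_key
```

Calling require_api_key() at the start of a batch run fails early with a readable message instead of partway through the first API request.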

This post gives an overview of both scripts and their role in evaluating text accuracy. Let's get started!

Image Generation Script

This script uses OpenAI's DALL-E 3 API to generate images that should contain exact text, based on predefined prompts. It accepts a command-line argument to select the prompt and saves the generated image to the specified file path. The script validates its inputs and handles errors from the API.

"""
Script Name: DALL-E 3 Image Generator

Description:
This script generates an image using OpenAI's DALL-E 3 model based on a predefined prompt and saves it to the specified output path.
It allows users to select one of three predefined prompts via a command-line argument.

Usage:
    python openai_image_gen.py <output_file_path> [prompt_number]

Arguments:
    <output_file_path> : The path where the generated image will be saved.
    [prompt_number]    : Optional. Specifies which prompt to use. Must be 1, 2, or 3. Defaults to 1.

Prompts:
    1. A professional presentation slide titled "Large Language Models (LLMs)" listing specific LLM names.
    2. A professional presentation slide titled "Company Structure" listing specific department names.
    3. A professional presentation slide titled "University Departments" listing specific academic departments.

Environment Variable:
    OPENAI_API_KEY: The OpenAI API key must be set as an environment variable.

Dependencies:
    - Python 3.7 or higher
    - requests library (install via pip install requests)

Examples:
    1. Generate an image with the first prompt and save to output_image.png:
        python openai_image_gen.py output_image.png

    2. Generate an image with the second prompt and save to output_image.png:
        python openai_image_gen.py output_image.png 2

Error Handling:
    - The script validates the presence of the API key.
    - Ensures the prompt number is within the valid range (1–3).
    - Provides detailed error messages for API failures.

"""

import os
import requests
import sys

def generate_image(output_path, prompt_number):
    """
    Generate an image using OpenAI's DALL-E 3 and save it to the specified output path.

    Parameters:
        output_path (str): Path to save the generated image.
        prompt_number (int): The prompt number to use (1, 2, or 3).
    """
    # Get the API key from the environment variable
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Please set the OPENAI_API_KEY environment variable.")

    # Define prompts
    prompts = [
        "A clean and professional presentation slide design with the title 'Large Language Models (LLMs)' at the top center. Below, list exactly these and only these names of LLMs as bullet points: 'Mistral,' 'ChatGPT,' 'Claude,' 'LLaMA,' 'Gemini,' and 'Falcon.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements.",
        "A clean and professional presentation slide design with the title 'Company Structure' at the top center. Below, list exactly these and only these department names as bullet points: 'Human Resources,' 'Finance,' 'Marketing,' 'Sales,' 'Operations,' and 'Research & Development.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements.",
        "A clean and professional presentation slide design with the title 'University Departments' at the top center. Below, list exactly these and only these university departments as bullet points: 'Computer Science,' 'Mathematics,' 'Physics,' 'Biology,' 'Economics,' and 'History.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements."
    ]

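    # Select the prompt based on the prompt number.
    # An explicit range check is needed: 0 or a negative number would
    # index from the end of the list instead of raising IndexError.
    if not 1 <= prompt_number <= len(prompts):
        raise ValueError("Invalid prompt number. Please choose 1, 2, or 3.")
    prompt = prompts[prompt_number - 1]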

    # Prepare the headers and payload for the request
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

    payload = {
        "model": "dall-e-3",
        "prompt": prompt,
        "n": 1,
        "size": "1024x1024"
    }

    # Send the request to OpenAI API (the timeout keeps the script from hanging indefinitely)
    response = requests.post("https://api.openai.com/v1/images/generations", headers=headers, json=payload, timeout=120)

    # Check for errors in the response
    if response.status_code != 200:
        raise RuntimeError(f"OpenAI API returned an error: {response.status_code} - {response.text}")

    # Extract the image URL from the response
    image_url = response.json()['data'][0]['url']

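    # Download the image and fail loudly if the download itself fails,
    # so an error page is never written to disk as if it were the image
    image_response = requests.get(image_url, timeout=120)
    image_response.raise_for_status()
    with open(output_path, 'wb') as output_file:
        output_file.write(image_response.content)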

    print(f"Image saved to {output_path}")

if __name__ == "__main__":
    # Check if the correct arguments are provided
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        print("Usage: python openai_image_gen.py <output_file_path> [prompt_number]")
        sys.exit(1)

    # Get the output file path from the command-line argument
    output_file_path = sys.argv[1]

    # Get the prompt number, defaulting to 1
    try:
        prompt_number = int(sys.argv[2]) if len(sys.argv) == 3 else 1
    except ValueError:
        print("Prompt number must be an integer (1, 2, or 3).")
        sys.exit(1)

    try:
        generate_image(output_file_path, prompt_number)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)  # Non-zero exit status so calling scripts can detect the failure
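
The script above fetches the image from a temporary URL in a second request. As an alternative, the Images API can be asked to return the image inline by adding "response_format": "b64_json" to the payload, in which case the image arrives as Base64 data under data[0]['b64_json'] instead of a URL. The helper below is a hypothetical sketch of the decode-and-save step under that assumption. It is not part of the original script.

```python
import base64


def save_b64_image(b64_data, output_path):
    """Decode a Base64-encoded image string and write the bytes to output_path.

    Returns the number of bytes written.
    """
    # validate=True rejects characters outside the Base64 alphabet
    image_bytes = base64.b64decode(b64_data, validate=True)
    with open(output_path, "wb") as output_file:
        output_file.write(image_bytes)
    return len(image_bytes)
```

With that payload option set, the call would look like save_b64_image(response.json()['data'][0]['b64_json'], output_path), which removes the separate download step.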

OCR Script

This script extracts text from images using the OCR capabilities of OpenAI's GPT-4o. It encodes the image in Base64, sends it to the API, and retrieves the extracted text. The script is deliberately simple, which makes it easy to integrate into evaluation workflows.

"""
Script Name: OpenAI Image OCR with GPT-4o

Description:
This script extracts text from an image using OpenAI's GPT-4o model. It encodes the image in Base64 format, embeds it into a JSON payload, and sends it to OpenAI's API for processing. The script then extracts and prints the text detected in the image.

Usage:
    python openai_image_ocr.py <image_file_path>

Arguments:
    <image_file_path> : The path to the image file from which text will be extracted.

Environment Variable:
    OPENAI_API_KEY: The OpenAI API key must be set as an environment variable for authentication.

Workflow:
    1. Encode the image into a Base64 string.
    2. Send the image as part of a JSON payload to OpenAI's API.
    3. Retrieve and display the text extracted from the image.

Dependencies:
    - Python 3.7 or higher
    - requests library (install via pip install requests)

Error Handling:
    - Validates the presence of the API key.
    - Checks for a valid image file path as an argument.
    - Handles and displays any errors from the OpenAI API.

Examples:
    1. Extract text from an image file:
        python openai_image_ocr.py example.jpg

    2. Set the API key in the environment and extract text:
        export OPENAI_API_KEY="your-api-key"
        python openai_image_ocr.py example.jpg
"""

import base64
import mimetypes
import requests
import os
import sys

def encode_image(image_path):
    """
    Encode an image as a base64 string for embedding in JSON payloads.
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def extract_text_from_image(image_path):
    """
    Extract text from an image using OpenAI's GPT-4o model.
    """
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Please set the OPENAI_API_KEY environment variable.")

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    # Derive the MIME type from the file extension (for example image/png for
    # a .png file saved by the generation script); fall back to JPEG if unknown
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

    payload = {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What text do you see in this image? Please provide only the extracted text without any additional commentary."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{encode_image(image_path)}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 300
    }

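    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload, timeout=120)

    # Surface API errors explicitly instead of failing later with a KeyError
    if response.status_code != 200:
        raise RuntimeError(f"OpenAI API returned an error: {response.status_code} - {response.text}")

    # Extract and return text from the response
    return response.json()['choices'][0]['message']['content']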

if __name__ == "__main__":
    # Check if an image file path is provided as an argument
    if len(sys.argv) != 2:
        print("Usage: python openai_image_ocr.py <image_file_path>")
        sys.exit(1)

    # Get the image file path from the command line argument
    image_path = sys.argv[1]

    try:
        # Extract text from the image
        extracted_text = extract_text_from_image(image_path)
        print("Extracted Text:")
        print(extracted_text)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)  # Non-zero exit status so calling scripts can detect the failure
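
To use the extracted text in an evaluation, the OCR output can be compared against the strings the prompt asked for. The sketch below is an assumption about how such a check might look, not the comparison used in the previous post: the function name check_expected_text, the sample OCR output, and the case-insensitive, whitespace-normalized substring match are all illustrative choices.

```python
import re


def normalize(text):
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_expected_text(extracted_text, expected_items):
    """Split expected_items into those found in the OCR output and those missing."""
    haystack = normalize(extracted_text)
    found = [item for item in expected_items if normalize(item) in haystack]
    missing = [item for item in expected_items if normalize(item) not in haystack]
    return found, missing


# Expected strings for prompt 1, taken from the image generation script
expected = ["Large Language Models (LLMs)", "Mistral", "ChatGPT",
            "Claude", "LLaMA", "Gemini", "Falcon"]

# Hypothetical OCR output in which one name is misspelled
ocr_output = ("Large Language Models (LLMs)\n- Mistral\n- ChatGPT\n"
              "- Claude\n- LLaMA\n- Gemni\n- Falcon")

found, missing = check_expected_text(ocr_output, expected)
print(f"Found {len(found)} of {len(expected)}; missing: {missing}")
```

For the sample above, the misspelled "Gemni" leaves "Gemini" in the missing list. A substring match like this only detects missing or misspelled items. It does not flag extra text that the image should not contain.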

Conclusions

Simple Workflow: These scripts provide clear, focused functionality for image generation and text extraction. If you use them for batch processing, consider adding rate limiting to stay within API quotas.
Customizable: They can be adapted to other use cases or integrated with other AI tools. Remember to monitor API costs, and consider caching generated images to reduce spending.
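
For batch runs, one common way to stay within API quotas is to retry rate-limited requests with exponential backoff. The sketch below is illustrative and not part of the original scripts: call_with_backoff and its parameters are hypothetical names, and it assumes the API signals rate limiting with HTTP status 429.

```python
import time


def call_with_backoff(send_request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send_request(), retrying with exponential backoff while it returns HTTP 429.

    send_request is any zero-argument callable that returns an object with a
    status_code attribute, such as a lambda wrapping requests.post(...).
    The last response is returned even if it is still a 429.
    `sleep` is injectable so the retry logic can be tested without waiting.
    """
    for attempt in range(max_retries + 1):
        response = send_request()
        if response.status_code != 429 or attempt == max_retries:
            return response
        # Wait base_delay, then 2x, 4x, ... before the next attempt
        sleep(base_delay * (2 ** attempt))
```

In either script, the existing request could be wrapped as call_with_backoff(lambda: requests.post(url, headers=headers, json=payload)), leaving the rest of the error handling unchanged.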
