Building Python Scripts: DALL-E 3 Image Generation and GPT-4o OCR with OpenAI

Introduction

Accurate text generation in AI-generated images is critical for applications like presentations, marketing, and educational content. In the previous post, Evaluating Text Precision in AI-Generated Images: A Comparison of DALL-E 3 and Mistral, I explored how these models handle text generation. That evaluation relied on two Python scripts, which are the focus of this post:

A script to generate images with specific text using DALL-E 3 via the OpenAI API.
A script to extract and verify text from the generated images using OCR (Optical Character Recognition) powered by GPT-4o, also via the OpenAI API.

Both scripts require a valid OpenAI API key to function. Ensure you have set up an API key in your environment as a prerequisite before running these scripts. The key enables authentication and allows the scripts to interact with OpenAI's endpoints for image generation and text extraction.

This post provides an overview of these scripts, highlighting their purpose and role in evaluating text precision. Let’s dive in!

Image Generation Script

This script uses OpenAI's DALL-E 3 API to generate images with precise text based on predefined prompts. It accepts a command-line argument for selecting the prompt and saves the generated image to a specified file path. The script ensures error handling for API interactions and validates inputs.

"""
Script Name: DALL-E 3 Image Generator

Description:
This script generates an image using OpenAI's DALL-E 3 model based on a predefined prompt and saves it to the specified output path. 
It allows users to select one of three predefined prompts via a command-line argument.

Usage:
  python openai_image_gen.py <output_file_path> [prompt_number]

Arguments:
  <output_file_path> : The path where the generated image will be saved.
  [prompt_number]    : Optional. Specifies which prompt to use. Must be 1, 2, or 3. Defaults to 1.

Prompts:
  1. A professional presentation slide titled "Large Language Models (LLMs)" listing specific LLM names.
  2. A professional presentation slide titled "Company Structure" listing specific department names.
  3. A professional presentation slide titled "University Departments" listing specific academic departments.

Environment Variable:
  OPENAI_API_KEY: The OpenAI API key must be set as an environment variable.

Dependencies:
  - Python 3.7 or higher
  - requests library (install via pip install requests)

Examples:
  1. Generate an image with the first prompt and save to output_image.png:
      python openai_image_gen.py output_image.png

  2. Generate an image with the second prompt and save to output_image.png:
      python openai_image_gen.py output_image.png 2

Error Handling:
  - The script validates the presence of the API key.
  - Ensures the prompt number is within the valid range (1–3).
  - Provides detailed error messages for API failures.

"""

import os
import requests
import sys

def generate_image(output_path, prompt_number):
  """
  Generate an image using OpenAI's DALL-E 3 and save it to the specified output path.

  Parameters:
      output_path (str): Path to save the generated image.
      prompt_number (int): The prompt number to use (1, 2, or 3).
  """
  # Get the API key from the environment variable
  api_key = os.getenv("OPENAI_API_KEY")
  if not api_key:
      raise ValueError("Please set the OPENAI_API_KEY environment variable.")

  # Define prompts
  prompts = [
      "A clean and professional presentation slide design with the title 'Large Language Models (LLMs)' at the top center. Below, list exactly these and only these names of LLMs as bullet points: 'Mistral,' 'ChatGPT,' 'Claude,' 'LLaMA,' 'Gemini,' and 'Falcon.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements.",
      "A clean and professional presentation slide design with the title 'Company Structure' at the top center. Below, list exactly these and only these department names as bullet points: 'Human Resources,' 'Finance,' 'Marketing,' 'Sales,' 'Operations,' and 'Research & Development.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements.",
      "A clean and professional presentation slide design with the title 'University Departments' at the top center. Below, list exactly these and only these university departments as bullet points: 'Computer Science,' 'Mathematics,' 'Physics,' 'Biology,' 'Economics,' and 'History.' Use a plain white background with simple black text to ensure clarity, and no other text or decorative elements."
  ]

  # Select the prompt based on the prompt number
  try:
      prompt = prompts[prompt_number - 1]
  except IndexError:
      raise ValueError("Invalid prompt number. Please choose 1, 2, or 3.")

  # Prepare the headers and payload for the request
  headers = {
      "Content-Type": "application/json",
      "Authorization": f"Bearer {api_key}",
  }

  payload = {
      "model": "dall-e-3",
      "prompt": prompt,
      "n": 1,
      "size": "1024x1024"
  }

  # Send the request to OpenAI API
  response = requests.post("https://api.openai.com/v1/images/generations", headers=headers, json=payload)

  # Check for errors in the response
  if response.status_code != 200:
      raise RuntimeError(f"OpenAI API returned an error: {response.status_code} - {response.text}")

  # Extract the image URL from the response
  image_url = response.json()['data'][0]['url']

  # Download and save the image
  image_data = requests.get(image_url).content
  with open(output_path, 'wb') as output_file:
      output_file.write(image_data)

  print(f"Image saved to {output_path}")

if __name__ == "__main__":
  # Check if the correct arguments are provided
  if len(sys.argv) < 2 or len(sys.argv) > 3:
      print("Usage: python openai_image_gen.py <output_file_path> [prompt_number]")
      sys.exit(1)

  # Get the output file path from the command-line argument
  output_file_path = sys.argv[1]

  # Get the prompt number, defaulting to 1
  try:
      prompt_number = int(sys.argv[2]) if len(sys.argv) == 3 else 1
  except ValueError:
      print("Prompt number must be an integer (1, 2, or 3).")
      sys.exit(1)

  try:
      generate_image(output_file_path, prompt_number)
  except Exception as e:
      print(f"Error: {e}")

OCR Script

This script extracts text from images using OpenAI's GPT-4o OCR capabilities. It encodes images in Base64 format, sends them to the API, and retrieves the extracted text. The script is designed for accuracy and simplicity, making it easy to integrate into evaluation workflows.

"""
Script Name: OpenAI Image OCR with GPT-4o

Description:
This script extracts text from an image using OpenAI's GPT-4o model. It encodes the image in Base64 format, embeds it into a JSON payload, and sends it to OpenAI's API for processing. The script then extracts and prints the text detected in the image.

Usage:
  python openai_image_ocr.py <image_file_path>

Arguments:
  <image_file_path> : The path to the image file from which text will be extracted.

Environment Variable:
  OPENAI_API_KEY: The OpenAI API key must be set as an environment variable for authentication.

Workflow:
  1. Encode the image into a Base64 string.
  2. Send the image as part of a JSON payload to OpenAI's API.
  3. Retrieve and display the text extracted from the image.

Dependencies:
  - Python 3.7 or higher
  - requests library (install via pip install requests)

Error Handling:
  - Validates the presence of the API key.
  - Checks for a valid image file path as an argument.
  - Handles and displays any errors from the OpenAI API.

Examples:
  1. Extract text from an image file:
      python openai_image_ocr.py example.jpg

  2. Set the API key in the environment and extract text:
      export OPENAI_API_KEY="your-api-key"
      python openai_image_ocr.py example.jpg
"""

import base64
import requests
import os
import sys

def encode_image(image_path):
  """
  Encode an image as a base64 string for embedding in JSON payloads.
  """
  with open(image_path, "rb") as image_file:
      return base64.b64encode(image_file.read()).decode('utf-8')

def extract_text_from_image(image_path):
  """
  Extract text from an image using OpenAI's GPT-4o model.
  """
  api_key = os.getenv("OPENAI_API_KEY")
  if not api_key:
      raise ValueError("Please set the OPENAI_API_KEY environment variable.")

  headers = {
      "Content-Type": "application/json",
      "Authorization": f"Bearer {api_key}"
  }

  payload = {
      "model": "gpt-4o",
      "messages": [
          {
              "role": "user",
              "content": [
                  {
                      "type": "text",
                      "text": "What text do you see in this image? Please provide only the extracted text without any additional commentary."
                  },
                  {
                      "type": "image_url",
                      "image_url": {
                          "url": f"data:image/jpeg;base64,{encode_image(image_path)}"
                      }
                  }
              ]
          }
      ],
      "max_tokens": 300
  }

  response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
  res_json = response.json()

  # Extract and return text from the response
  return res_json['choices'][0]['message']['content']

if __name__ == "__main__":
  # Check if an image file path is provided as an argument
  if len(sys.argv) != 2:
      print("Usage: python openai_image_ocr.py <image_file_path>")
      sys.exit(1)

  # Get the image file path from the command line argument
  image_path = sys.argv[1]

  try:
      # Extract text from the image
      extracted_text = extract_text_from_image(image_path)
      print("Extracted Text:")
      print(extracted_text)
  except Exception as e:
      print(f"Error: {e}")

Takeaways

Straightforward Workflow: These scripts provide clear functionality for image generation and text extraction. Consider adding rate limiting if using them for batch processing to handle API quotas effectively.
Customizable: They can be adapted to suit various use cases or integrated with other AI tools. Remember to monitor API costs and consider caching generated images to reduce expenses.

Enjoyed this post? Found it helpful? Feel free to leave a comment below to share your thoughts or ask questions. A GitHub account is required to join the discussion.