aithemes.net

Ever Wanted to Convert Your Documents to Markdown? Evaluating MarkItDown with Practical Cases

Explore how MarkItDown, an open-source tool by Microsoft, excels in converting PDFs, Excel sheets, and images to Markdown through real-world examples.

Created: Dec 21 2024
Last Update: Dec 21 2024
#MarkItDown #Content Conversion #Markdown Tools #Open Source #CLI Tools #PDF to Markdown #Excel to Markdown #Image to Text #Python #Microsoft

This post evaluates MarkItDown, an open-source tool by Microsoft for converting content into Markdown. It handles PDFs, Excel files, Word documents, and images from a single command-line interface. Whether you're a developer, content creator, or researcher, this guide demonstrates its practical applications through real-world examples.

Explore more about Markdown tools in my blog entry here.

Introduction to MarkItDown

MarkItDown is a command-line tool, also usable as a Python library, that converts diverse content formats into Markdown. It is a good fit for anyone who frequently works with Markdown documentation: it accepts formats such as PDFs, Excel sheets, Word documents, and images, and a single command turns each of them into a structured Markdown file.

Installation Instructions

Setting up MarkItDown is quick and hassle-free. Follow these steps to get started:

  1. Create a Python Virtual Environment and Activate It:

python3 -m venv venv

source venv/bin/activate

  2. Install MarkItDown with pip:

pip install markitdown

For more detailed instructions, refer to the official GitHub page.
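
To confirm from Python that the install landed in the active virtual environment, a small check like the one below may help. It is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("markitdown")
    if version is None:
        print("markitdown is not installed in this environment")
    else:
        print(f"markitdown {version} is installed")
```

If the script reports that the package is missing, the virtual environment is probably not activated in the current shell.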

Practical Use Cases for MarkItDown

Below are practical examples demonstrating how to leverage MarkItDown for converting different content types into Markdown. These examples highlight the tool’s capabilities and flexibility.

Case 1: Converting a PDF File to Markdown

The first use case demonstrates converting an NVIDIA specification PDF to Markdown format. You can download the PDF here. Execute the following command to achieve this:

markitdown data/jetson-orin-datasheet-nano-developer-kit-3575392-r2.pdf > data/jetson-orin-datasheet-nano-developer-kit-3575392-r2.md

Using MarkItDown’s command-line capabilities, the PDF will be converted into a well-structured Markdown file. Below is a visual representation of the original PDF and the converted Markdown file.

Original PDF

Figure 1: Original NVIDIA specification PDF.

Converted Markdown File

Figure 2: Converted Markdown file from NVIDIA specification PDF.
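
The same conversion can also be driven from Python instead of the shell. The sketch below assumes the library interface used by the image script later in this post (`MarkItDown().convert(path)` returning an object with a `text_content` attribute); the helper name `convert_file` and its `converter` parameter are illustrative, not part of MarkItDown.

```python
from pathlib import Path


def convert_file(src: str, dst: str, converter=None) -> int:
    """Convert `src` to Markdown, write it to `dst`, and return the character count.

    `converter` is any object with a `convert(path)` method whose result has a
    `text_content` attribute. When omitted, a default MarkItDown instance is used.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without the package.
        from markitdown import MarkItDown
        converter = MarkItDown()

    result = converter.convert(src)
    Path(dst).write_text(result.text_content, encoding="utf-8")
    return len(result.text_content)


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        print(convert_file(sys.argv[1], sys.argv[2]), "characters written")
```

Passing the converter in as a parameter keeps the helper easy to test and lets one instance be reused across many files.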

Case 2: Converting an Excel File to Markdown

The second use case involves converting an Excel sheet detailing LLM features into Markdown format. Execute the following command to perform the conversion:

markitdown data/LLM_Models_Info.xlsx > data/LLM_Models_Info.md

MarkItDown’s ability to handle tabular data ensures that the output retains its structure and readability. Below are the visuals of the original Excel file and the converted Markdown file.

Original Excel File

Figure 3: Original Excel file with LLM features.

Converted Markdown File

Figure 4: Converted Markdown file from Excel sheet.
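
For readers unfamiliar with how tabular data is written in Markdown, the sketch below renders rows as a pipe table. It only illustrates the general shape of the target format; it is not MarkItDown's implementation, and the sample rows are placeholders rather than data from the spreadsheet above.

```python
from typing import List, Sequence


def to_markdown_table(header: Sequence[str], rows: List[Sequence[object]]) -> str:
    """Render a header and rows as a Markdown pipe table."""
    def line(cells: Sequence[object]) -> str:
        # Escape literal pipes so they do not split a cell.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator] + [line(r) for r in rows])


if __name__ == "__main__":
    # Placeholder data, not taken from the spreadsheet used in this post.
    print(to_markdown_table(["Model", "Context"], [["model-a", 8192], ["model-b", 32768]]))
```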

Case 3: Converting a Word File to Markdown

The third use case involves converting a Word document detailing LLM features with a table of contents into Markdown format. Execute the following command to perform the conversion:

markitdown data/LLMs_features_with_TOC.docx > data/LLMs_features_with_TOC.md

MarkItDown’s ability to handle structured text ensures that the output retains its formatting, including the table of contents and nested headings, making it highly readable and organized. Below are the visuals of the original Word file and the converted Markdown file.

Original Word File

Figure 5: Original Word file with LLM features and table of contents.

Converted Markdown File

Figure 6: Converted Markdown file from Word document with table of contents and nested headings.
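
One way to check that nested headings survived a conversion is to list the heading outline of the generated file. The sketch below is a rough approach that assumes the output uses `#`-style headings; the function name `heading_outline` is hypothetical, and the regular expression is a simplification rather than a full Markdown parser.

```python
import re
from typing import List, Tuple

# One to six '#' characters, whitespace, then a title starting with a non-space.
HEADING = re.compile(r"^(#{1,6})\s+(\S.*?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def heading_outline(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) pairs for '#' headings, skipping fenced code blocks."""
    outline = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


if __name__ == "__main__":
    import sys
    from pathlib import Path
    if len(sys.argv) == 2:
        for level, title in heading_outline(Path(sys.argv[1]).read_text(encoding="utf-8")):
            print("  " * (level - 1) + title)
```

Running it against the converted file prints an indented outline that can be compared by eye with the table of contents in the Word document.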

Case 4: Extracting Text from an Image

MarkItDown also supports text extraction from images. For this case, we’ll use the NVIDIA specification PDF rendered as an image to compare the extracted text with the original.

To enhance this functionality, we’ll use OpenAI’s GPT-4o model via the OpenAI API, leveraging a script for text extraction. The script processes the image and outputs Markdown content.

Note: You need to have an OpenAI API key to use this feature. Ensure you have set up your OpenAI API key in your environment before running the script.
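
A quick pre-flight check can catch a missing key before any API call is made. This sketch relies on the OpenAI Python client reading the `OPENAI_API_KEY` environment variable by default; the helper name `has_api_key` is made up for illustration.

```python
import os
from typing import Mapping


def has_api_key(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when OPENAI_API_KEY is present and not blank."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if has_api_key():
        print("OPENAI_API_KEY found.")
    else:
        print("OPENAI_API_KEY is not set; export it before running the script.")
```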

The script below is adapted from the example in the MarkItDown GitHub repository.

"""
Script: convert_image_to_text.py

Description:
This script converts the text content from an input image file to plain text using the MarkItDown library and OpenAI's GPT-4o model. The input image file and the output text file are specified as command-line arguments.

Usage:
  python convert_image_to_text.py <input_image> <output_file>

Arguments:
  <input_image>: Path to the image file to be processed.
  <output_file>: Path to the file where the extracted text will be saved.

Dependencies:
  - markitdown
  - openai
  - Python 3.x

Example:
  python convert_image_to_text.py example.jpg output.txt
"""

import sys
from markitdown import MarkItDown
from openai import OpenAI

# Ensure both command-line arguments are provided
if len(sys.argv) != 3:
  print("Usage: python convert_image_to_text.py <input_image> <output_file>")
  sys.exit(1)

input_image = sys.argv[1]  # Input image file
output_file = sys.argv[2]  # Output text file

# Initialize the LLM client and MarkItDown
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

try:
  # Convert the image to text
  result = md.convert(input_image)

  # Write the result to the output file
  with open(output_file, "w", encoding="utf-8") as f:
    f.write(result.text_content)

  print(f"Conversion successful. Output written to {output_file}")

except Exception as e:
  # Report the failure and exit with a non-zero status so callers can detect it
  print(f"Error: {e}")
  sys.exit(1)

Execute the following command to perform the conversion:

python convert_image_to_text.py data/jetson-orin-datasheet-nano-developer-kit-3575392-r2.png data/jetson-orin-datasheet-nano-developer-kit-3575392-r2_jpg.md

Below are the visuals of the original image and the extracted Markdown content.

Original Image

Figure 7: Image rendered from NVIDIA specification PDF.

Extracted Markdown

Figure 8: Extracted Markdown content from the image.
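
To put a rough number on how closely the image extraction matches the Markdown produced from the PDF in Case 1, the two outputs can be compared with the standard library's `difflib`. This is only a sketch: the normalization step and the helper names are assumptions of this example, and a character-level ratio is a coarse measure that penalizes reordered sections.

```python
import difflib
from pathlib import Path


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not dominate."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between two texts."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        first = Path(sys.argv[1]).read_text(encoding="utf-8")
        second = Path(sys.argv[2]).read_text(encoding="utf-8")
        print(f"similarity: {similarity(first, second):.2f}")
```

Pass the two `.md` files as arguments; a ratio close to 1.0 suggests the image route recovered most of the same text.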

Unlock the Power of MarkItDown Today!

  • Open Source Advantage: MarkItDown’s open-source nature ensures accessibility and continuous community-driven improvements.
  • Versatility Across Formats: Whether PDFs, Excel sheets, Word documents, or images, MarkItDown simplifies the conversion process while maintaining the integrity of the original content.
  • Enhanced with AI: Combining MarkItDown with AI tools like OpenAI’s GPT-4o unlocks additional functionalities, such as accurate text extraction from images.
  • Efficient and User-Friendly: Its command-line interface ensures straightforward usage for both technical and non-technical users.
  • Easy Installation: Setting up MarkItDown is quick and hassle-free, making it accessible for users of all skill levels.
  • Satisfactory Results: In the cases demonstrated in this tutorial, MarkItDown consistently produced accurate and well-formatted Markdown files, preserving the structure and readability of the original content, including sections, headings, and tables of contents.

Discover the power of MarkItDown and integrate it into your content conversion workflows today!
