
Markdown Converter

Convert PDFs to structured markdown with AI

PDFs are notoriously hard to convert accurately. This blueprint builds an AI-powered converter that handles text PDFs, scanned documents, and complex layouts with columns, tables, headers, and embedded images, and produces clean, well-structured markdown.

Stack

  • EigenForge Agent Forge
  • PDF parser (pdfplumber, PyMuPDF)
  • Vision model for scanned PDFs
  • LLM for structure inference

Implementation

  1. Classify the PDF type

    The agent determines whether the PDF is text-based, scanned, or mixed, then routes it to the appropriate extraction pipeline based on that classification.
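
One way the classification could work is a per-page check of the embedded text layer. The sketch below is only an illustration: `PageStats`, `classify_page`, `classify_pdf`, and the `min_chars` threshold are hypothetical names and values, not part of the blueprint.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary produced by a PDF parser."""
    char_count: int   # characters found in the page's embedded text layer
    has_images: bool  # whether the page contains raster images


def classify_page(page: PageStats, min_chars: int = 50) -> str:
    """Label a single page as 'text', 'scanned', or 'empty'."""
    if page.char_count >= min_chars:
        return "text"
    # Little or no text layer plus an image suggests a scanned page.
    return "scanned" if page.has_images else "empty"


def classify_pdf(pages: list[PageStats], min_chars: int = 50) -> str:
    """Label the whole document as 'text', 'scanned', or 'mixed'."""
    kinds = {classify_page(p, min_chars) for p in pages} - {"empty"}
    if kinds == {"scanned"}:
        return "scanned"
    if kinds == {"text", "scanned"}:
        return "mixed"
    # Only text pages, or nothing but empty pages: no OCR is needed.
    return "text"
```

The statistics themselves could plausibly be gathered with pdfplumber (`page.chars`, `page.images`) or PyMuPDF (`page.get_text()`, `page.get_images()`). The threshold of 50 characters is an arbitrary starting point and would need tuning on real documents.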

  2. Extract and parse content

    For text PDFs, extract text with position data. For scanned PDFs, use OCR with vision model enhancement. Preserve reading order in multi-column layouts.
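
Preserving reading order usually means grouping text blocks into columns before sorting them top to bottom. The following is a minimal sketch of one such heuristic, assuming line-level blocks with bounding-box coordinates are already available. `Block`, `reading_order`, and `min_gutter` are illustrative names, not the blueprint's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block with page coordinates in points."""
    text: str
    x0: float   # left edge
    x1: float   # right edge
    top: float  # distance from the top of the page


def reading_order(blocks: list[Block], min_gutter: float = 5.0) -> list[Block]:
    """Order blocks column by column, top to bottom within each column."""
    columns: list[list[Block]] = []
    right_edge = 0.0
    # Sweep left to right; a block starting beyond the current column's
    # right edge (plus a small gutter) opens a new column.
    for block in sorted(blocks, key=lambda b: b.x0):
        if not columns or block.x0 > right_edge + min_gutter:
            columns.append([block])
            right_edge = block.x1
        else:
            columns[-1].append(block)
            right_edge = max(right_edge, block.x1)
    ordered: list[Block] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: (b.top, b.x0)))
    return ordered
```

This heuristic has a known weakness: a full-width block such as a page title overlaps every column and merges them into one. A fuller version would likely split the page into horizontal bands first and apply the column sweep within each band.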

  3. Identify document structure

    The agent infers heading hierarchy, table boundaries, list structures, and code blocks from visual layout and formatting cues.
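
Font size is one of the simpler formatting cues for heading hierarchy. As a rough sketch, assuming each line arrives with its font size from the parser, the most common size (weighted by text length) can be treated as body text and larger sizes ranked into heading levels. `assign_heading_levels` and `max_level` are hypothetical, and an LLM-based pass would presumably weigh further cues such as boldness, numbering, and spacing.

```python
from collections import Counter


def assign_heading_levels(lines, max_level=3):
    """Map (text, font_size) pairs to (text, level); level 0 means body text."""
    # Weight each size by the amount of text set in it, so that a few
    # short headings cannot outvote the body font.
    weight = Counter()
    for text, size in lines:
        weight[round(size, 1)] += len(text)
    if not weight:
        return []
    body_size = weight.most_common(1)[0][0]
    # Rank the sizes larger than body text: the largest becomes level 1.
    larger = sorted((s for s in weight if s > body_size), reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(larger[:max_level])}
    # Sizes outside the ranked set fall back to level 0 (body text).
    return [(text, level_of.get(round(size, 1), 0)) for text, size in lines]


def to_markdown(leveled_lines):
    """Render leveled lines as markdown headings and paragraphs."""
    return "\n\n".join(
        f"{'#' * level} {text}" if level else text
        for text, level in leveled_lines
    )
```

For example, lines sized 24, 16, and 10 points, with most text at 10 points, would map to `#`, `##`, and plain paragraphs respectively.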

  4. Generate structured markdown

    Convert the parsed structure into markdown. Handle tables (including merged cells), nested lists, footnotes, and cross-references.
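
Markdown pipe tables have no syntax for merged cells, so any converter has to pick a policy for them. The sketch below shows one possible policy, not necessarily the blueprint's own. It assumes the extracted table is a list of rows in which a cell covered by a horizontal merge arrives as `None`, and it repeats the value from the cell to its left. `table_to_markdown` is a hypothetical helper.

```python
def table_to_markdown(rows):
    """Render a list of rows as a markdown pipe table.

    Assumption: a cell that was part of a horizontal merge is None and
    should repeat the value of the cell to its left.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Escape pipes and flatten newlines so the row stays on one line.
        return str(cell).replace("|", "\\|").replace("\n", " ").strip()

    filled = []
    for row in rows:
        out = []
        for cell in row:
            if cell is None:
                out.append(out[-1] if out else "")
            else:
                out.append(clean(cell))
        # Pad short rows with empty cells so every row has equal width.
        out.extend([""] * (width - len(out)))
        filled.append(out)

    header, *body = filled
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Repeating the merged value keeps every row self-describing, which tends to suit downstream text processing. Tables with heavy row and column spans may be better emitted as inline HTML, which most markdown renderers accept.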

  5. Validate and clean up

    Compare page-by-page against the original. Flag any conversion issues. Clean up artifacts like page numbers, headers/footers, and hyphenation.
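
The cleanup and validation pass could look roughly like the sketch below, which works on plain page text using only the standard library. `clean_pages`, `page_coverage`, and the `min_repeat` threshold are hypothetical. The coverage score is a deliberately crude stand-in for the page-by-page comparison, not the blueprint's actual validator.

```python
import re
from collections import Counter

# Matches lines such as "7", "Page 7", or "Page 7 of 12".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)


def clean_pages(pages, min_repeat=0.6):
    """Strip page numbers, repeated headers/footers, and line-break hyphens."""
    # Blank lines and page-number lines are dropped for simplicity.
    split = [
        [ln.strip() for ln in page.splitlines()
         if ln.strip() and not PAGE_NUMBER.match(ln)]
        for page in pages
    ]
    # A first or last line repeated on most pages is treated as a running
    # header or footer.
    edges = Counter()
    for lines in split:
        if lines:
            edges.update({lines[0], lines[-1]})
    threshold = max(2, int(len(pages) * min_repeat))
    repeated = {line for line, count in edges.items() if count >= threshold}

    kept = []
    for lines in split:
        for i, line in enumerate(lines):
            if i in (0, len(lines) - 1) and line in repeated:
                continue
            kept.append(line)
    # Rejoin words hyphenated across a line break ("conver-\nsion").
    return re.sub(r"(\w)-\n(\w)", r"\1\2", "\n".join(kept))


def page_coverage(source_text, markdown_text):
    """Fraction of source words that also appear in the converted markdown."""
    source_words = re.findall(r"\w+", source_text.lower())
    if not source_words:
        return 1.0
    output_words = set(re.findall(r"\w+", markdown_text.lower()))
    return sum(word in output_words for word in source_words) / len(source_words)
```

Pages whose coverage falls below a chosen threshold could be flagged for review. Note that the de-hyphenation rule also joins genuinely hyphenated compounds that happen to break across lines, so a dictionary check would be a sensible refinement.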

What You Get

  • Handles text PDFs, scanned docs, and mixed layouts
  • Tables with merged cells correctly converted to markdown
  • Reading order preserved in multi-column documents
  • Page-by-page validation against the original PDF
