How to Convert a PDF to Markdown for AI and LLM Pipelines

PDF works well for storage and sharing but poorly as input for language models: scanned files have no selectable text, tables lose their structure during extraction, and reading order gets scrambled. Converting to Markdown yields structured text that a model can consume with little or no extra preprocessing.

There are three output formats to choose from. Markdown (.md) preserves headings (#, ##), lists, tables, and paragraphs, which makes it ideal for loading into a RAG pipeline, a vector store, or a ChatGPT/Claude context window. JSON with bounding boxes (.json) additionally records the on-page coordinates of every element: exactly where each paragraph, table cell, or heading sits on the page. You need this for citation-style answers ("see page 4, left column"). HTML (.html) suits embedding in web apps and knowledge-management systems.
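As a rough illustration of how bounding boxes support citation-style answers, the sketch below turns an element's coordinates into a phrase such as "page 4, left column". The field names (page_width, elements, page, bbox) and the [x0, y0, x1, y1] box layout are assumptions made for this example, not the converter's documented schema, so adapt them to the JSON you actually receive.

```python
import json

# Hypothetical export. Field names and bbox layout are assumptions,
# not the converter's documented schema.
SAMPLE = """
{
  "page_width": 612,
  "elements": [
    {"type": "heading", "text": "Results", "page": 4,
     "bbox": [72, 90, 290, 110]},
    {"type": "paragraph", "text": "Accuracy improved.", "page": 4,
     "bbox": [320, 120, 540, 200]}
  ]
}
"""


def describe_location(element, page_width):
    """Turn a bounding box into a human-readable citation."""
    x0, _, x1, _ = element["bbox"]
    center = (x0 + x1) / 2
    # Assumes a two-column page split at the horizontal midpoint.
    column = "left column" if center < page_width / 2 else "right column"
    return f"page {element['page']}, {column}"


doc = json.loads(SAMPLE)
citations = [describe_location(e, doc["page_width"]) for e in doc["elements"]]
print(citations)
```

The same page and bbox values can be copied into the metadata stored next to each vector, so a retrieved chunk can be traced back to its position on the page.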

The tool detects page structure: it splits text into semantic blocks, restores the correct reading order (including in multi-column layouts), and recognizes tables with merged cells and nested headers. Scanned PDFs without a text layer need OCR first; the recognized text can then be marked up in the same way. Digital PDFs (created in Word, Excel, or LaTeX) already contain a text layer, so they convert directly and faster.
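To make the reading-order problem concrete, here is a deliberately simplified sketch for a two-column page: blocks are assigned to a column by their horizontal center, then read top to bottom within each column. This is not the tool's actual algorithm. The block structure, the reading_order helper, and the top-left coordinate origin are all assumptions for illustration, and real layout analysis also has to handle full-width headings, figures, and pages with more than two columns.

```python
def reading_order(blocks, page_width):
    """Order blocks column by column, top to bottom within each column.

    Assumes bbox = [x0, y0, x1, y1] with the origin at the top-left corner,
    so a smaller y0 means higher on the page.
    """
    mid = page_width / 2

    def sort_key(block):
        x0, y0, x1, _ = block["bbox"]
        column = 0 if (x0 + x1) / 2 < mid else 1
        return (column, y0)

    return sorted(blocks, key=sort_key)


# Blocks in the scrambled order a naive row-by-row extraction might produce.
blocks = [
    {"text": "C", "bbox": [320, 100, 540, 200]},
    {"text": "A", "bbox": [72, 100, 290, 200]},
    {"text": "D", "bbox": [320, 220, 540, 300]},
    {"text": "B", "bbox": [72, 220, 290, 300]},
]

ordered = [b["text"] for b in reading_order(blocks, page_width=612)]
print(ordered)
```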

Practical tip: if the document is going into a RAG system, choose Markdown and split the result into chunks at second-level headings (##); both LangChain and LlamaIndex ship header-aware Markdown splitters. If you need to ground answers in a precise location in the document, use JSON and store the bounding boxes in metadata alongside each vector. For simple copying and editing, Markdown is enough: it opens in Notion, Obsidian, VS Code, and any text editor.
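The heading-based chunking described above can be sketched with the standard library alone. The split_on_h2 function below is a hypothetical helper, not part of LangChain or LlamaIndex. It starts a new chunk at every "## " line, keeps deeper headings (###) inside their parent chunk, and ignores "## " lines that appear inside fenced code blocks.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def split_on_h2(markdown):
    """Split Markdown text into chunks, one per second-level heading."""
    chunks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        # "### " does not match because its third character is "#".
        is_h2 = line.startswith("## ") and not in_fence
        if is_h2 and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


SAMPLE = (
    "# Report\n\nIntro.\n\n"
    "## Methods\n\nText A.\n\n### Details\n\nMore.\n\n"
    "## Results\n\nText B."
)

chunks = split_on_h2(SAMPLE)
for chunk in chunks:
    print(repr(chunk))
```

Anything before the first ## heading (here the title and intro) becomes its own chunk. In a real pipeline you would probably also cap chunk size, since a single long section can exceed the embedding model's input limit.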
