PDF2Markdown: PDF and Image to Markdown Conversion API for LLM Pipelines

PDF2Markdown, an indie-built API service at pdf2markdown.io, offers developers a single REST endpoint for converting PDFs and images into Markdown and JSON output. The tool targets a foundational layer in LLM pipelines — document ingestion — by accepting URL, base64, or multipart file uploads and returning structured output in a single response. Supported input formats include JPEG, PNG, GIF, WebP, TIFF, BMP, and both native and image-based PDFs. A free tier covers 100 pages per month, with additional usage billed at one credit per page across all plans.
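As a rough illustration of the pricing described above, the following Python sketch estimates a month's credit usage. It assumes the 100 free pages are deducted before any credits are charged, which the source does not spell out. The function and constant names are illustrative and not part of the service.

```python
FREE_PAGES_PER_MONTH = 100  # free-tier allowance stated in the article
CREDITS_PER_PAGE = 1        # one credit per page, per the article


def estimate_credits(pages_processed: int) -> int:
    """Estimate credits billed for a month's page count.

    Assumption: the free allowance is subtracted first. The actual
    billing rules may differ by plan.
    """
    if pages_processed < 0:
        raise ValueError("pages_processed must be non-negative")
    billable_pages = max(0, pages_processed - FREE_PAGES_PER_MONTH)
    return billable_pages * CREDITS_PER_PAGE
```

Under that assumption, a month with 350 processed pages would consume 250 credits, and any month at or below 100 pages would consume none.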

The service's main claim is handling scanned and image-based PDFs without requiring a separate OCR pipeline. Text extractors like PDFMiner read only a PDF's embedded text layer and return nothing useful for scanned pages, while an OCR engine like Tesseract must be run as its own step on rasterized page images. The API preserves document structure including headings, tables, and lists, and its creator positions the output as directly usable in RAG workflows, knowledge base ingestion, and content migration pipelines. According to the product documentation, an async endpoint supports files up to 100 MB for larger workloads, complementing the synchronous API for real-time use cases.
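To make the RAG use case concrete, here is a minimal sketch of how returned Markdown might be split into heading-delimited chunks before embedding. It assumes the output uses ATX-style `#` headings, which the source does not confirm, and `chunk_by_heading` is a hypothetical helper, not part of the PDF2Markdown API.

```python
import re
from typing import Dict, List

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per heading-delimited section."""
    chunks: List[Dict[str, str]] = []
    title = ""  # text before the first heading gets an empty title
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if title or text:
            chunks.append({"heading": title, "text": text})

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section
            title = match.group(2)
            body = []
        else:
            body.append(line)
    flush()  # close the final section
    return chunks
```

This sketch treats any line starting with `#` as a heading, so `#` comments inside fenced code blocks would be split incorrectly. A production pipeline would more likely use a proper Markdown parser.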

A conspicuous gap in PDF2Markdown's documentation is any disclosure of the underlying parsing technology. The combination of layout-aware parsing, image PDF support, and single-step structured output is consistent with a vision-capable multimodal LLM (candidates being GPT-4o, a Claude model, or Gemini) interpreting document pages as images, though this is an inference and not a confirmed detail. The creator has not addressed this publicly, either on Hacker News or in the API docs. For teams evaluating the tool for production use, this opacity raises meaningful questions. If inference is routed through a frontier model API, document content is forwarded beyond PDF2Markdown to a second provider, per-page costs reflect model inference pricing rather than deterministic parsing, and there is a non-zero hallucination risk on structured fields like invoice totals or contract terms. Traditional OCR can misread characters, but it does not generate plausible values that never appeared on the page.
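One way a team might guard against misreported structured fields is a downstream consistency check. The sketch below, which is an illustration and not something the service provides, compares the sum of extracted line-item amounts against the stated invoice total. The helper names and the one-cent tolerance are assumptions.

```python
import re
from decimal import Decimal
from typing import Iterable

# Matches amounts such as "1,200.00" or "-45.5" inside a table cell.
AMOUNT_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_amount(cell: str) -> Decimal:
    """Extract the first numeric amount from a cell such as "$1,200.00"."""
    match = AMOUNT_RE.search(cell)
    if match is None:
        raise ValueError(f"no numeric amount in {cell!r}")
    return Decimal(match.group().replace(",", ""))


def totals_agree(
    line_items: Iterable[str],
    stated_total: str,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Return True if the line items sum to the stated total within tolerance."""
    computed = sum((parse_amount(item) for item in line_items), Decimal("0"))
    return abs(computed - parse_amount(stated_total)) <= tolerance
```

A check like this only catches internal inconsistencies. If the line items and the total are all wrong in a mutually consistent way, it will pass, so high-stakes documents would still need comparison against the source.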

PDF2Markdown fits a pattern that has become common in the AI agent ecosystem: AI-powered preprocessing layers that sit upstream of other AI systems. The contrast with tools like Azure Document Intelligence, AWS Textract, or Google Document AI is instructive — all three disclose their models, return confidence scores, and offer explicit structure APIs. PDF2Markdown trades that interpretability and auditability for simplicity. That tradeoff may be acceptable for low-stakes content pipelines, but builders running agentic workflows over sensitive documents will need to weigh the tool's convenience against the unresolved questions about what model processes their data and where.