Guide · 7 min read

What Gets Lost When You Convert Documents to Markdown

Every document-to-Markdown conversion involves some loss. This is not a bug — it is Markdown's design. Understanding what is lost, and what to do about it, is what separates a quick draft from a clean final output.

Published April 17, 2026 · Updated April 17, 2026

Editorial details

Written by: any2markdown Editorial Team

Reviewed against: Current any2markdown output

Method: Hands-on conversion tests and published source docs

This guide is maintained alongside the live MarkItDown-backed workflow on any2markdown.com. Where a guide compares tools or workflows, the links in the references section point to the original project documentation or repositories.

Markdown's fundamental trade-off

Markdown is designed to be readable in its raw form — which means it intentionally omits the complexity that makes rich formats like PDF, DOCX, and HTML powerful. This is a deliberate design choice, not a deficiency. But it does mean that converting from a rich format to Markdown will always involve some information loss, and the degree of loss depends on how much the source document relied on visual formatting to carry meaning.

The losses fall into three practical categories: things that convert cleanly and need no attention, things that require a light cleanup pass after conversion, and things that Markdown fundamentally cannot express and require a different approach. Understanding which category a given feature falls into saves significant time in post-conversion review.

What converts reliably

Text is the foundation, and it comes through cleanly from all major source formats. Paragraphs, headings with correct hierarchy, bold and italic emphasis, inline code, fenced code blocks, ordered and unordered lists, blockquotes, and simple hyperlinks all convert accurately from DOCX, HTML, and well-structured PDFs.

Simple flat tables — one header row, uniform column structure, no merged cells — convert well. The conversion preserves the column structure and renders it as a Markdown table with pipes and dashes. For documents where tables are the primary structured content, a well-formatted source produces Markdown that is immediately usable.
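For reference, the pipe-and-dash layout is simple enough to produce by hand. The sketch below is illustrative only: `to_markdown_table` is a hypothetical helper, not part of any2markdown or MarkItDown, and it assumes one header row and rows of uniform length.

```python
def to_markdown_table(header, rows):
    """Render a header and uniform rows as a pipe-and-dash Markdown table."""
    def fmt(cells):
        # Escape literal pipes so they are not read as column separators.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    for row in rows:
        if len(row) != len(header):
            raise ValueError("every row must have as many cells as the header")

    lines = [fmt(header), fmt(["---"] * len(header))]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


table = to_markdown_table(
    ["Format", "Tables"],
    [["DOCX", "yes"], ["PDF", "varies"]],
)
print(table)
```

The length check is the point of the example: a flat table has the same number of cells in every row, which is exactly the property that merged cells break.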

Code from HTML and Word documents keeps its separation from the surrounding prose. Citation-style footnotes often survive as plain-text references, even if the linked anchors are lost. Section numbering embedded in heading text is preserved as part of the heading.

What needs a cleanup pass

Complex tables with merged cells, spanning rows, or nested structure frequently flatten during conversion. The text content is preserved but the row and column relationships may collapse into a sequence of list items or prose. For tables where the structure carries important meaning — financial data tables, comparison matrices, schedules with spanning date columns — manually restructuring the Markdown table is usually the right fix. The text is already there; the task is restoring the pipe and dash syntax.
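When a table has flattened into a plain sequence of cell texts, regrouping them can be partly automated. This is a minimal sketch under strong assumptions: the cells arrive in row-major order, you know the column count from the source, and no cell was merged. `regroup_cells` is a hypothetical helper name.

```python
def regroup_cells(cells, n_columns):
    """Split a flat, row-major list of cell texts into rows of n_columns."""
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    if len(cells) % n_columns != 0:
        # A leftover cell usually means a merged or spanning cell in the source.
        raise ValueError("cell count is not a multiple of the column count")
    return [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]


flattened = ["Quarter", "Revenue", "Q1", "120", "Q2", "135"]
rows = regroup_cells(flattened, 2)

# First row is the header, followed by the dash separator, then the data rows.
table_rows = [rows[0], ["---"] * 2, *rows[1:]]
markdown = "\n".join("| " + " | ".join(r) + " |" for r in table_rows)
print(markdown)
```

If the count check fails, the table probably had spanning cells and needs to be restructured by hand against the source.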

Multi-column page layouts, common in academic papers, newsletters, and magazine-style PDFs, often produce text that interleaves content from adjacent columns instead of following each column top to bottom. When column order matters, the affected sections need manual reordering. Single-column PDFs do not have this problem.

Paginated documents produce repeated boilerplate. Document titles in headers, page numbers in footers, and table of contents entries appearing before each chapter are all repeated throughout the extracted text. A short search-and-delete pass for the repeated pattern is usually all that is needed. For long regulatory documents, this cleanup can remove a substantial amount of noise before the content reaches a downstream system.
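The search-and-delete pass can be scripted when the boilerplate repeats verbatim. The sketch below is one possible approach, not the tool's own behavior: it drops any non-empty line that occurs at least `min_repeats` times and any line that looks like a bare page number. Both the threshold and the `PAGE_NUMBER` pattern are assumptions to tune per document, and the page-number rule will also remove legitimate lines that consist only of a number.

```python
import re
from collections import Counter

# Matches lines such as "12", "Page 12", or "page 12 of 40".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def strip_repeated_lines(text, min_repeats=3):
    """Remove bare page numbers and lines repeated min_repeats times or more."""
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        stripped = line.strip()
        if PAGE_NUMBER.match(stripped):
            continue  # bare page number
        if stripped and counts[stripped] >= min_repeats:
            continue  # running header or footer (all copies are dropped)
        kept.append(line)
    return "\n".join(kept)


sample = (
    "Annual Report\nIntro text\n1\n"
    "Annual Report\nMore text\n2\n"
    "Annual Report\nFinal text\n3"
)
cleaned = strip_repeated_lines(sample)
print(cleaned)
```

Note that every copy of a repeated line is removed, including the first, so record the document title before running a pass like this.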

Footnotes and endnotes from Word and PDF documents typically survive as text but lose their anchor linking. The footnote content appears somewhere in the output — often at the end of a section or the end of the document — but the superscript reference in the body text that links to it is gone. If footnote provenance is important, manual annotation after conversion reconnects them.

What Markdown cannot express

Visual formatting that carries semantic meaning is the most significant category of permanent loss. Markdown has no concept of text color, background highlighting, or font size differences beyond heading levels one through six. Documents where colored text signals status — red for issues, green for approved, yellow for review — lose those signals entirely. If the status information is important, it needs to be re-encoded in Markdown syntax, typically as inline labels or prefix characters.
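Re-encoding color as an inline label might look like the sketch below. It assumes you have already pulled `(text, color)` pairs out of the source document by some other means; the `STATUS_LABELS` mapping and the `label_run` helper are hypothetical and simply mirror the red, green, and yellow example above.

```python
# Hypothetical mapping from a source color name to a plain-text status label.
STATUS_LABELS = {"red": "ISSUE", "green": "APPROVED", "yellow": "REVIEW"}


def label_run(text, color=None):
    """Prefix text with a bold status label when its color has a known meaning."""
    label = STATUS_LABELS.get((color or "").lower())
    return f"**[{label}]** {text}" if label else text


runs = [
    ("Budget figures are outdated.", "red"),
    ("Timeline confirmed.", "green"),
    ("Background section.", None),
]
lines = [label_run(text, color) for text, color in runs]
print("\n".join(lines))
```

A text prefix survives every later processing step, including search and LLM ingestion, which a color never does.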

Images embedded in a document are not converted to text descriptions automatically. The surrounding text converts normally, and the image is represented as a placeholder or omitted entirely. If the image content carries information that the document relies on — a chart, a diagram, a screenshot with annotated callouts — that information is absent from the Markdown. It requires either manual description or processing through a vision-capable model separately.

Mathematical equations present a well-known challenge. LaTeX equations in a LaTeX-produced PDF may survive as raw LaTeX strings, which Markdown renderers with math support can display correctly. But equations created with Word's equation editor or rendered as images within a PDF typically convert to fragments or placeholders, or are dropped entirely. Math-intensive documents often require significant post-conversion equation repair.

Tracked changes and review comments in Word documents do not appear in the converted output. The final accepted text is what appears in Markdown; edit history, margin comments, and reviewer notes are not preserved. If the review state of a document matters, keep the original Word file as the record, since the Markdown will not carry it. To be sure which wording is converted, accept or reject the tracked changes in Word before converting.

Practical advice for post-conversion review

For most text-centric documents, a five-minute review pass is enough to catch the main issues. Start by scanning the beginning and end of the document, where headers and footers tend to cluster. Check any tables for structural integrity. Verify that the heading hierarchy in the output matches the section structure you expected from the source.
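The heading check is easy to script. The following is a rough sketch rather than a full Markdown parser: it reads only `#`-style headings, skips fenced code blocks, and flags any heading that jumps more than one level deeper than the one before it. The function names are illustrative.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def heading_outline(markdown):
    """Return (level, title) pairs for #-style headings outside code fences."""
    outline = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


def skipped_levels(outline):
    """Return titles of headings that jump more than one level deeper."""
    problems = []
    previous = 0
    for level, title in outline:
        if level > previous + 1:
            problems.append(title)
        previous = level
    return problems


doc = "# Guide\n\n### Details\n\n## Setup\n\n#### Deep\n"
outline = heading_outline(doc)
print(skipped_levels(outline))
```

Printing the outline next to the source's table of contents is often the fastest way to spot a heading that was demoted to body text or promoted by mistake. A document whose first heading is level two or deeper is also flagged, which may or may not be what you want.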

For high-stakes documents — legal contracts, technical specifications, financial reports with regulatory implications — treat the Markdown as a working draft and validate key values against the source. Numbers, dates, proper nouns, and table values are the most important to verify manually. Conversion can silently misread a character in a number or transpose a date.
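Part of that validation can be automated if you have a second plain-text copy of the source, for example text copied out of the original by hand. The sketch below assumes such a copy exists and compares the numeric tokens in both texts. `missing_numbers` is a hypothetical helper, and it supplements a manual read of the key values rather than replacing it.

```python
import re
from collections import Counter

# A run of digits, optionally continued by "." or "," groups, e.g. 1,250.00
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def missing_numbers(source_text, markdown_text):
    """Return numeric tokens found in the source but absent from the Markdown."""
    source = Counter(NUMBER.findall(source_text))
    converted = Counter(NUMBER.findall(markdown_text))
    # Counter subtraction keeps only tokens the source has more often.
    return sorted((source - converted).elements())


source_text = "Total due: 1,250.00 by 2026-04-17."
markdown_text = "Total due: 1,260.00 by 2026-04-17."
print(missing_numbers(source_text, markdown_text))
```

An empty result means every number in the source also appears in the output the same number of times. It does not prove the numbers sit in the right place, so table values still deserve a direct look.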

For documents that will be fed to an LLM, consider noting any sections of uncertain conversion quality in your prompt. Telling the model where you expect OCR variability or table flattening allows it to hedge appropriately in those sections rather than producing confident responses based on potentially degraded input.
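One way to pass those caveats along is to put them ahead of the document text in the prompt. The wording and the `build_prompt` helper below are purely illustrative assumptions; adapt both to your own model and prompt format.

```python
def build_prompt(question, markdown, caveats):
    """Assemble a prompt that lists conversion caveats before the document."""
    parts = ["The document below was converted to Markdown automatically."]
    if caveats:
        notes = "\n".join(f"- {section}: {issue}" for section, issue in caveats)
        parts.append("Treat these sections with caution:\n" + notes)
    parts.append("Document:\n" + markdown)
    parts.append("Question: " + question)
    return "\n\n".join(parts)


prompt = build_prompt(
    question="What is the total for Q2?",
    markdown="## Results\n\nQ1 120 Q2 135",
    caveats=[
        ("Results", "table may have been flattened"),
        ("Appendix", "scanned pages, OCR quality varies"),
    ],
)
print(prompt)
```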

Try the conversion and see for yourself

Upload your document and review the Markdown output. The preview shows you exactly what converts before you download.

Frequently asked questions

Will my document's formatting be preserved in Markdown?

Text structure transfers reliably: headings, lists, bold, italic, code blocks, and simple tables all convert cleanly. Visual formatting — font sizes, colors, background highlights, complex table styles — does not have Markdown equivalents and is lost during conversion.

What happens to embedded images in a PDF?

Images are not automatically converted to text descriptions. The surrounding text converts normally. If image content is critical to your use case, you will need to manually describe the images in the output or process them separately with a vision model.

Can I convert a PDF with charts to Markdown?

The text around charts converts, but the chart content itself is not represented as Markdown. Charts need to be described manually or processed separately with an image understanding model if their content is important.

Are there documents where Markdown conversion doesn't make sense?

Yes. Documents whose primary value is visual, such as architectural drawings, infographic-heavy slides, design documents, and brochures, produce thin or empty Markdown output with little of the original value intact. These are better handled with image understanding tools than with text-based conversion.

What about scanned PDFs with handwritten notes?

Handwriting is the most challenging input for OCR-based conversion. Neat, print-style handwriting on a clean background may convert partially. Casual or cursive handwriting on a mixed background typically produces low-accuracy output that requires significant manual correction.

References