Markdown's fundamental trade-off
Markdown is designed to be readable in its raw form — which means it intentionally omits the complexity that makes rich formats like PDF, DOCX, and HTML powerful. This is a deliberate design choice, not a deficiency. But it does mean that converting from a rich format to Markdown will always involve some information loss, and the degree of loss depends on how much the source document relied on visual formatting to carry meaning.
The losses fall into three practical categories: things that convert cleanly and need no attention, things that require a light cleanup pass after conversion, and things that Markdown fundamentally cannot express and require a different approach. Understanding which category a given feature falls into saves significant time in post-conversion review.
What converts reliably
Text is the foundation, and it comes through cleanly from all major source formats. Paragraphs, headings with correct hierarchy, bold and italic emphasis, inline code, fenced code blocks, ordered and unordered lists, blockquotes, and simple hyperlinks all convert accurately from DOCX, HTML, and well-structured PDFs.
Simple flat tables — one header row, uniform column structure, no merged cells — convert well. The conversion preserves the column structure and renders it as a Markdown table with pipes and dashes. For documents where tables are the primary structured content, a well-formatted source produces Markdown that is immediately usable.
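The pipe-and-dash shape that converters emit for flat tables is simple enough to sketch. The helper below is illustrative, assuming a one-header-row, uniform-column input; it is not any particular converter's implementation.

```python
def rows_to_markdown_table(header, rows):
    """Render a flat table (one header row, uniform columns) as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",          # header row
        "| " + " | ".join("---" for _ in header) + " |",  # separator row
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

print(rows_to_markdown_table(["Name", "Role"], [["Ada", "Engineer"], ["Grace", "Admiral"]]))
# | Name | Role |
# | --- | --- |
# | Ada | Engineer |
# | Grace | Admiral |
```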
Inline code and code blocks from HTML and Word documents convert cleanly, preserving the distinction between code and prose. Citation-style footnotes often survive as plain text references, even if the linked anchors are lost. Section numbering embedded in heading text is preserved as text within the heading.
What needs a cleanup pass
Complex tables with merged cells, spanning rows, or nested structure frequently flatten during conversion. The text content is preserved but the row and column relationships may collapse into a sequence of list items or prose. For tables where the structure carries important meaning — financial data tables, comparison matrices, schedules with spanning date columns — manually restructuring the Markdown table is usually the right fix. The text is already there; the task is restoring the pipe and dash syntax.
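When the flattening is regular, the restoration can even be scripted. The sketch below assumes a hypothetical converter output of one "Column: value" line per cell in row-major order; real output varies, so inspect yours before adapting this.

```python
def rebuild_table(flattened, columns):
    """Rebuild a Markdown pipe table from flattened 'Column: value' lines.

    Assumes (hypothetically) one 'Column: value' line per cell, in
    row-major order; a new row starts whenever the first column reappears.
    """
    rows, current = [], {}
    for line in flattened:
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key == columns[0] and current:
            rows.append(current)  # first column seen again -> new row
            current = {}
        current[key] = val
    if current:
        rows.append(current)
    out = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for r in rows:
        out.append("| " + " | ".join(r.get(c, "") for c in columns) + " |")
    return "\n".join(out)
```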
Multi-column page layouts — common in academic papers, newsletters, and magazine-style PDFs — often produce text that interleaves content from adjacent columns in the reading order rather than following each column top-to-bottom. When column separation matters, affected sections need manual reordering. For single-column PDFs this is rarely an issue.
Paginated documents produce repeated boilerplate. Document titles in headers, page numbers in footers, and table of contents entries appearing before each chapter are all repeated throughout the extracted text. A short search-and-delete pass for the repeated pattern is usually all that is needed. For long regulatory documents, this cleanup can remove a substantial amount of noise before the content reaches a downstream system.
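That search-and-delete pass is easy to automate once you know the repeated pattern. A minimal sketch, assuming a known header string and conventional "Page N of M" footers; the first header occurrence is kept because it may carry real metadata.

```python
import re

def strip_boilerplate(text, header):
    """Remove repeated page headers and bare page-number lines.

    Keeps the first occurrence of `header` and deletes the repeats,
    plus lines like '12' or 'Page 12 of 80'. The patterns here are
    illustrative assumptions, not universal.
    """
    out, seen_header = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == header:
            if seen_header:
                continue  # repeated running header
            seen_header = True
        elif re.fullmatch(r"(Page\s+)?\d+(\s+of\s+\d+)?", stripped):
            continue  # bare page number or 'Page N of M' footer
        out.append(line)
    return "\n".join(out)
```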
Footnotes and endnotes from Word and PDF documents typically survive as text but lose their anchor linking. The footnote content appears somewhere in the output — often at the end of a section or the end of the document — but the superscript reference in the body text that pointed to it is gone. If footnote provenance matters, re-linking the references to their notes manually after conversion is the fix.
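If the converter happened to leave bracketed markers like [1] in the body, re-linking can be mechanical. This sketch assumes that hypothetical shape, plus note text recovered into a dict; both are assumptions about your particular output, and it emits the Markdown footnote syntax ([^n]) supported by GitHub-flavored renderers.

```python
import re

def relink_footnotes(body, notes):
    """Re-attach footnote anchors using Markdown footnote syntax ([^n]).

    Assumes bracketed references like [1] survived in the body and the
    note text is available as {number: text} -- hypothetical shapes;
    inspect your actual converter output first.
    """
    linked = re.sub(r"\[(\d+)\]", r"[^\1]", body)  # [1] -> [^1]
    defs = "\n".join(f"[^{n}]: {text}" for n, text in sorted(notes.items()))
    return linked + "\n\n" + defs
```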
What Markdown cannot express
Visual formatting that carries semantic meaning is the most significant category of permanent loss. Markdown has no concept of text color, background highlighting, or font size differences beyond heading levels one through six. Documents where colored text signals status — red for issues, green for approved, yellow for review — lose those signals entirely. If the status information is important, it needs to be re-encoded in Markdown syntax, typically as inline labels or prefix characters.
Images embedded in a document are not converted to text descriptions automatically. The surrounding text converts normally, while the image itself is represented as a placeholder or omitted entirely. If the image content carries information the document relies on — a chart, a diagram, a screenshot with annotated callouts — that information is absent from the Markdown. Recovering it requires either a manual description or a separate pass through a vision-capable model.
Mathematical equations present a well-known challenge. LaTeX equations in a LaTeX-produced PDF may survive as raw LaTeX strings, which Markdown-capable renderers can display correctly. But equations created with Word's equation editor or rendered as images within a PDF typically convert to fragments or placeholders, or are dropped entirely. Math-intensive documents often require significant post-conversion equation repair.
Tracked changes and review comments in Word documents are invisible in the converted output. The final accepted text is what appears in Markdown — edit history, margin comments, and reviewer notes are not preserved. If the review state of a document matters, export the Word file with tracked changes accepted before converting.
Practical advice for post-conversion review
For most text-centric documents, a five-minute review pass is enough to catch the main issues. Start by scanning the beginning and end of the document, where headers and footers tend to cluster. Check any tables for structural integrity. Verify that the heading hierarchy in the output matches the section structure you expected from the source.
For high-stakes documents — legal contracts, technical specifications, financial reports with regulatory implications — treat the Markdown as a working draft and validate key values against the source. Numbers, dates, proper nouns, and table values are the most important to verify manually. Conversion can silently misread a character in a number or transpose a date.
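A cheap first pass at that validation is to diff the numeric tokens in the source text against the converted output. This sketch only flags numbers that appear on one side and not the other; it is a triage tool, not a full verification.

```python
import re

def compare_numbers(source_text, markdown_text):
    """Diff the numeric tokens in source vs converted output.

    Returns numbers present in one side but not the other -- a quick
    check for silently misread digits, not an exhaustive validation
    (it ignores position and repetition).
    """
    pattern = r"\d[\d,.]*"  # digits with optional thousands/decimal separators
    src = set(re.findall(pattern, source_text))
    md = set(re.findall(pattern, markdown_text))
    return {
        "missing_from_markdown": src - md,
        "unexpected_in_markdown": md - src,
    }
```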
For documents that will be fed to an LLM, consider noting any sections of uncertain conversion quality in your prompt. Telling the model where you expect OCR variability or table flattening allows it to hedge appropriately in those sections rather than producing confident responses based on potentially degraded input.
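One lightweight way to do this is to prepend the caveats to the prompt itself. The wording below is illustrative, not a recommended canonical prompt.

```python
def hedged_prompt(document_md, caveats):
    """Prepend conversion caveats so the model can hedge over degraded spans.

    `caveats` is a list of plain-language notes about uncertain sections;
    the framing text here is an illustrative assumption.
    """
    notes = "\n".join(f"- {c}" for c in caveats)
    return (
        "The following Markdown was converted from a PDF. "
        "Known conversion issues:\n"
        f"{notes}\n\n"
        "Treat affected sections as potentially inaccurate.\n\n"
        f"{document_md}"
    )
```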