This document records the current implementation-aligned behavior of docx2md.
It complements:
- miku-xlsx2md

When implementation behavior differs from idealized design intent, this document should describe the implementation behavior explicitly.
The current first cut includes:
- `.docx` ZIP entry reading
- `<br>`
- `numbering.xml`
- `←M←` and `↑M↑`

The current first cut still excludes:
The current end-to-end flow is:
- `.docx` file as bytes
- `word/document.xml`
- `word/_rels/document.xml.rels`, `word/styles.xml`, `word/numbering.xml`, and `[Content_Types].xml`
- `blocks`, `summary`, and any resolved sidecar assets

Current behavior:
- `word/document.xml` is mandatory

If `word/document.xml` is missing, parsing fails with an explicit error.
Current behavior:
- `DOMParser`
- `localName` rather than namespace prefix spelling

Whitespace is not aggressively normalized at XML-read time. Most text normalization happens during inline extraction and Markdown rendering.
Current behavior:
- `word/_rels/document.xml.rels` is parsed into a map keyed by relationship id
- `word/document.xml`

For hyperlink rendering:
- `r:id` with a known external relationship becomes an external Markdown link
- `r:id` with an internal fragment target becomes an internal Markdown link only when the normalized target matches a known bookmark owner
- `w:anchor` target is available

Current behavior:
- `Heading n` / `見出し n` recognition is supported
- `styles.xml`
- `outlineLvl` acts as a fallback when style-based heading detection does not resolve a level
- `basedOn` chains are cut off safely during resolution

Heading levels are clamped to Markdown heading range `1..6`.
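The heading rules above can be illustrated with a minimal sketch. The function names are hypothetical and not taken from the docx2md source, and the mapping from `outlineLvl` to a level assumes the zero-based value stored in `w:outlineLvl` (so `0` means level 1); the real implementation may resolve style names through `styles.xml` differently.

```typescript
// Hypothetical helpers illustrating heading-level resolution; not the actual docx2md code.
const HEADING_STYLE_NAME = /^(?:heading|見出し)\s*(\d+)$/i;

/** Clamp any resolved level into the Markdown heading range 1..6. */
function clampHeadingLevel(level: number): number {
  return Math.min(6, Math.max(1, Math.trunc(level)));
}

/**
 * Resolve a heading level from a style name, falling back to outlineLvl.
 * Assumption: outlineLvl is zero-based, so 0 maps to heading level 1.
 */
function resolveHeadingLevel(
  styleName: string | undefined,
  outlineLvl: number | undefined
): number | undefined {
  const match = styleName ? HEADING_STYLE_NAME.exec(styleName.trim()) : null;
  if (match) {
    return clampHeadingLevel(Number(match[1]));
  }
  if (outlineLvl !== undefined && outlineLvl >= 0 && outlineLvl <= 8) {
    return clampHeadingLevel(outlineLvl + 1);
  }
  return undefined; // not a heading
}
```

With this sketch, a style named `Heading 9` resolves to level 6, and a paragraph with an unrecognized style but `outlineLvl` of `0` resolves to level 1.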
Current behavior:
- `pPr/rPr` override inherited paragraph-style text formatting
- `rStyle` may contribute inherited run formatting
- `rPr` overrides both paragraph-derived and character-style-derived text formatting
- `w:val="0"` / `false` on supported run-format flags disables inherited formatting for that scope
- `w:br` becomes `<br>`

Hyperlink text suppresses underline wrapping so that link syntax is not nested with underline output.
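One way to picture the override order above is a "last defined layer wins" resolution for a single flag such as bold. This is a sketch under assumptions: the layer order (paragraph style, paragraph-level `rPr`, character style, direct run `rPr`) is read from the list above, and `parseToggle` and `resolveToggle` are hypothetical names.

```typescript
// Hypothetical sketch of run-format flag resolution; not the actual docx2md code.
type ToggleValue = boolean | undefined; // undefined = this layer does not mention the flag

/**
 * Interpret the w:val attribute of a flag element such as w:b.
 * A missing attribute means the flag is on; "0" or "false" turns it off.
 */
function parseToggle(val: string | null): boolean {
  if (val === null) return true;
  const normalized = val.trim().toLowerCase();
  return !(normalized === "0" || normalized === "false");
}

/**
 * Resolve one flag across layers ordered from weakest to strongest.
 * The last layer that defines the flag wins, so an explicit "off"
 * disables formatting inherited from earlier layers.
 */
function resolveToggle(layers: ToggleValue[]): boolean {
  let result = false;
  for (const layer of layers) {
    if (layer !== undefined) result = layer;
  }
  return result;
}
```

For example, `resolveToggle([true, undefined, undefined, parseToggle("0")])` models a bold paragraph style that a direct run property switches off.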
Current behavior:
- `[text](url)`
- `w:anchor` render as `[text](#anchor)` only when the normalized anchor matches a known bookmark owner
- `#anchor` follow the same known-anchor check
- `bookmarkStart` are collected as block anchor ids
- `_` are ignored
- `<a id="anchor"></a>`
- `-`, replaces unsupported punctuation with `-`, and collapses repeated `-`

Current behavior:
- `numbering.xml` is parsed into `abstractNum` and `num` mappings
- `numId` and `ilvl` are used to determine list kind and nesting depth
- `-`
- `1.`
- `4 spaces` per nesting level

If numbering metadata cannot be resolved, the paragraph falls back to ordinary paragraph behavior rather than forcing a synthetic list.
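The list behavior above can be sketched as a lookup followed by a simple line renderer. The data shapes and function names are hypothetical. The sketch assumes that a `numFmt` of `bullet` marks a bullet list and any other format is treated as ordered, and that every ordered item is emitted with a literal `1.` marker.

```typescript
// Hypothetical sketch of list resolution and rendering; not the actual docx2md code.
type ListKind = "bullet" | "ordered";

interface NumberingMaps {
  numToAbstract: Map<string, string>; // numId -> abstractNumId
  levelFormats: Map<string, Map<number, string>>; // abstractNumId -> ilvl -> numFmt
}

/** Returns undefined when metadata is missing, so the caller can fall back to a plain paragraph. */
function resolveListKind(maps: NumberingMaps, numId: string, ilvl: number): ListKind | undefined {
  const abstractId = maps.numToAbstract.get(numId);
  if (abstractId === undefined) return undefined;
  const levels = maps.levelFormats.get(abstractId);
  const format = levels ? levels.get(ilvl) : undefined;
  if (format === undefined) return undefined;
  return format === "bullet" ? "bullet" : "ordered";
}

/** Render one list item with 4 spaces of indentation per nesting level. */
function renderListItem(kind: ListKind, ilvl: number, text: string): string {
  const indent = " ".repeat(4 * Math.max(0, ilvl));
  const marker = kind === "ordered" ? "1." : "-";
  return `${indent}${marker} ${text}`;
}
```

Returning `undefined` instead of guessing a list kind mirrors the fallback rule: an unresolved `numId` leaves the paragraph as ordinary text.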
Current behavior:
- `w:tr`
- `w:tc`
- `<br>`
- `<br>`
- `## Heading`
- `- ` or `1. `
- `←M←`
- `↑M↑`
- `←M←`

Markdown rendering uses the first row as the header row.
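As a rough illustration of the first-row-as-header rule, the following sketch renders a grid of already-extracted cell strings as a Markdown pipe table. The function names are hypothetical. Padding short rows, escaping `|`, and turning newlines into `<br>` are assumptions made for this sketch rather than confirmed docx2md behavior.

```typescript
// Hypothetical sketch of table rendering; not the actual docx2md code.

/** Keep cell text on one table line (escaping rules here are assumptions). */
function escapeCell(text: string): string {
  return text.replace(/\|/g, "\\|").replace(/\r?\n/g, "<br>");
}

/** Render rows as a Markdown table, using the first row as the header row. */
function renderTable(rows: string[][]): string {
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((row) => row.length));
  const toLine = (row: string[]): string => {
    const cells: string[] = [];
    // Pad short rows with empty cells so every line has the same column count.
    for (let i = 0; i < width; i++) cells.push(escapeCell(row[i] ?? ""));
    return `| ${cells.join(" | ")} |`;
  };
  const separator = `| ${new Array<string>(width).fill("---").join(" | ")} |`;
  const header = rows[0] ?? [];
  return [toLine(header), separator, ...rows.slice(1).map((row) => toLine(row))].join("\n");
}
```

Because Markdown pipe tables require a header, promoting the first row keeps the output valid even when the source table has no dedicated header row.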
Current behavior:
- `#` through `######`

Anchor rendering is inserted immediately before the owning paragraph, heading, or list item block.
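The anchor placement described above, together with the anchor normalization noted earlier, might look like the following sketch. All names are hypothetical. Treating "unsupported punctuation" as ASCII punctuation other than `_` and `-`, converting whitespace to `-`, skipping bookmark names that start with `_`, and separating the anchor from its block with a newline are assumptions made for illustration.

```typescript
// Hypothetical sketch of anchor normalization and placement; not the actual docx2md code.

/** Normalize a bookmark name into an anchor id (rules here are assumptions). */
function normalizeAnchor(raw: string): string {
  return raw
    .trim()
    .replace(/\s+/g, "-") // whitespace becomes "-"
    .replace(/[!"#$%&'()*+,.\/:;<=>?@[\\\]^`{|}~]/g, "-") // ASCII punctuation becomes "-"
    .replace(/-{2,}/g, "-"); // collapse repeated "-"
}

/** Emit anchors immediately before the Markdown of the block that owns them. */
function renderWithAnchors(bookmarkNames: string[], blockMarkdown: string): string {
  const anchors = bookmarkNames
    .filter((name) => !name.startsWith("_")) // assumed: "_"-prefixed bookmarks are ignored
    .map((name) => normalizeAnchor(name))
    .filter((id) => id.length > 0)
    .map((id) => `<a id="${id}"></a>`);
  return [...anchors, blockMarkdown].join("\n");
}

/** Render an internal link only when the normalized target is a known anchor. */
function renderInternalLink(text: string, target: string, knownAnchors: Set<string>): string {
  const id = normalizeAnchor(target);
  return knownAnchors.has(id) ? `[${text}](#${id})` : text;
}
```

Normalizing both the bookmark name and the link target with the same function is what lets the known-anchor check compare them directly.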
Current summary fields are:
- `paragraphs`
- `headings`
- `listItems`
- `tables`
- `images`
- `imageAssets`
- `drawingLikeUnsupported`
- `links`
- `internalLinks`
- `externalLinks`
- `unsupportedElements`
- `unsupportedCommentTraces`

`unsupportedCommentTraces` currently counts both standalone unsupported blocks and unsupported traces attached to supported blocks.
Unsupported element traces currently use a small normalized category set for common cases:
- `drawing` for drawing-like elements such as `drawing`, `pict`, and `object`
- `textbox` for textbox-like elements such as `txbxContent`
- `chart` for `chart`
- `localName` is used

When unsupported traces are rendered as debug HTML comments, comment-breaking sequences from source metadata are sanitized so the debug output does not prematurely close the comment.
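A minimal sketch of the category mapping and comment sanitization described above follows. The function names are hypothetical, the fallback to the raw `localName` is inferred from the category list, and the specific sanitization strategy (spacing out runs of hyphens) is only one possible approach, not necessarily the one docx2md uses.

```typescript
// Hypothetical sketch of trace categorization and comment sanitization; not the actual docx2md code.

/** Map an element localName to a normalized trace category. */
function traceCategory(localName: string): string {
  switch (localName) {
    case "drawing":
    case "pict":
    case "object":
      return "drawing";
    case "txbxContent":
      return "textbox";
    case "chart":
      return "chart";
    default:
      return localName; // assumed fallback: use the localName as-is
  }
}

/** Break up runs of hyphens so the text cannot contain "--" or "-->". */
function sanitizeCommentText(text: string): string {
  return text.replace(/-{2,}/g, (run) => run.split("").join(" "));
}

/** Wrap a trace in an HTML comment for debug-style output. */
function renderTraceComment(trace: string): string {
  return `<!-- ${sanitizeCommentText(trace)} -->`;
}
```

After sanitization, the only `-->` in the rendered string is the closing delimiter, so alt text or file names taken from the source document cannot end the comment early.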
For drawing-like unsupported elements, when an embedded image relationship can be resolved safely, the current debug trace may include the package target in a form such as:
- `drawing:image(word/media/example.png)`

When drawing metadata exposes image alt text through attributes such as `descr` or `title`, the current debug trace may append that metadata in a form such as:
- `drawing:image(word/media/example.png):alt(Example alt text)`

Image trace parsing preserves alt text that contains ordinary parentheses.
When drawing metadata exposes wp:extent, the current debug trace may append the EMU size in a form such as:
- `drawing:image(word/media/example.png):size-emu(914400x457200)`

When unsupported content is found inside a supported paragraph or table, the trace is attached to that owning block and rendered as an adjacent HTML comment only in debug-style output.
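To show how the image trace forms above could be read back while keeping parentheses inside alt text, here is a hedged parsing sketch. The `ImageTrace` shape and function names are hypothetical. The sketch assumes that the package target contains no `)` and that, when both suffixes are present, `:size-emu(...)` comes after `:alt(...)`; the source does not show a combined form. The pixel conversion uses the standard ratio of 914400 EMU per inch at 96 DPI.

```typescript
// Hypothetical sketch of image trace parsing; not the actual docx2md code.
interface ImageTrace {
  target: string;
  alt?: string;
  widthEmu?: number;
  heightEmu?: number;
}

function parseImageTrace(trace: string): ImageTrace | undefined {
  const prefix = "drawing:image(";
  if (!trace.startsWith(prefix)) return undefined;
  const targetEnd = trace.indexOf(")", prefix.length);
  if (targetEnd < 0) return undefined;

  const result: ImageTrace = { target: trace.slice(prefix.length, targetEnd) };
  let rest = trace.slice(targetEnd + 1);

  // Peel the size suffix off the end first so it is never mistaken for alt text.
  const size = /:size-emu\((\d+)x(\d+)\)$/.exec(rest);
  if (size) {
    result.widthEmu = Number(size[1]);
    result.heightEmu = Number(size[2]);
    rest = rest.slice(0, size.index);
  }

  // Whatever remains between ":alt(" and the final ")" is alt text, parentheses included.
  const alt = /^:alt\(([\s\S]*)\)$/.exec(rest);
  if (alt) result.alt = alt[1];
  return result;
}

/** Convert EMU to CSS pixels at 96 DPI (914400 EMU per inch, so 9525 EMU per pixel). */
function emuToPixels(emu: number): number {
  return Math.round(emu / 9525);
}
```

Stripping the size suffix before matching the alt text is the design choice that keeps a greedy alt pattern from swallowing `:size-emu(...)`.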
Current textbox handling is a limited compromise:
- `txbxContent` nested inside a supported block may contribute plain extracted paragraph text

Current image handling is also a limited compromise:
- `[Image: Example alt text]`
- `assets`
- `--assets-dir <dir>` is specified
- `--assets-dir <dir>` is used, the current CLI also switches image placeholders to relative Markdown image links when an alt text is available
- `[Content_Types].xml` declarations when available and falls back to extension-based media-type inference otherwise
- `manifest.json` with asset path, media type, alt text, byte size, originating unsupported trace, owning block index, and a finer `documentPosition` object with block kind and per-block trace index

Browser UI behavior is owned by the separated miku-docx2md-web repository.
Current CLI options include:
- `--out <file>`
- `--assets-dir <dir>`
- `--summary`
- `--summary-out <file>`
- `--debug`
- `--include-unsupported-comments`
- `--help`

`--debug` and `--include-unsupported-comments` currently enable the same Markdown behavior.
The main remaining implementation questions are: