miku-docx2md

docx2md Implementation Specification

1. Document Role

This document records the current implementation-aligned behavior of docx2md.

It complements:

When implementation behavior differs from idealized design intent, this document should describe the implementation behavior explicitly.

2. Current Implementation Scope

The current first cut includes:

The current first cut still excludes:

3. Overall Flow

The current end-to-end flow is:

  1. read a .docx file as bytes
  2. expand ZIP entries in-house
  3. load word/document.xml
  4. optionally load word/_rels/document.xml.rels, word/styles.xml, word/numbering.xml, and [Content_Types].xml
  5. parse document blocks in document order
  6. build a lightweight parsed document with blocks, summary, and any resolved sidecar assets
  7. render Markdown
  8. optionally emit summary text, unsupported debug comments, and exported image assets

4. ZIP Handling

Current behavior:

If word/document.xml is missing, parsing fails with an explicit error.
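
The required-part check can be sketched as follows. This is a minimal illustration, not the project's code: the ZipEntries shape, the function name requireMainDocument, and the error message wording are all assumptions made for this example.

```typescript
// Hypothetical shape for the result of in-house ZIP expansion:
// package path -> raw bytes. The real data structure may differ.
type ZipEntries = { [path: string]: Uint8Array | undefined };

// Returns the main document part, or fails with an explicit error.
// The message text is an assumption, not the tool's actual wording.
function requireMainDocument(entries: ZipEntries): Uint8Array {
  const main = entries["word/document.xml"];
  if (main === undefined) {
    throw new Error("docx2md: word/document.xml is missing from the package");
  }
  return main;
}
```

Optional parts such as word/styles.xml would simply be looked up without the throw, leaving the caller to handle an absent value.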

5. XML Utilities

Current behavior:

Whitespace is not aggressively normalized at XML-read time. Most text normalization happens during inline extraction and Markdown rendering.

6. Relationship Resolution

Current behavior:

For hyperlink rendering:

7. Style and Heading Resolution

Current behavior:

Heading levels are clamped to Markdown heading range 1..6.
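
As a small illustration of the clamping rule, the sketch below maps any integer level into 1..6 before emitting the heading marker. The function names clampHeadingLevel and renderHeading are hypothetical, and the sketch assumes the level has already been resolved to an integer.

```typescript
// Clamp a resolved heading level into the Markdown range 1..6.
// Assumes `level` is an integer; names here are illustrative only.
function clampHeadingLevel(level: number): number {
  return Math.min(6, Math.max(1, level));
}

// Render an ATX-style heading line using the clamped level.
function renderHeading(level: number, text: string): string {
  const hashes = "######".slice(0, clampHeadingLevel(level));
  return hashes + " " + text;
}
```

Under this sketch, a level of 9 renders with six # characters and a level of 0 renders with one.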

8. Inline Formatting

Current behavior:

Underline wrapping is suppressed for hyperlink text, so that link syntax is never nested with underline output.
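
The underline suppression can be pictured with the sketch below. It assumes, purely for illustration, that underline is emitted as <u>...</u> and that links use standard Markdown link syntax; the SketchRun type and both function names are hypothetical.

```typescript
// Hypothetical run model: text plus an optional underline flag.
interface SketchRun {
  text: string;
  underline?: boolean;
}

// Wrap underlined text in <u>...</u> (an assumed output form),
// unless the run belongs to a hyperlink.
function renderSketchRun(run: SketchRun, insideHyperlink: boolean): string {
  if (run.underline && !insideHyperlink) {
    return "<u>" + run.text + "</u>";
  }
  return run.text;
}

// Render a hyperlink; its runs are rendered with underline suppressed.
function renderSketchHyperlink(runs: SketchRun[], href: string): string {
  const label = runs.map((r) => renderSketchRun(r, true)).join("");
  return "[" + label + "](" + href + ")";
}
```

Word typically underlines hyperlink runs by default, so without this rule nearly every link label would pick up redundant underline markup.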

Current behavior:

10. Numbering and Lists

Current behavior:

If numbering metadata cannot be resolved, the paragraph falls back to ordinary paragraph behavior rather than forcing a synthetic list.
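
The fallback rule can be sketched as below. The lookup table keyed by "numId:ilvl", the two-space indent per level, and the "1." and "-" markers are assumptions chosen for this example, not a description of the actual data model or output.

```typescript
// Hypothetical lookup: "numId:ilvl" -> list kind, built from numbering.xml.
type NumberingLookup = { [key: string]: "ordered" | "bullet" | undefined };

interface SketchParagraph {
  text: string;
  numId?: string;
  ilvl?: number;
}

function renderParagraphOrListItem(p: SketchParagraph, numbering: NumberingLookup): string {
  if (p.numId === undefined) {
    return p.text; // not a numbered paragraph at all
  }
  const level = p.ilvl === undefined ? 0 : p.ilvl;
  const kind = numbering[p.numId + ":" + level];
  if (kind === undefined) {
    return p.text; // metadata unresolved: ordinary paragraph, no synthetic list
  }
  let indent = "";
  for (let i = 0; i < level; i++) {
    indent += "  ";
  }
  return indent + (kind === "ordered" ? "1. " : "- ") + p.text;
}
```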

11. Tables

Current behavior:

Markdown rendering uses the first row as the header row.
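
A minimal sketch of the first-row-as-header rule follows. It assumes cells have already been flattened to plain strings; escaping of pipe characters and sizing every row to the header width are simplifications of this example, and the function names are hypothetical.

```typescript
// Render one table row, sized to `width` columns.
// Pipes are escaped so cell text cannot break the row.
function renderTableRow(cells: string[], width: number): string {
  const out: string[] = [];
  for (let i = 0; i < width; i++) {
    const cell = i < cells.length ? cells[i] : "";
    out.push(cell.replace(/\|/g, "\\|"));
  }
  return "| " + out.join(" | ") + " |";
}

// Render a pipe table whose first row becomes the header row.
function renderMarkdownTable(rows: string[][]): string {
  if (rows.length === 0) {
    return "";
  }
  const width = rows[0].length;
  const separator: string[] = [];
  for (let i = 0; i < width; i++) {
    separator.push("---");
  }
  const lines = [renderTableRow(rows[0], width), "| " + separator.join(" | ") + " |"];
  for (let r = 1; r < rows.length; r++) {
    lines.push(renderTableRow(rows[r], width));
  }
  return lines.join("\n");
}
```

Markdown pipe tables require a header row, which is why some row has to be promoted even when the source table has no header of its own.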

12. Markdown Rendering

Current behavior:

Anchors are rendered immediately before the owning paragraph, heading, or list item block.
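
The placement rule can be illustrated as follows. The anchor markup shown, an empty <a id="..."></a> element on its own line, is an assumption made for this sketch; the actual emitted form is not stated here. AnchoredBlock and both function names are hypothetical.

```typescript
// Hypothetical block model: rendered Markdown plus the anchors it owns.
interface AnchoredBlock {
  markdown: string;
  anchors: string[];
}

// Escape characters that would break a double-quoted HTML attribute.
function escapeAnchorName(name: string): string {
  return name.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Emit each anchor on its own line, directly before the owning block.
function renderBlockWithAnchors(block: AnchoredBlock): string {
  const anchorLines = block.anchors.map(
    (name) => '<a id="' + escapeAnchorName(name) + '"></a>'
  );
  return anchorLines.concat([block.markdown]).join("\n");
}
```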

13. Summary and Diagnostics

Current summary fields are:

unsupportedCommentTraces currently counts both standalone unsupported blocks and unsupported traces attached to supported blocks.
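
The counting rule can be sketched as below. The SketchBlock union is a hypothetical stand-in for the parsed block model; only the rule itself (standalone unsupported blocks plus traces attached to supported blocks) comes from the description above.

```typescript
// Hypothetical block model for this sketch only.
type SketchBlock =
  | { kind: "unsupported"; trace: string }
  | { kind: "supported"; attachedTraces: string[] };

// Count standalone unsupported blocks plus attached traces.
function countUnsupportedCommentTraces(blocks: SketchBlock[]): number {
  let count = 0;
  for (let i = 0; i < blocks.length; i++) {
    const block = blocks[i];
    if (block.kind === "unsupported") {
      count += 1;
    } else {
      count += block.attachedTraces.length;
    }
  }
  return count;
}
```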

Unsupported element traces currently use a small normalized category set for common cases:

When unsupported traces are rendered as debug HTML comments, comment-breaking sequences from source metadata are sanitized so the debug output does not prematurely close the comment.
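
One way to perform this kind of sanitization is sketched below. It is not necessarily the approach the implementation takes: it breaks up every run of two or more hyphens with spaces, so the trace text can never contain "--" or "-->". The function names and the bare comment layout are assumptions.

```typescript
// Break up hyphen runs so the text cannot contain "--" or "-->".
// Example: "a-->b" becomes "a- ->b".
function sanitizeCommentText(text: string): string {
  return text.replace(/-{2,}/g, (run) => run.split("").join(" "));
}

// Wrap sanitized trace text in an HTML comment (layout is illustrative).
function renderDebugComment(trace: string): string {
  return "<!-- " + sanitizeCommentText(trace) + " -->";
}
```

Sanitizing at render time keeps the stored trace faithful to the source metadata while guaranteeing that the only comment terminator in the output is the one the renderer writes.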

For drawing-like unsupported elements, when an embedded image relationship can be resolved safely, the current debug trace may include the package target in a form such as:

When drawing metadata exposes image alt text through attributes such as descr or title, the current debug trace may append that metadata in a form such as:

Image trace parsing preserves alt text that contains ordinary parentheses.

When drawing metadata exposes wp:extent, the current debug trace may append the EMU size in a form such as:

When unsupported content is found inside a supported paragraph or table, the trace is attached to that owning block and rendered as an adjacent HTML comment only in debug-style output.

Current textbox handling is a limited compromise:

Current image handling is also a limited compromise:

14. Node.js CLI

Browser UI behavior is owned by the separated miku-docx2md-web repository.

Current CLI options include:

--debug and --include-unsupported-comments currently enable the same Markdown behavior.
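
The equivalence of the two flags for Markdown output can be pictured with the sketch below. Only the two flag names come from this document; the function name and the plain argv scan are assumptions, and any additional effects --debug may have outside Markdown rendering are not modeled.

```typescript
// True when either flag is present; both enable the same Markdown behavior.
function wantsUnsupportedComments(argv: string[]): boolean {
  return (
    argv.indexOf("--debug") !== -1 ||
    argv.indexOf("--include-unsupported-comments") !== -1
  );
}
```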

15. Open Items

The main remaining implementation questions are: