miku-docx2md

docx2md Specification

1. Document Overview

docx2md is a tool that reads Word documents in .docx format and converts their textual structure into Markdown.

The goal is not to reproduce the visual appearance of Microsoft Word documents exactly. The goal is to extract document structure and meaningful text in a form that is easy for humans to read and easy for generative AI systems to consume.

This tool should be designed as:

not simply:

This document describes the high-level specification and design policy. Implementation-specific behavior should be documented separately in an implementation specification once the first cut exists.

2. Scope

2.1 Supported Input

2.2 Unsupported Input

The first cut does not target:

2.3 Supported Content in First Cut

The first cut should focus on textual and structural content that maps cleanly into Markdown.

2.4 Out of Scope in First Cut

The first cut intentionally excludes visual and layout-heavy reproduction.

Resolved embedded image files may be exposed as sidecar assets, but this does not imply Word-like layout reproduction. Other visual features may be considered later, but should not complicate the first implementation.

3. Target Documents

docx2md primarily targets document-style inputs whose main value is in the written content and document hierarchy.

4. Design Principles

4.1 Output Purpose

The Markdown output should aim to satisfy the following:

4.2 Conversion Policy

In other words, the conversion policy is:

This is a deliberate output policy, not merely an implementation limitation.

4.3 Relationship to miku-xlsx2md

docx2md may reuse ideas from the sibling app miku-xlsx2md, especially:

Unless there is a clear docx-specific reason to differ, docx2md should imitate miku-xlsx2md in implementation style, naming discipline, test style, and browser/runtime separation. Browser UI composition and Single-file Web App packaging are owned by the separate miku-docx2md-web repository.

However, docx2md should not inherit spreadsheet-specific behavior. There is no table-region detection problem equivalent to Excel sheet analysis. The main parsing targets are document order, paragraph style, numbering, runs, and tables.

4.4 File and Module Split Policy

The source file split should also imitate miku-xlsx2md where practical.

The split should remain reasonable for the smaller docx scope. It is not necessary to copy the spreadsheet app file count mechanically, but the same style should be followed:

In short, prefer modest responsibility-based file splitting over monolithic implementation files, while avoiding artificial fragmentation.

5. Output Unit and File Structure

5.1 Output Unit

The first cut should treat one .docx file as one input and produce one Markdown document as the primary output.

5.2 Default Output

The primary output should be:

When resolved embedded images are exported explicitly, a sidecar asset directory or archive may accompany the Markdown output. The primary output remains the Markdown document; sidecar image export is optional and should not require layout reconstruction.

5.3 Naming

The default output file name should be based on the input document name.

Example:

6. Parsing Model

6.1 Container Handling

A .docx file should be treated as a ZIP package.

As with miku-xlsx2md, ZIP expansion should be implemented in-house rather than delegated to an external ZIP library. The implementation should follow the same general direction as the sibling app:

This is intended to preserve architectural consistency with miku-xlsx2md and keep the core parsing stack understandable and testable.
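As a rough illustration of what in-house ZIP reading involves, the sketch below locates the end-of-central-directory record and lists the central directory entries. The function and type names (listZipEntries, ZipEntryInfo) are hypothetical and not taken from miku-xlsx2md; only the record offsets come from the ZIP format. Decompression, ZIP64, and non-ASCII names are deliberately left out.

```typescript
// Sketch only: names are hypothetical; offsets follow the standard ZIP record layout.
const EOCD_SIGNATURE = 0x06054b50; // "PK\x05\x06", end of central directory
const CENTRAL_SIGNATURE = 0x02014b50; // "PK\x01\x02", central directory file header

interface ZipEntryInfo {
  name: string;
  compressionMethod: number; // 0 = stored, 8 = deflate
  compressedSize: number;
  localHeaderOffset: number;
}

function listZipEntries(bytes: Uint8Array): ZipEntryInfo[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // Scan backward: the EOCD record is 22 bytes plus an optional trailing comment.
  let eocd = -1;
  for (let i = bytes.length - 22; i >= 0; i--) {
    if (view.getUint32(i, true) === EOCD_SIGNATURE) {
      eocd = i;
      break;
    }
  }
  if (eocd < 0) throw new Error("end of central directory record not found");

  const entryCount = view.getUint16(eocd + 10, true);
  let offset = view.getUint32(eocd + 16, true);
  const entries: ZipEntryInfo[] = [];

  for (let n = 0; n < entryCount; n++) {
    if (offset + 46 > bytes.length || view.getUint32(offset, true) !== CENTRAL_SIGNATURE) {
      throw new Error("corrupt central directory");
    }
    const nameLength = view.getUint16(offset + 28, true);
    const extraLength = view.getUint16(offset + 30, true);
    const commentLength = view.getUint16(offset + 32, true);
    if (offset + 46 + nameLength > bytes.length) {
      throw new Error("entry name runs past end of data");
    }

    // Assumes ASCII entry names, which holds for the usual OOXML part names.
    let name = "";
    for (let i = 0; i < nameLength; i++) {
      name += String.fromCharCode(view.getUint8(offset + 46 + i));
    }

    entries.push({
      name,
      compressionMethod: view.getUint16(offset + 10, true),
      compressedSize: view.getUint32(offset + 20, true),
      localHeaderOffset: view.getUint32(offset + 42, true),
    });
    offset += 46 + nameLength + extraLength + commentLength;
  }
  return entries;
}
```

A caller would then look up entries such as word/document.xml by name and inflate the bytes found at localHeaderOffset. That step is outside this sketch.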

The first cut should read at least the following package entries when present:

6.2 Core Internal Model

The internal model may remain small in the first cut.

The first cut may also keep lightweight internal metadata for:

6.3 Reading Order

The parser should preserve the document order found in word/document.xml. This is the primary structural axis for docx2md.

7. Markdown Conversion Rules

7.1 Paragraphs

7.2 Headings

Heading detection in the first cut should use both paragraph style information and outline level information. This combined approach is more robust than relying on only one of them.

Recommended priority:

  1. heading-equivalent paragraph style
  2. outline level
  3. otherwise treat as a normal paragraph

More specifically:

For compatibility, heading recognition should not depend only on display labels. The implementation should prefer structural identifiers such as styleId and resolved style definitions. Localized style names may be used only as a compatibility aid.
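The priority order above could be sketched roughly as follows. This is a minimal illustration, not the specified implementation: ParagraphProps and resolveHeadingLevel are hypothetical names, the styleId pattern only covers the common Heading1 to Heading9 identifiers, and clamping to six Markdown levels is an assumption. A real implementation would also consult resolved style definitions.

```typescript
// Hypothetical model of the paragraph properties relevant to heading detection.
interface ParagraphProps {
  styleId?: string; // value of w:pStyle
  outlineLevel?: number; // value of w:outlineLvl, 0-based in the XML
}

// Assumption: Markdown output is limited to six heading levels.
const MAX_MARKDOWN_LEVEL = 6;

function resolveHeadingLevel(p: ParagraphProps): number | null {
  // 1. Heading-equivalent paragraph style, matched on the structural styleId.
  const match = /^heading([1-9])$/i.exec(p.styleId ?? "");
  if (match) {
    return Math.min(Number(match[1]), MAX_MARKDOWN_LEVEL);
  }

  // 2. Outline level: 0..8 are treated as heading levels 1..9; 9 means body text.
  if (p.outlineLevel !== undefined && p.outlineLevel >= 0 && p.outlineLevel <= 8) {
    return Math.min(p.outlineLevel + 1, MAX_MARKDOWN_LEVEL);
  }

  // 3. Otherwise the paragraph is a normal paragraph.
  return null;
}
```

With this sketch, a paragraph carrying both styleId "Heading2" and outline level 0 resolves to level 2, because the style takes priority over the outline level.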

First-cut compatibility should include at least the following heading-style families when they can be identified safely:

Heading level mapping should be:

The first cut should not infer headings only from appearance. The following alone are not enough to classify a paragraph as a heading:

7.3 Inline Formatting

When multiple inline styles apply to the same text, the rendering order should follow the sibling-app approach for deterministic output. Recommended wrapper application order is:

  1. underline
  2. strike
  3. italic
  4. bold

Because each wrapper encloses the result of the previous step, underline ends up innermost and bold outermost when all four styles are active.
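The wrapper order above can be sketched as a short function. The marker choices here (`<u>` for underline, `~~` for strike, `*` for italic, `**` for bold) are assumptions for illustration, and RunFormat and wrapInline are hypothetical names.

```typescript
// Hypothetical run formatting flags after style resolution.
interface RunFormat {
  bold?: boolean;
  italic?: boolean;
  strike?: boolean;
  underline?: boolean;
}

// Applies wrappers in the recommended order: underline, strike, italic, bold.
// Each step wraps the previous result, so later wrappers end up further out.
function wrapInline(text: string, format: RunFormat): string {
  let out = text;
  if (format.underline) out = `<u>${out}</u>`;
  if (format.strike) out = `~~${out}~~`;
  if (format.italic) out = `*${out}*`;
  if (format.bold) out = `**${out}**`;
  return out;
}
```

For example, `wrapInline("text", { bold: true, italic: true })` yields `***text***`. Applying the wrappers in one fixed order keeps the output deterministic regardless of the order in which the flags appear in the source XML.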

7.5 Lists

List handling in the first cut should use numbering.xml and paragraph numbering properties as the primary structural source. The implementation should not rely on visual indentation alone.

The first cut should support at least:

Recommended interpretation order:

  1. resolve the paragraph numbering reference from the paragraph properties
  2. resolve the numbering instance through numId
  3. resolve the abstract numbering definition
  4. use the paragraph level such as ilvl to determine nesting depth
  5. use the numbering definition to distinguish bullet-style and ordered-style items
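The interpretation order above might look roughly like this once numbering.xml has been parsed into lookup maps. NumberingModel, ListInfo, and resolveListInfo are hypothetical names, and the model shape is an assumption. The sketch only decides ordered versus unordered and nesting depth, and it treats every format other than "bullet" as ordered.

```typescript
// Hypothetical lookup tables built from numbering.xml.
interface NumberingModel {
  numToAbstract: Map<string, string>; // numId -> abstractNumId
  abstractLevels: Map<string, Map<number, string>>; // abstractNumId -> ilvl -> numFmt
}

interface ListInfo {
  ordered: boolean;
  depth: number; // 0-based nesting depth taken from ilvl
}

function resolveListInfo(
  numId: string | undefined,
  ilvl: number | undefined,
  model: NumberingModel,
): ListInfo | null {
  // Step 1: a paragraph without a numbering reference is not a list item.
  if (numId === undefined) return null;

  // Steps 2 and 3: numbering instance -> abstract numbering definition.
  const abstractId = model.numToAbstract.get(numId);
  if (abstractId === undefined) return null;

  // Step 4: the paragraph level determines nesting depth (assumed 0 when absent).
  const depth = ilvl ?? 0;

  // Step 5: the level's number format distinguishes bullets from ordered items.
  const numFmt = model.abstractLevels.get(abstractId)?.get(depth);
  if (numFmt === undefined) return null;

  return { ordered: numFmt !== "bullet", depth };
}
```

Returning null for unresolved references lets the caller fall back to ordinary paragraph rendering instead of guessing from indentation.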

Markdown rendering policy:

The first cut does not need to reproduce every numbering style variation exactly. For example, the following may be normalized while still preserving list structure:

In such cases, preserving ordered vs unordered structure and nesting depth is more important than preserving the exact marker text.

The first cut should also define clear limits:

7.6 Tables

Merged cells should be simplified in the first cut using explicit merge placeholders rather than attempting HTML table reproduction.

Recommended merge rendering policy:

In other words, merge placeholder priority should be:

  1. if there is a parent cell on the left, use ←M←
  2. otherwise, if there is a parent cell above, use ↑M↑
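A minimal sketch of this priority on an already-expanded cell grid follows. MergeRange and fillMergePlaceholders are hypothetical names. The sketch assumes that "a parent cell on the left" means the cell's column is greater than the merge's starting column, so cells in lower rows of a two-dimensional merge also receive ←M← unless they sit in the parent's own column.

```typescript
const LEFT_PLACEHOLDER = "←M←";
const UP_PLACEHOLDER = "↑M↑";

// Hypothetical description of one merged region; (row, col) is the parent cell.
interface MergeRange {
  row: number;
  col: number;
  rowSpan: number;
  colSpan: number;
}

// Returns a copy of the grid where non-parent cells of each merge hold placeholders.
function fillMergePlaceholders(grid: string[][], merges: MergeRange[]): string[][] {
  const out = grid.map((cells) => cells.slice());
  for (const m of merges) {
    for (let r = m.row; r < m.row + m.rowSpan; r++) {
      const cells = out[r];
      if (cells === undefined) continue; // merge extends past the grid
      for (let c = m.col; c < m.col + m.colSpan && c < cells.length; c++) {
        if (r === m.row && c === m.col) continue; // the parent cell keeps its text
        // Priority: a parent to the left wins; otherwise the parent is above.
        cells[c] = c > m.col ? LEFT_PLACEHOLDER : UP_PLACEHOLDER;
      }
    }
  }
  return out;
}
```

For a 2x2 merge whose parent is the top-left cell, this produces ←M← in the top-right cell, ↑M↑ in the bottom-left cell, and ←M← in the bottom-right cell under the stated assumption.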

Additional first-cut table rules:

7.7 Line Break and Whitespace Normalization

Line break and whitespace normalization should prioritize stable Markdown output over layout-oriented fidelity.

Document/block-level rules:

Paragraph-level rules:

Whitespace normalization rules:

Cell-level rules:

List-item text should follow the same general normalization policy as ordinary paragraphs unless a later implementation section defines a narrower exception.

7.8 Style Resolution Depth

The first cut should resolve only the style layers needed for structural extraction and supported inline formatting.

Recommended priority order:

  1. direct formatting on the paragraph or run
  2. character style
  3. paragraph style
  4. inherited style chain via basedOn

Style resolution should be deep enough to support at least:

For supported text emphasis, the intended precedence is:

  1. direct run formatting
  2. character style resolved through rStyle
  3. paragraph-local run properties
  4. paragraph style
  5. inherited basedOn chain

When a supported style flag is explicitly disabled at a narrower scope, that narrower scope should take precedence over inherited formatting.
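The precedence for a single emphasis flag such as bold could be sketched as a "first explicit value wins" lookup. The names below (StyleDef, RunLayers, resolveBold) are hypothetical. Two choices are assumptions of this sketch rather than part of the specification: the basedOn chain is walked per style, and a visited set guards against cyclic chains. An explicit false at a narrower layer stops the search, which is how a disabled flag overrides inherited formatting.

```typescript
// Hypothetical style definition: undefined means "not specified at this layer".
interface StyleDef {
  basedOn?: string;
  bold?: boolean;
}

// Hypothetical inputs gathered for one run.
interface RunLayers {
  direct?: boolean; // direct run formatting
  characterStyleId?: string; // rStyle
  paragraphRun?: boolean; // paragraph-local run properties
  paragraphStyleId?: string; // pStyle
}

// Walks a style and its basedOn ancestors until a layer states the flag explicitly.
function resolveStyleBold(styleId: string | undefined, styles: Map<string, StyleDef>): boolean | undefined {
  const seen = new Set<string>();
  let current = styleId;
  while (current !== undefined && !seen.has(current)) {
    seen.add(current); // cycle guard for malformed basedOn chains
    const def = styles.get(current);
    if (def === undefined) return undefined;
    if (def.bold !== undefined) return def.bold;
    current = def.basedOn;
  }
  return undefined;
}

function resolveBold(layers: RunLayers, styles: Map<string, StyleDef>): boolean {
  const candidates: Array<boolean | undefined> = [
    layers.direct,
    resolveStyleBold(layers.characterStyleId, styles),
    layers.paragraphRun,
    resolveStyleBold(layers.paragraphStyleId, styles),
  ];
  for (const value of candidates) {
    if (value !== undefined) return value; // explicit true or false both win
  }
  return false;
}
```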

The first cut does not need to resolve every style-related visual detail. It should prioritize structure and supported Markdown-facing emphasis rather than Word layout fidelity.

Implementation safety rules:

8. Error and Fallback Policy

This includes a deliberate output-policy compromise similar to miku-xlsx2md:

Unsupported elements may leave an HTML comment trace in the Markdown when that helps preserve document understanding without heavily disturbing readability. However, this should be disabled by default in normal output.

Recommended first-cut fallback policy for unsupported elements:

The first cut may expose this through a dedicated option such as:

The exact option name may be finalized later, but the policy should be:

Examples of acceptable fallback direction:

When multiple XML spellings represent the same broad unsupported feature, the implementation may normalize them into one concise diagnostic category.

An implementation may also choose a narrower compromise for text boxes: preserve plain textual paragraphs from txbxContent when they can be extracted safely, while still treating textbox layout and placement as unsupported.

For image-like drawing content, an implementation may keep the image unsupported while still emitting a debug-oriented placeholder trace that includes the resolved relationship target and, when safely available, metadata such as alt text (descr / title) and drawing extent. When meaningful alt text is available, the implementation may also emit a small non-debug placeholder such as [Image: ...] without attempting inline layout reproduction.

For Node-oriented workflows, an implementation may additionally expose resolved embedded image package entries as sidecar assets without attempting layout reconstruction or automatic Markdown image embedding. When such sidecar asset export is enabled explicitly, the implementation may choose to replace that placeholder with a relative Markdown image link such as ![alt](assets/word/media/example.png). When package content types are available, an implementation may prefer those declarations over file-extension inference when reporting exported asset media types.

An implementation may also include a small manifest file alongside exported assets so downstream tools can recover path, media-type, alt-text, and byte-size metadata without reparsing the source .docx. When useful for diagnostics, that manifest may also include the originating unsupported trace string, the owning block index, and a finer document-position object such as block kind plus per-block trace index.
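As one possible shape for the media-type preference and the manifest described above, the sketch below checks package content-type declarations first and falls back to extension inference. All names (ContentTypes, AssetManifestEntry, resolveMediaType) and the manifest field names are hypothetical. The fallback table is a small illustrative subset.

```typescript
// Hypothetical parsed form of [Content_Types].xml.
interface ContentTypes {
  overrides: Map<string, string>; // part name with leading "/" -> content type
  defaults: Map<string, string>; // lowercase extension -> content type
}

// Hypothetical manifest entry carrying the metadata listed in the text.
interface AssetManifestEntry {
  path: string;
  mediaType: string;
  altText?: string;
  byteSize: number;
}

// Used only when the package declares nothing for the part or its extension.
const EXTENSION_FALLBACK: Record<string, string> = {
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  gif: "image/gif",
};

function resolveMediaType(entryPath: string, contentTypes: ContentTypes): string {
  // Part-specific declarations take priority over per-extension defaults.
  const partName = "/" + entryPath.replace(/^\/+/, "");
  const override = contentTypes.overrides.get(partName);
  if (override !== undefined) return override;

  const dot = entryPath.lastIndexOf(".");
  const ext = dot >= 0 ? entryPath.slice(dot + 1).toLowerCase() : "";
  return contentTypes.defaults.get(ext) ?? EXTENSION_FALLBACK[ext] ?? "application/octet-stream";
}

function buildManifestEntry(
  entryPath: string,
  byteSize: number,
  contentTypes: ContentTypes,
  altText?: string,
): AssetManifestEntry {
  const entry: AssetManifestEntry = {
    path: entryPath,
    mediaType: resolveMediaType(entryPath, contentTypes),
    byteSize,
  };
  if (altText !== undefined && altText.trim() !== "") entry.altText = altText;
  return entry;
}
```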

The first cut should prefer concise comment traces over large raw XML dumps or long explanatory blocks.
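A concise trace could be produced along these lines. The option name includeUnsupportedComments, the trace format, and the category mapping are all placeholders invented for this sketch, since the specification leaves the exact option name and wording open.

```typescript
// Hypothetical option bag; tracing is off unless explicitly enabled.
interface TraceOptions {
  includeUnsupportedComments?: boolean;
}

// Illustrative assumption: different XML spellings collapse into one category.
const CATEGORY_BY_ELEMENT: Record<string, string> = {
  "w:drawing": "drawing",
  "w:pict": "drawing",
};

function renderUnsupportedTrace(elementName: string, options: TraceOptions = {}): string {
  if (options.includeUnsupportedComments !== true) return "";

  const category = CATEGORY_BY_ELEMENT[elementName] ?? elementName;
  // "--" must not appear inside an HTML comment, so collapse it defensively.
  const safe = category.replace(/-{2,}/g, "-").trim();
  return `<!-- unsupported: ${safe} -->`;
}
```

Keeping the trace to a single short comment line follows the stated preference for concise traces over raw XML dumps.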

9. Summary and Diagnostics

The first cut should maintain a lightweight conversion summary and unsupported-element diagnostics, following the general sibling-app philosophy of keeping conversion behavior observable.

Recommended summary items include at least:

unsupportedCommentTraces need not count only top-level unsupported blocks; it may also include traces attached to supported blocks that contained unsupported nested elements.

10. First-Cut Test Coverage

The first cut should be validated primarily through fixture-based tests. The fixture set should be small but intentionally representative of the supported feature boundaries.

Recommended first-cut coverage includes at least:

The first-cut test set should prioritize deterministic Markdown output. Tests should prefer exact-output assertions for stable representative fixtures whenever practical.

11. Runtime and Packaging Direction

The intended direction is to follow the sibling app style where practical.

Current implementations may include CLI support when it follows the same local-processing and testable-core direction.

12. Initial Development Priorities

Recommended implementation order:

  1. ZIP entry reading for .docx
  2. plain paragraph extraction
  3. inline run formatting
  4. headings
  5. hyperlinks
  6. lists via numbering
  7. tables
  8. summary and tests