Addyco

# Search: which is better/more optimal for llm ingestion mdx or markdoc

> Wednesday, April 15, 2026 · `gemini-3.1-pro-preview`

## Findings

When evaluating document formats for Large Language Model (LLM) ingestion, such as building a knowledge base for Retrieval-Augmented Generation (RAG) or fine-tuning, **Markdoc is generally a better choice than MDX.**

While both formats are extensions of Markdown, their underlying philosophies differ significantly. Markdoc is strictly declarative and optimized for structured content parsing, whereas MDX is designed to embed executable JavaScript/React components directly into the text. For an LLM, executable code introduces syntactic noise, making Markdoc the cleaner, more token-efficient option.

The sections below compare the two formats for LLM ingestion.

---

### What They Are

* **MDX (Markdown + JSX):** An authoring format that allows you to write JSX (React, Vue, or other framework components) directly inside Markdown files. It enables highly interactive, component-driven documentation.
* **Markdoc:** An open-source, Markdown-based syntax and toolchain created by Stripe. It uses a custom, fully declarative tag syntax (e.g., `{% callout type="warning" %}`) to extend Markdown without introducing arbitrary programming logic.

### How They Work in LLM Pipelines

To ingest documents into an LLM pipeline, the text must be read, parsed into a semantic structure, chunked into smaller segments (usually 256–1,024 tokens), embedded into a vector database, and fed to the LLM at query time. A high-quality ingestion format requires a **high signal-to-noise ratio**, **predictable structure**, and **metadata extraction capabilities**.

* **Ingesting MDX:** MDX parsers must handle both CommonMark and JavaScript/JSX ASTs (Abstract Syntax Trees). An MDX file often contains `import` statements, `export` variables, and inline JavaScript logic (e.g., `{array.map(item => <Component />)}`). To optimize this for an LLM, data engineers must use specialized chunkers to strip out the JSX logic, which can otherwise confuse the LLM and inflate token costs.
* **Ingesting Markdoc:** Markdoc compiles content into a strict, predictable AST. Because Markdoc separates content from rendering logic, the AST contains pure data. During ingestion, a Python or Node.js script can traverse the Markdoc AST, strip out custom tags, preserve the raw text, extract the frontmatter (metadata), and chunk the document semantically based on native Markdown headers.

### Key Features: MDX vs. Markdoc for Ingestion

| Feature | Markdoc (Optimal) | MDX (Less Optimal) |
| :--- | :--- | :--- |
| **Token Efficiency** | **High.** Generates clean, declarative tags that are easily stripped or left as-is with minimal token overhead. | **Low.** JSX components, CSS-in-JS props, and import/export statements consume many tokens. |
| **Syntactic Noise** | **Minimal.** `{% myTag prop="value" %}` is distinct and easy for both regex and AST parsers to isolate. | **High.** Angle brackets (`< >`) and nested curly braces (`{ }`) can easily disrupt standard text-splitters. |
| **AST Parsing & Chunking** | **Excellent.** The built-in Markdoc parser yields a highly structured, serialized JSON tree well suited to semantic RAG chunking. | **Complex.** Requires heavy toolchains (like `remark`/`rehype` unified pipelines) to safely parse and strip out the React code. |
| **Code Execution / Safety** | **Strictly safe.** Fully declarative; cannot execute arbitrary code. Ideal for enterprise ingestion pipelines. | **Turing-complete.** MDX can execute arbitrary JavaScript. If the ingestion source is untrusted, it introduces security risks. |

### Current Status and Notable Details

* **Markdoc:** As of early 2026, Markdoc is on version **0.5.x** (with recent minor patches like v0.5.7). Because it was built by Stripe specifically to handle massive, structured documentation portals without tying the content to a specific frontend framework, it is well suited to data extraction.
* **MDX:** Currently on the **v3.x** branch (v3 was first released in late 2023). MDX remains the favorite of frontend developers building static sites (via Next.js, Docusaurus, or Mintlify).
* **The AI Twist (Generative UI):** While Markdoc wins for *ingestion*, **MDX is highly popular for LLM *generation***. Modern AI frameworks (like Vercel AI SDK) frequently prompt the LLM to output MDX so the frontend can instantly render interactive React components (Generative UI).

### Related Tools and Alternatives

If you are designing a pipeline specifically for LLM knowledge bases, you might also consider:

1. **Pure Markdown (CommonMark / GFM):** The gold standard for LLM ingestion, because it carries no component or tag overhead. If you don't *need* custom frontend components, downgrade your files to standard `.md`.
2. **Quarto / Jupyter Notebooks:** For heavily computational pipelines where the LLM needs to ingest code alongside data outputs (e.g., Python execution). They are much heavier than Markdoc but serve scientific RAG use cases.
3. **AsciiDoc / reStructuredText (Sphinx):** Legacy documentation formats that are highly structured, but their parsers are often less friendly to modern Node/Python AI stacks than Markdown-based ASTs.

### Summary Verdict

For **LLM ingestion**, **Markdoc** is the better choice of the two. Its declarative nature and clean AST allow you to easily extract semantic context and minimize token waste.

If your documents are already in **MDX**, use AST-aware parsers (from the `unified`/`remark` ecosystem) to strip away the JavaScript and JSX node trees before chunking the remaining text into your vector database.

## Sources

1. mindstudio.ai
2. markdownconverters.com
3. webcraft.tw
4. specstory.com
5. anythingmd.com

---

*Search queries: "MDX version 2025 2026", "Markdoc LLM ingestion OR RAG", "Markdoc version 2025 2026", ""MDX vs Markdoc" LLM OR RAG", "MDX LLM ingestion OR RAG"*
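
The Markdoc ingestion path described under "How They Work in LLM Pipelines" (strip custom tags, keep the raw text, extract the frontmatter, chunk on Markdown headers) can be approximated with a short script. This is only a minimal sketch under stated assumptions: it uses regular expressions as a stand-in for traversing the real Markdoc AST, it handles only flat `key: value` frontmatter, and the names `split_frontmatter` and `chunk_by_headings` are hypothetical helpers rather than part of any Markdoc API.

```python
import re

# Assumption: Markdoc tags look like {% name attr="v" %} or {% /name %}.
# A regex is a stand-in here; a real pipeline would walk the Markdoc AST.
TAG_RE = re.compile(r"\{%.*?%\}", re.DOTALL)
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_frontmatter(text):
    """Return (metadata dict, body). Only flat `key: value` pairs are parsed."""
    meta = {}
    if text.startswith("---\n"):
        end = text.find("\n---", 3)
        if end != -1:
            for line in text[4:end].splitlines():
                key, sep, value = line.partition(":")
                if sep:
                    meta[key.strip()] = value.strip()
            text = text[end + 4:]
    return meta, text


def chunk_by_headings(body):
    """Strip Markdoc tags, then emit one (heading, text) chunk per heading.

    Limitation: a line starting with '#' inside a fenced code block would be
    mistaken for a heading; this sketch does not track fences.
    """
    chunks = []
    heading, lines = None, []
    for line in TAG_RE.sub("", body).splitlines():
        match = HEADING_RE.match(line)
        if match:
            text = "\n".join(lines).strip()
            if heading is not None or text:
                chunks.append((heading, text))
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    text = "\n".join(lines).strip()
    if heading is not None or text:
        chunks.append((heading, text))
    return chunks


# Hypothetical sample document, made up for illustration.
SAMPLE = """---
title: Refunds
---
# Refunds

{% callout type="warning" %}
Refunds are final.
{% /callout %}

## Limits

Up to 30 days.
"""

meta, body = split_frontmatter(SAMPLE)
chunks = chunk_by_headings(body)
```

Each chunk keeps its heading, which can be stored next to the embedding together with the frontmatter metadata. Long sections would still need a second split to stay within the 256–1,024 token range mentioned above.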

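For documents that are already in MDX, the Summary Verdict recommends AST-aware parsers. Where that toolchain is not available, a crude line-based pre-filter can illustrate what "stripping the JSX logic" means in practice. The sketch below is an assumption-laden approximation, not a substitute for a `unified`/`remark` pipeline: it drops whole lines that begin with `import`, `export`, `<`, or `{` outside fenced code blocks, so multi-line JSX attributes and inline JSX inside prose would slip through. The function name `strip_mdx_noise` is hypothetical.

```python
FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block


def strip_mdx_noise(text):
    """Drop ESM lines and block-level JSX/expression lines outside code fences.

    Assumption: JSX blocks and {expressions} each fit on a single line.
    """
    kept = []
    in_fence = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Toggle fence state and keep the fence line itself.
            in_fence = not in_fence
            kept.append(line)
        elif in_fence:
            # Code samples are content; keep them verbatim.
            kept.append(line)
        elif stripped.startswith(("import ", "export ", "<", "{")):
            # ESM statement or block-level JSX / expression: treat as noise.
            continue
        else:
            kept.append(line)
    return "\n".join(kept)


# Hypothetical sample document, made up for illustration.
SAMPLE_MDX = "\n".join([
    'import { Chart } from "./chart"',
    "export const year = 2026",
    "",
    "# Usage",
    "",
    "<Chart year={year} />",
    "",
    "Plain prose stays.",
    FENCE + "js",
    'import fs from "fs"',
    FENCE,
])

cleaned = strip_mdx_noise(SAMPLE_MDX)
```

The filter keeps fenced code samples untouched on the assumption that they are documentation content worth embedding, while the top-level `import`/`export` lines and the `<Chart />` element are removed before chunking.
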