
Example: Advanced Analysis & RAG Pipelines

Last Updated March 27, 2026

Omni-MDX transforms your documents into highly structured data. This makes it a natural fit for data engineering and AI pipelines: you can chunk documents intelligently and extract metadata before feeding them to an LLM or a vector database.
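To illustrate what heading-based chunking might look like, here is a minimal sketch. The dict-based node shape is an illustrative stand-in, not the actual omni-mdx AST classes:

```python
# Sketch: split a flat node list into one chunk per top-level heading.
# The {"type": ..., "level": ..., "text": ...} node shape is illustrative only.
def chunk_by_heading(nodes, level=2):
    chunks, current = [], None
    for node in nodes:
        if node["type"] == "Heading" and node["level"] <= level:
            if current:
                chunks.append(current)
            current = {"title": node["text"], "nodes": []}
        elif current:
            current["nodes"].append(node)
    if current:
        chunks.append(current)
    return chunks

doc = [
    {"type": "Heading", "level": 2, "text": "Vector Embeddings"},
    {"type": "Paragraph", "text": "To inject context into an LLM..."},
    {"type": "Heading", "level": 2, "text": "Chunking"},
    {"type": "Paragraph", "text": "Split documents by headings..."},
]
print([c["title"] for c in chunk_by_heading(doc)])
# -> ['Vector Embeddings', 'Chunking']
```

Each chunk carries its heading as a natural label, which is often a good unit of context for embedding or retrieval.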

ℹ️ Information
Full Source Code: Clone and test this environment directly from omni-mdx-sandbox/python/advanced-analysis.

Building a Semantic Document Miner

In this scenario, we want to extract the table of contents, isolate all mathematical formulas, and retrieve custom metadata tags hidden in the document.

1. The Source MDX File

Your document contains mixed content, including LaTeX math and invisible metadata components used strictly for your data pipeline:

mdx
## Vector Embeddings
<DocMeta author="Alice" department="Research" />
To inject context into an LLM, we use cosine similarity:
$$\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$$


2. The Python Data Pipeline

By recursively walking the AST, you can selectively extract exactly what you need without mixing logic.

python
import omni_mdx

def analyze_document(mdx_source: str) -> dict:
    ast = omni_mdx.parse(mdx_source)
    summary = {"formulas": [], "metadata": []}

    def walk(nodes):
        for node in nodes:
            # Isolate mathematical formulas
            if node.node_type in ("InlineMath", "BlockMath"):
                summary["formulas"].append(node.content)
                continue  # Skip children so formula contents aren't re-walked as text

            # Extract JSX metadata components
            elif node.node_type == "DocMeta":
                summary["metadata"].append(node.attributes or {})
            
            walk(node.children)

    walk(ast.nodes)
    return summary

# Result: 
# {
#   "formulas": ["\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}"],
#   "metadata": [{"author": "Alice", "department": "Research"}]
# }

This ensures that your LLM context window receives clean, structured JSON rather than raw, noisy Markdown formatting.
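The scenario above also calls for extracting a table of contents. A minimal sketch of that step, again using illustrative dict nodes rather than the real omni-mdx node classes, collects heading nodes in document order:

```python
# Sketch: recursively collect headings into a table of contents.
# Node shapes here are illustrative, not the actual omni-mdx API.
def extract_toc(nodes):
    toc = []
    for node in nodes:
        if node.get("type") == "Heading":
            toc.append({"level": node["level"], "title": node["text"]})
        toc.extend(extract_toc(node.get("children", [])))
    return toc

sample = [
    {"type": "Heading", "level": 2, "text": "Vector Embeddings"},
    {"type": "Section", "children": [
        {"type": "Heading", "level": 3, "text": "Cosine Similarity"},
    ]},
]
print(extract_toc(sample))
# -> [{'level': 2, 'title': 'Vector Embeddings'},
#     {'level': 3, 'title': 'Cosine Similarity'}]
```

The same recursive-walk pattern used for formulas and metadata applies here; the TOC list can then be serialized alongside the summary dict.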
