Lexing & Tokenization

Last Updated March 24, 2026

Before an Abstract Syntax Tree (AST) can be built, the raw MDX text must be read and broken down into understandable chunks. This process is called Lexing (or Tokenization).

In Omni-Core, the lexer is the front line of our performance architecture. It is designed to read the input string exactly once, in a single forward pass, with zero backtracking.

Rust Lexer: Linear Byte Scanning

```
INPUT  (&[u8]):       # Intro <Box>$$E=mc^2$$</Box>

OUTPUT (Vec<Token>):  Heading        Span [0..8]
                      JsxOpen        Span [8..13]
                      MathDelimiter  Span [13..15]
                      RawText        Span [15..21]
                      MathDelimiter  Span [21..23]
                      JsxClose       Span [23..29]
```

The Regex Bottleneck

Traditional JavaScript parsers rely heavily on Regular Expressions (Regex) to find headings, links, or JSX tags. While Regex is convenient to write, it is computationally expensive. Complex Markdown/JSX files often trigger Regex backtracking, where the engine wastes CPU cycles re-evaluating the same characters multiple times when a pattern fails to match.

To achieve sub-millisecond parsing, Omni-Core entirely abandons Regular Expressions.

Byte-Level Scanning

Instead of treating the input as a high-level string, the Omni-Core lexer consumes the text as a raw stream of UTF-8 bytes (`&[u8]`) or character iterators in Rust.

We use a custom-built state machine that scans the text byte by byte:

  • If it sees a `#`, it checks the next characters for a space before emitting a heading token.
  • If it sees a `$`, it enters a specialized “Math Lexing” state.
  • If it sees a `<`, it checks whether it’s an HTML tag or a JSX component (by looking for an uppercase letter, as in `<Box>`).

Because Rust slice access compiles down to direct memory reads with no abstraction overhead, fetching the next byte is virtually instantaneous.
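The dispatch described above can be sketched as a single forward loop that matches on the current byte. This is an illustrative simplification, not the actual Omni-Core source: the `TokenKind` variants, span boundaries, and scanning rules here are assumptions.

```rust
// Simplified sketch of a single-pass, byte-level dispatch loop.
// `TokenKind` variants and the scanning rules are illustrative,
// not the real Omni-Core internals.

#[derive(Debug, PartialEq)]
enum TokenKind { Heading, MathDelimiter, JsxOpen, HtmlOpen, Text }

#[derive(Debug, PartialEq)]
struct Token { kind: TokenKind, start: usize, end: usize }

fn lex(input: &[u8]) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let start = pos;
        let kind = match input[pos] {
            b'#' => {
                // Consume the run of '#' plus one following space.
                while pos < input.len() && input[pos] == b'#' { pos += 1; }
                if pos < input.len() && input[pos] == b' ' { pos += 1; }
                TokenKind::Heading
            }
            b'$' => {
                // One or two dollars delimit inline/display math.
                pos += 1;
                if pos < input.len() && input[pos] == b'$' { pos += 1; }
                TokenKind::MathDelimiter
            }
            b'<' => {
                // Uppercase after '<' means JSX; lowercase means HTML.
                pos += 1;
                let jsx = pos < input.len() && input[pos].is_ascii_uppercase();
                while pos < input.len() && input[pos] != b'>' { pos += 1; }
                if pos < input.len() { pos += 1; } // consume '>'
                if jsx { TokenKind::JsxOpen } else { TokenKind::HtmlOpen }
            }
            _ => {
                // Plain text: run until the next special byte.
                while pos < input.len() && !matches!(input[pos], b'#' | b'$' | b'<') {
                    pos += 1;
                }
                TokenKind::Text
            }
        };
        tokens.push(Token { kind, start, end: pos });
    }
    tokens
}
```

Note that the loop only ever moves forward: every byte is examined exactly once, which is what rules out regex-style backtracking by construction.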

Handling the “Tri-Syntax”

Parsing standard Markdown is relatively easy. Parsing MDX is notoriously difficult because the lexer must seamlessly juggle three entirely different syntaxes at the same time:

  1. CommonMark (Markdown): Indentation-sensitive, line-based syntax.
  2. JSX (React): XML-like tags with JavaScript expressions inside attributes (prop={value}).
  3. LaTeX (Math): Dollar-delimited blocks that must ignore both Markdown and JSX rules inside them.

Omni-Core handles this by context-switching. When the lexer encounters $$, it temporarily disables Markdown and JSX rules, treating everything inside as raw text until it finds the closing $$. This prevents catastrophic bugs where x < y inside an equation is accidentally parsed as a JSX opening tag!
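That context switch can be sketched as a dedicated scanning state. The function name and return shape below are hypothetical; the point is that between the delimiters, no byte is special except the closing `$$`.

```rust
// Sketch of the "math mode" context switch. Once the lexer has
// consumed an opening `$$`, every byte (including `<` and `#`) is
// raw text until the closing `$$`. Names are illustrative.

/// `pos` points just past an opening `$$`. Returns the span of the
/// raw math body and the offset just past the closing delimiter.
fn scan_math_body(input: &[u8], mut pos: usize) -> (std::ops::Range<usize>, usize) {
    let start = pos;
    while pos + 1 < input.len() {
        if input[pos] == b'$' && input[pos + 1] == b'$' {
            return (start..pos, pos + 2); // body excludes the delimiters
        }
        pos += 1; // '<', '#', etc. have no special meaning here
    }
    (start..input.len(), input.len()) // unterminated math: take the rest
}
```

With this state active, the `<` in `$$x < y$$` is swallowed as part of the math body and never reaches the JSX-detection logic.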

Zero-Allocation Tokens

As the lexer scans the document, it produces Tokens. To keep memory usage close to zero, our tokens do not copy the text they represent.

Instead, a token simply stores its Type and a Span (the start and end indices in the original string).

```rust
// A simplified view of an Omni-Core Token
pub struct Token {
    pub kind: TokenKind,
    pub start: usize,
    pub end: usize,
}
```

By deferring text allocation until the very end of the pipeline (when the OCP buffer is sent to the host language), the Rust engine avoids triggering the system allocator, saving precious milliseconds on large documents.
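When the text of a token is eventually needed, resolving a span amounts to borrowing a slice of the original source. The helper below is hypothetical (the real Omni-Core API may differ), but it shows why span-based tokens stay zero-copy:

```rust
// Hypothetical helper (not the real Omni-Core API): resolving a
// span against the source buffer borrows a slice; no text is copied.
pub struct Token {
    pub start: usize,
    pub end: usize,
}

impl Token {
    /// Borrow the token's text from the original source: zero-copy.
    pub fn text<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }
}
```

For example, a `JsxOpen` token spanning `[8..13]` of `# Intro <Box>...` resolves to `<Box>` without allocating a new string.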

