Architecture

mehen’s internals are organized around a clean separation between parsing, analyzing, and reporting.

Crate layout

Crate	Responsibility
`mehen-cli`	CLI binary — entry point, command routing, exit codes.
`mehen-engine`	Pipeline orchestration — `run_diff`, `run_top_offenders`, registry, language detection.
`mehen-core`	Parser-neutral domain types and the `LanguageAnalyzer` trait.
`mehen-metrics`	Shared metric formulas, accumulators, finalizers (Halstead, cyclomatic, cognitive, MI, ABC, LOC, NOM, NPA, NPM, WMC).
`mehen-<lang>`	Per-language analyzer crates. Each owns parsing and metric interpretation.
`mehen-tree-sitter`	Shared tree-sitter wrapper and CST traversal helpers, used only by the tree-sitter-backed languages.
`mehen-antlr`	Shared support for ANTLR-backed analyzers (Kotlin, Java) — re-exports the runtime and provides char→byte span conversion, recovered-error diagnostics, and hidden-channel comment (CLOC) extraction.
`mehen-markdown`	Markdown analyzer (pulldown-cmark) with embedded-code dispatch via `LanguageDispatcher`.
`mehen-sql`	SQL analyzer (sqruff). Publishes a dedicated `sql.*` metric family instead of the source-code families; the sqruff CST is confined behind a parser-neutral `SqlFileFacts` adapter.
`mehen-git`	Git operations for `mehen diff`.
`mehen-report`	JSON, YAML, TOML, and GitHub-Markdown rendering.
`xtask`	Developer-only commands (kind-enum codegen, AST dumps, audits).

Parser diversity, single contract

A core architectural decision: each language uses the parser best suited to it, but every analyzer returns the same LanguageAnalysis shape. mehen does not force a single AST model across languages.
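The single-contract idea can be illustrated with a minimal sketch. Everything below is hypothetical: the fields of LanguageAnalysis, the method signatures on LanguageAnalyzer, and the two toy backends are invented for illustration, and crude line-prefix matching stands in for real parsing.

```rust
/// Parser-neutral result: owned data only, no borrowed parser nodes.
/// (Hypothetical fields; the real type is not shown in this document.)
#[derive(Debug, Clone, PartialEq)]
pub struct LanguageAnalysis {
    pub language: String,
    pub functions: usize,
    pub decision_points: usize,
}

/// The one contract every backend implements, whatever parser it wraps.
/// (Hypothetical signature.)
pub trait LanguageAnalyzer {
    fn language(&self) -> &'static str;
    fn analyze(&self, source: &str) -> LanguageAnalysis;
}

/// Counts lines whose first non-blank text starts with any of `prefixes`.
fn count_lines_starting_with(source: &str, prefixes: &[&str]) -> usize {
    source
        .lines()
        .map(str::trim_start)
        .filter(|line| prefixes.iter().any(|p| line.starts_with(*p)))
        .count()
}

/// Toy backend standing in for an analyzer built on a typed AST.
pub struct ToyPython;
/// Toy backend standing in for an analyzer built on a CST.
pub struct ToyGo;

impl LanguageAnalyzer for ToyPython {
    fn language(&self) -> &'static str {
        "python"
    }
    fn analyze(&self, source: &str) -> LanguageAnalysis {
        LanguageAnalysis {
            language: self.language().to_string(),
            functions: count_lines_starting_with(source, &["def "]),
            decision_points: count_lines_starting_with(source, &["if ", "for "]),
        }
    }
}

impl LanguageAnalyzer for ToyGo {
    fn language(&self) -> &'static str {
        "go"
    }
    fn analyze(&self, source: &str) -> LanguageAnalysis {
        LanguageAnalysis {
            language: self.language().to_string(),
            functions: count_lines_starting_with(source, &["func "]),
            decision_points: count_lines_starting_with(source, &["if ", "for ", "switch "]),
        }
    }
}

/// Downstream code sees only the shared shape, never a parser type.
pub fn report(analyzer: &dyn LanguageAnalyzer, source: &str) -> String {
    let a = analyzer.analyze(source);
    format!("{}: {} functions, {} decisions", a.language, a.functions, a.decision_points)
}
```

Because report takes a trait object and reads only LanguageAnalysis, adding a backend in this sketch means adding one impl block and touching nothing downstream.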

Language	Parser	What it gives us
Python	Ruff	Modern Python syntax (match, exception groups, f-strings, async); typed AST plus semantic model.
TS / JS / TSX / JSX	Oxc	Decorators, class fields, parameter properties, JSX nesting, `satisfies`, dynamic import.
PHP	Mago	Attributes, promoted properties, enums, traits, readonly, null-safe calls, `match`.
Ruby	Prism	Blocks, lambdas, numbered params, rescue modifiers, endless methods, pattern matching.
Rust	`ra_ap_syntax`	The same syntax library rust-analyzer uses internally.
Kotlin	ANTLR (official Kotlin spec grammar)	Semantically-named rules for `when` entries, elvis `?:`, safe-call `?.`, `catch` blocks, labeled jumps, and property accessors — richer than a tree-sitter CST.
Java	ANTLR (grammars-v4 Java grammar)	Records, sealed types, switch expressions, text blocks, pattern matching, modules — as first-class, semantically-named rules.
SQL	sqruff	Dialect-aware SQL CST (postgres, T-SQL, snowflake, bigquery, …) for CTE/join/subquery structure and object-touch analysis.
Markdown	pulldown-cmark	CommonMark + GFM with byte-accurate spans.
Go, C, PowerShell	tree-sitter	Mature grammars where tree-sitter’s coverage is the best available.

mehen-metrics owns the math (Halstead formulas, MI variants, cyclomatic accumulators); each per-language analyzer owns the interpretation — which syntax constructs count as decisions, which tokens classify as Halstead operators, which members count as public methods.
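As a rough illustration of that split, the sketch below keeps the standard Halstead volume formula, V = N * log2(n), in one shared function and leaves the operator/operand decision to a per-language classifier. The type HalsteadCounts, the function names, and the toy classifier are assumptions made for this example, not the actual mehen-metrics API.

```rust
use std::collections::HashSet;

/// Shared math: Halstead volume V = N * log2(n), where N is the total
/// number of operators and operands and n is the number of distinct ones.
pub fn halstead_volume(
    distinct_operators: usize,
    distinct_operands: usize,
    total_operators: usize,
    total_operands: usize,
) -> f64 {
    let vocabulary = (distinct_operators + distinct_operands) as f64;
    let length = (total_operators + total_operands) as f64;
    if vocabulary == 0.0 {
        return 0.0; // empty input: avoid log2(0)
    }
    length * vocabulary.log2()
}

/// Shared accumulator (hypothetical): it only stores what it is told.
#[derive(Default)]
pub struct HalsteadCounts {
    operators: HashSet<String>,
    operands: HashSet<String>,
    total_operators: usize,
    total_operands: usize,
}

impl HalsteadCounts {
    pub fn record(&mut self, token: &str, is_operator: bool) {
        if is_operator {
            self.operators.insert(token.to_string());
            self.total_operators += 1;
        } else {
            self.operands.insert(token.to_string());
            self.total_operands += 1;
        }
    }

    pub fn volume(&self) -> f64 {
        halstead_volume(
            self.operators.len(),
            self.operands.len(),
            self.total_operators,
            self.total_operands,
        )
    }
}

/// Per-language interpretation (toy): which tokens count as operators.
pub fn toy_is_operator(token: &str) -> bool {
    matches!(token, "+" | "-" | "*" | "/" | "=" | "if" | "return")
}

/// Feeds a token stream through a language-specific classifier.
pub fn accumulate(tokens: &[&str], is_operator: impl Fn(&str) -> bool) -> HalsteadCounts {
    let mut counts = HalsteadCounts::default();
    for token in tokens {
        counts.record(token, is_operator(token));
    }
    counts
}
```

For the token stream a = b + b, this toy classifier yields 2 distinct operators, 2 distinct operands, and a length of 5, so the volume is 5 * log2(4) = 10. A different language would swap in its own classifier and reuse the same formula unchanged.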

Pipeline

file path
  ↓
language detection (extension-driven)
  ↓
parser (per-language: Ruff / Oxc / Mago / Prism / ra_ap_syntax /
        ANTLR / sqruff / pulldown-cmark / tree-sitter)
  ↓
analyzer.analyze(tree)              ← per-language metric interpretation
  ↓
LanguageAnalysis  (Send + 'static)  ← parser-neutral, owned, no parser-arena lifetimes
  ↓
metric finalizers (mehen-metrics)
  ↓
reporter (mehen-report)             ← JSON / Markdown / YAML / TOML

The key constraint: LanguageAnalysis is parser-neutral and Send + 'static. Because every analysis owns its data and borrows nothing from the parser, the engine can analyze files in parallel via rayon without the lifetimes of arena-backed parsers (Oxc, Mago) crossing thread boundaries.
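A minimal sketch of why the owned result matters, under stated assumptions: std::thread::spawn stands in for rayon so the example has no dependencies, and Arena, BorrowedTree, and the fields of LanguageAnalysis are hypothetical stand-ins. thread::spawn requires the closure's return value to be Send + 'static, which makes the bound visible at compile time.

```rust
use std::thread;

/// Hypothetical stand-in for the real LanguageAnalysis: owned fields only.
#[derive(Debug, PartialEq)]
pub struct LanguageAnalysis {
    pub path: String,
    pub lines: usize,
}

/// Compiles only if `T` is `Send + 'static`.
fn assert_send_static<T: Send + 'static>() {}

/// Pretend parser arena that owns the source text.
struct Arena {
    text: String,
}

/// Pretend syntax tree that borrows from the arena, so it cannot outlive it.
struct BorrowedTree<'a> {
    lines: Vec<&'a str>,
}

/// Parses and analyzes inside one stack frame. The arena and the borrowed
/// tree are dropped on return; only owned data leaves the function.
fn analyze(path: String, source: String) -> LanguageAnalysis {
    let arena = Arena { text: source };
    let tree = BorrowedTree {
        lines: arena.text.lines().collect(),
    };
    LanguageAnalysis {
        path,
        lines: tree.lines.len(),
    }
}

/// One worker thread per file; results come back in input order.
pub fn analyze_parallel(files: Vec<(String, String)>) -> Vec<LanguageAnalysis> {
    assert_send_static::<LanguageAnalysis>();
    let handles: Vec<_> = files
        .into_iter()
        .map(|(path, source)| thread::spawn(move || analyze(path, source)))
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

Returning BorrowedTree from the worker instead would be rejected by the compiler, because it borrows from an arena that dies with the worker's stack frame.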

Markdown is special

The Markdown analyzer (mehen-markdown) parses the document into a block/inline AST and dispatches code fences back through LanguageDispatcher. That is how the Markdown Halstead and cyclomatic complexity (MCC) metrics credit embedded code blocks without duplicating any language logic.
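The dispatch idea can be sketched as follows. This is a simplified illustration, not the crate's implementation: a hand-rolled scanner that only recognizes triple-backtick fences at column 0 replaces pulldown-cmark, and the LanguageDispatcher signature and ToyDispatcher are assumptions.

```rust
/// Hypothetical dispatcher contract: the name comes from the text above,
/// the signature is assumed for this sketch.
pub trait LanguageDispatcher {
    /// Returns a decision-point count if `lang` has an analyzer.
    fn dispatch(&self, lang: &str, code: &str) -> Option<usize>;
}

/// Toy dispatcher: counts lines starting with "if " for two languages.
pub struct ToyDispatcher;

impl LanguageDispatcher for ToyDispatcher {
    fn dispatch(&self, lang: &str, code: &str) -> Option<usize> {
        match lang {
            "python" | "rust" => Some(
                code.lines()
                    .filter(|line| line.trim_start().starts_with("if "))
                    .count(),
            ),
            _ => None, // no analyzer registered for this language
        }
    }
}

/// Collects (info string, code) pairs from triple-backtick fences.
pub fn fenced_blocks(markdown: &str) -> Vec<(String, String)> {
    let fence = "`".repeat(3);
    let mut blocks = Vec::new();
    let mut current: Option<(String, String)> = None;
    for line in markdown.lines() {
        match (line.strip_prefix(fence.as_str()), current.take()) {
            // Opening fence: remember the language tag.
            (Some(info), None) => current = Some((info.trim().to_string(), String::new())),
            // Closing fence: the block is complete.
            (Some(_), Some(block)) => blocks.push(block),
            // Inside a fence: accumulate the code line.
            (None, Some((lang, mut code))) => {
                code.push_str(line);
                code.push('\n');
                current = Some((lang, code));
            }
            // Ordinary prose line.
            (None, None) => {}
        }
    }
    blocks
}

/// Credits the document with the decision points of its embedded code.
pub fn embedded_decisions(markdown: &str, dispatcher: &dyn LanguageDispatcher) -> usize {
    fenced_blocks(markdown)
        .iter()
        .map(|(lang, code)| dispatcher.dispatch(lang, code).unwrap_or(0))
        .sum()
}
```

Blocks in a language the dispatcher does not know contribute nothing in this sketch, so the Markdown side never needs per-language logic of its own.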

ANTLR-backed analyzers

Kotlin and Java run through ANTLR v4 grammars that are generated to Rust ahead of time and checked in, so a normal cargo build never needs Java or the ANTLR jar. The shared mehen-antlr crate handles the concerns that every ANTLR analyzer shares and that tree-sitter and the language-specific parsers handle differently:

Span conversion. ANTLR reports positions in character offsets; mehen’s spans are byte offsets. mehen-antlr converts once, centrally.
Comments live on a hidden channel. They are absent from the parse tree, so CLOC is recovered from the token stream rather than by walking nodes.
No parent pointers. ANTLR rule contexts cannot look upward, so analyzers thread parent-dependent context (e.g. else-if detection) top-down as they descend the tree.
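Two of the points above lend themselves to a short sketch: converting character offsets to byte offsets, and threading parent context downward when nodes cannot look up. The code assumes that "character offset" means a count of Unicode scalar values over UTF-8 text; the Node type and both function names are hypothetical and do not reflect the mehen-antlr API.

```rust
/// Converts a character offset into a byte offset within UTF-8 `text`.
/// An offset equal to the character count maps to `text.len()` (the end
/// of the text); anything larger yields `None`.
pub fn char_to_byte_offset(text: &str, char_offset: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_index, _)| byte_index)
        .chain(std::iter::once(text.len()))
        .nth(char_offset)
}

/// Converts a half-open character span into a byte span.
pub fn char_span_to_byte_span(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    Some((
        char_to_byte_offset(text, start)?,
        char_to_byte_offset(text, end)?,
    ))
}

/// Toy tree without parent pointers.
pub enum Node {
    If {
        then_branch: Vec<Node>,
        else_branch: Option<Box<Node>>,
    },
    Block(Vec<Node>),
    Stmt,
}

/// Counts `else if` links. A node cannot ask its parent whether it sits
/// in an else position, so the caller passes that fact down as a flag.
pub fn count_else_ifs(node: &Node, is_else_of_parent: bool) -> usize {
    match node {
        Node::If { then_branch, else_branch } => {
            let here = usize::from(is_else_of_parent);
            let in_then: usize = then_branch
                .iter()
                .map(|child| count_else_ifs(child, false))
                .sum();
            let in_else = else_branch
                .as_deref()
                .map_or(0, |child| count_else_ifs(child, true));
            here + in_then + in_else
        }
        // A block resets the context: its children are not in else position.
        Node::Block(children) => children
            .iter()
            .map(|child| count_else_ifs(child, false))
            .sum(),
        Node::Stmt => 0,
    }
}
```

In "héllo" the character é occupies two bytes, so character offset 2 corresponds to byte offset 3. A real conversion layer would likely precompute an offset table once per file rather than rescanning the text for every span.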

Despite those differences, the analyzers still return the same parser-neutral LanguageAnalysis as every other backend. See Add a new language → ANTLR for the full flow.

SQL is its own metric family

SQL does not fit the function/class-centric model that the source-code metrics assume — a declarative SELECT has no methods, classes, or imperative branches to count. mehen-sql therefore publishes a dedicated sql.* metric namespace (CTE graphs, join/subquery structure, object-touch risk, SQL Halstead, composite scores) instead of the shared code families. The sqruff CST is confined to the crate’s facts.rs behind a parser-neutral SqlFileFacts adapter, so the rest of the pipeline never sees a sqruff type. Dialect resolution (directive-or-inference, validated before sqruff ever runs) also lives inside the crate. See SQL metrics.
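The adapter boundary can be sketched roughly as below. All names other than SqlFileFacts are invented, including the fields, the metric keys, and the raw module; a naive word splitter stands in for the sqruff CST and would miscount keywords inside string literals or comments.

```rust
use std::collections::BTreeMap;

/// Parser-neutral facts about one SQL file. (Hypothetical field set.)
#[derive(Debug, Default, PartialEq)]
pub struct SqlFileFacts {
    pub select_count: usize,
    pub join_count: usize,
}

/// Stand-in for the parser-specific layer. Nothing outside this file's
/// adapter function touches these types.
mod raw {
    pub(super) struct RawToken(pub(super) String);

    /// Naive tokenizer: splits on anything that is not a word character.
    pub(super) fn tokenize(sql: &str) -> Vec<RawToken> {
        sql.split(|c: char| !(c.is_alphanumeric() || c == '_'))
            .filter(|word| !word.is_empty())
            .map(|word| RawToken(word.to_ascii_uppercase()))
            .collect()
    }
}

/// The adapter: the only place where raw parser output becomes facts.
pub fn extract_facts(sql: &str) -> SqlFileFacts {
    let tokens = raw::tokenize(sql);
    SqlFileFacts {
        select_count: tokens.iter().filter(|t| t.0 == "SELECT").count(),
        join_count: tokens.iter().filter(|t| t.0 == "JOIN").count(),
    }
}

/// Publishes facts under a dedicated namespace. (Hypothetical metric keys.)
pub fn publish(facts: &SqlFileFacts) -> BTreeMap<String, f64> {
    let mut metrics = BTreeMap::new();
    metrics.insert("sql.example.select_count".to_string(), facts.select_count as f64);
    metrics.insert("sql.example.join_count".to_string(), facts.join_count as f64);
    metrics
}
```

The design point is the visibility boundary: swapping the parser in this sketch would change only the raw module and extract_facts, while publish and everything downstream keep working against SqlFileFacts.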
