Language identification happens once per Markdown block so prose-metric dispatch can choose the correct locale pipeline.
Tier 0 default
Zero-dependency Unicode-block heuristic. For the English/Japanese split, Unicode-block ratios outperform trigram language models on short inputs: Chinese has no hiragana/katakana, so any kana in a block separates Japanese from Chinese even though both use Han characters.

Opt-in trigram classifiers
Behind Cargo features:

| Feature | Library | Notes |
|---|---|---|
| `whatlang` | whatlang-rs | Pure Rust, 70 languages, MIT, reliable above ~120 characters. |
| `lingua` | lingua-rs | Highest accuracy in published benchmarks; restricted to [English, Japanese] for binary size. |
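
How an opt-in classifier might be gated can be sketched with conditional compilation. This is only an illustration: it assumes the Cargo feature is named `whatlang` as in the table and that the `whatlang` crate is declared as an optional dependency. The names `detect_trigram`, `trigram_backend`, and the local `Lang` enum are hypothetical and not taken from the project.

```rust
/// Languages this sketch distinguishes (hypothetical local type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Lang {
    English,
    Japanese,
}

/// Compiled only when the `whatlang` feature is enabled.
/// Uses whatlang's `detect`, `Info::is_reliable`, `Info::lang`, and `Info::confidence`.
#[cfg(feature = "whatlang")]
pub fn detect_trigram(text: &str) -> Option<(Lang, f64)> {
    let info = whatlang::detect(text)?;
    // Discard results that whatlang itself does not consider reliable.
    if !info.is_reliable() {
        return None;
    }
    match info.lang() {
        whatlang::Lang::Eng => Some((Lang::English, info.confidence())),
        whatlang::Lang::Jpn => Some((Lang::Japanese, info.confidence())),
        // Any other language is outside the English/Japanese split.
        _ => None,
    }
}

/// Stub used when the feature is off: no trigram result is available.
#[cfg(not(feature = "whatlang"))]
pub fn detect_trigram(_text: &str) -> Option<(Lang, f64)> {
    None
}

/// Reports which trigram backend was compiled in.
pub fn trigram_backend() -> &'static str {
    if cfg!(feature = "whatlang") {
        "whatlang"
    } else {
        "none"
    }
}
```

With the feature disabled, the stub keeps the build free of extra dependencies and a caller would rely on the Tier 0 heuristic alone.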
Tagging rules
- A block inherits its parent heading’s language when its own signal is inconclusive.
- Code fences, inline code, link targets, image targets, front matter, and HTML are tagged `none` and excluded from prose metrics.
- A document with both English and Japanese blocks is labelled `mixed` at the document level, but each block keeps its own tag for metric routing.
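
The rules above can be sketched as a single pass over the blocks in document order. The sketch makes several assumptions: the "parent heading" is approximated as the most recent heading with a conclusive language (heading nesting is ignored), Rust's `None` stands in for the `none` tag, and `Block`, `DocLabel`, and `resolve` are hypothetical names.

```rust
/// Languages a block can resolve to; `Option::None` plays the role of `none`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Lang {
    English,
    Japanese,
}

/// Document-level label derived from the block tags.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DocLabel {
    Single(Lang),
    Mixed,
    None,
}

/// Hypothetical per-block summary fed into the resolver.
pub struct Block {
    pub is_heading: bool,
    /// True for code fences, front matter, HTML, and similar non-prose.
    pub excluded: bool,
    /// The detector's own verdict; `None` when the signal is inconclusive.
    pub detected: Option<Lang>,
}

pub fn resolve(blocks: &[Block]) -> (Vec<Option<Lang>>, DocLabel) {
    let mut heading_lang: Option<Lang> = None;
    let mut tags = Vec::with_capacity(blocks.len());
    for b in blocks {
        // Excluded blocks are always `none`; otherwise the block's own
        // signal wins and the heading language is the fallback.
        let tag = if b.excluded { None } else { b.detected.or(heading_lang) };
        if b.is_heading && tag.is_some() {
            heading_lang = tag;
        }
        tags.push(tag);
    }
    // The document label is computed separately; block tags are left untouched.
    let has_en = tags.contains(&Some(Lang::English));
    let has_ja = tags.contains(&Some(Lang::Japanese));
    let label = match (has_en, has_ja) {
        (true, true) => DocLabel::Mixed,
        (true, false) => DocLabel::Single(Lang::English),
        (false, true) => DocLabel::Single(Lang::Japanese),
        (false, false) => DocLabel::None,
    };
    (tags, label)
}
```

Given a Japanese heading, an inconclusive paragraph, a code fence, and an English paragraph, this sketch tags the blocks Japanese, Japanese, `none`, and English, and labels the document `Mixed`.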
Output
Every block gets a `(range, language, confidence)` tuple. Metric dispatch reads the language tag to decide which pipeline runs.
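
To make the tuple concrete, here is a minimal sketch of a Tier 0-style classifier that emits one per block. The decision rule is an assumption, not the project's actual thresholds: Han characters count toward Japanese only when kana is present, ties go to English, and confidence is simply the share of counted characters backing the tag. `BlockTag` and `tag_block` are hypothetical names.

```rust
use std::ops::Range;

/// Block-level tag; `None` mirrors the `none` tag.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Lang {
    English,
    Japanese,
    None,
}

/// Hypothetical shape of the per-block output tuple.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockTag {
    pub range: Range<usize>, // byte range of the block in the source
    pub language: Lang,
    pub confidence: f32, // 0.0..=1.0
}

/// Hiragana (U+3040..=U+309F) and Katakana (U+30A0..=U+30FF).
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{30FF}')
}

/// CJK Unified Ideographs (U+4E00..=U+9FFF), shared by Japanese and Chinese.
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}')
}

pub fn tag_block(text: &str, range: Range<usize>) -> BlockTag {
    let (mut latin, mut kana, mut han) = (0usize, 0usize, 0usize);
    for c in text.chars() {
        if c.is_ascii_alphabetic() {
            latin += 1;
        } else if is_kana(c) {
            kana += 1;
        } else if is_han(c) {
            han += 1;
        }
    }
    let total = latin + kana + han;
    // Han alone could be Chinese, so it only supports Japanese next to kana.
    let japanese = if kana > 0 { kana + han } else { 0 };

    let (language, confidence) = if japanese > latin {
        (Lang::Japanese, japanese as f32 / total as f32)
    } else if latin > 0 {
        (Lang::English, latin as f32 / total as f32)
    } else {
        // No letters at all, or Han-only text: inconclusive.
        (Lang::None, 0.0)
    };
    BlockTag { range, language, confidence }
}
```

For example, `tag_block("これは日本語です", 0..24)` yields `Lang::Japanese` with confidence 1.0, while Han-only text such as `"中文文本"` comes back as `Lang::None`.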
References
- Cavnar, W. B. & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proc. SDAIR-94. The trigram language-identification approach used by `whatlang` and `lingua`.
- Brown, R. D. (2013). Selecting and weighting n-grams to identify 1100 languages. Proc. TSD. Modern accuracy benchmarks for trigram LID.
- Whatlang Rust crate.
- Lingua Rust crate.