Language identification happens once per Markdown block so prose-metric dispatch can choose the correct locale pipeline.

Tier 0 default

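A zero-dependency heuristic based on Unicode-block ratios. For the English/Japanese split, block ratios outperform trigram language models on short inputs: kana appears only in Japanese, and because Chinese has no hiragana or katakana, Han-only text can be separated from Japanese without a statistical model: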
let total = non_whitespace_non_punct_chars
let kana  = hiragana_chars + katakana_chars
let cjk   = kana + han_chars
let latin = ascii_letter_chars + fullwidth_latin_letter_chars

if kana / total >= 0.15:                 language = ja
elif cjk / total >= 0.40 and kana == 0:  language = zh  (treated as "other")
elif latin / total >= 0.80:              language = en
else:                                    language = other
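
The pseudocode above can be sketched in Rust using only the standard library. This is a minimal illustration and not the project's actual implementation: the names `Lang` and `classify` are hypothetical, "non-whitespace, non-punctuation" is approximated with `char::is_alphanumeric` (so digits count toward the total while symbols and emoji do not), Han detection covers only the basic CJK Unified Ideographs block, and an empty input is assumed to fall through to `Other`.

```rust
/// Language tag produced by the heuristic (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lang {
    En,
    Ja,
    Other,
}

fn is_kana(c: char) -> bool {
    // Hiragana block (U+3040..U+309F) and Katakana block (U+30A0..U+30FF).
    matches!(c, '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}')
}

fn is_han(c: char) -> bool {
    // Assumption: basic CJK Unified Ideographs only; extension blocks are ignored.
    matches!(c, '\u{4E00}'..='\u{9FFF}')
}

fn is_latin(c: char) -> bool {
    // ASCII letters plus fullwidth Latin letters (U+FF21..U+FF3A, U+FF41..U+FF5A).
    c.is_ascii_alphabetic() || matches!(c, '\u{FF21}'..='\u{FF3A}' | '\u{FF41}'..='\u{FF5A}')
}

pub fn classify(text: &str) -> Lang {
    let (mut total, mut kana, mut han, mut latin) = (0usize, 0usize, 0usize, 0usize);

    // Assumption: alphanumeric characters stand in for
    // "non-whitespace, non-punctuation" characters.
    for c in text.chars().filter(|c| c.is_alphanumeric()) {
        total += 1;
        if is_kana(c) {
            kana += 1;
        } else if is_han(c) {
            han += 1;
        } else if is_latin(c) {
            latin += 1;
        }
    }

    // Assumption: a block with no countable characters is inconclusive.
    if total == 0 {
        return Lang::Other;
    }

    let total = total as f64;
    let cjk = (kana + han) as f64;

    if kana as f64 / total >= 0.15 {
        Lang::Ja
    } else if cjk / total >= 0.40 && kana == 0 {
        Lang::Other // zh, treated as "other"
    } else if latin as f64 / total >= 0.80 {
        Lang::En
    } else {
        Lang::Other
    }
}
```

Under these assumptions, an all-ASCII sentence classifies as `Lang::En`, a sentence mixing kana and kanji as `Lang::Ja`, and Han-only text as `Lang::Other`.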

Opt-in trigram classifiers

Behind Cargo features:
| Feature | Library | Notes |
| --- | --- | --- |
| whatlang | whatlang-rs | Pure Rust, 70 languages, MIT, reliable above ~120 characters. |
| lingua | lingua-rs | Highest accuracy in published benchmarks; restricted to [English, Japanese] for binary size. |

Tagging rules

  • A block inherits its parent heading’s language when its own signal is inconclusive.
  • Code fences, inline code, link targets, image targets, front matter, and HTML are tagged none and excluded from prose metrics.
  • A document with both English and Japanese blocks is labelled mixed at the document level, but each block keeps its own tag for metric routing.
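
The three rules above could be combined roughly as follows. This is a sketch under stated assumptions, not the tool's real data model: `Tag`, `BlockKind`, `Block`, `DocLabel`, `tag_blocks`, and `document_label` are all hypothetical names, the "parent heading" is simplified to the nearest preceding heading regardless of level, an inconclusive block with no heading to inherit from falls back to `Other`, and a document without both English and Japanese blocks is given a single-language label.

```rust
/// Per-block language tag (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    En,
    Ja,
    Other,
    None, // code fences, inline code, link targets, front matter, HTML
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockKind {
    Heading,
    Prose,
    NonProse,
}

/// `detected` is `Option::None` when the block's own signal is inconclusive.
pub struct Block {
    pub kind: BlockKind,
    pub detected: Option<Tag>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocLabel {
    Single(Tag),
    Mixed,
}

pub fn tag_blocks(blocks: &[Block]) -> Vec<Tag> {
    // Assumption: the nearest preceding heading acts as the parent heading.
    let mut heading_lang: Option<Tag> = None;
    let mut tags = Vec::with_capacity(blocks.len());

    for block in blocks {
        let tag = match block.kind {
            // Non-prose content is excluded from prose metrics.
            BlockKind::NonProse => Tag::None,
            BlockKind::Heading => {
                // An inconclusive heading keeps the language of the previous heading.
                heading_lang = block.detected.or(heading_lang);
                heading_lang.unwrap_or(Tag::Other)
            }
            // Inherit from the heading only when the block's own signal is inconclusive.
            BlockKind::Prose => block.detected.or(heading_lang).unwrap_or(Tag::Other),
        };
        tags.push(tag);
    }
    tags
}

pub fn document_label(tags: &[Tag]) -> DocLabel {
    match (tags.contains(&Tag::En), tags.contains(&Tag::Ja)) {
        (true, true) => DocLabel::Mixed,
        (true, false) => DocLabel::Single(Tag::En),
        (false, true) => DocLabel::Single(Tag::Ja),
        // Assumption: documents with neither language are labelled "other".
        (false, false) => DocLabel::Single(Tag::Other),
    }
}
```

Note that `document_label` only summarizes the per-block tags; it never overwrites them, so metric routing still sees each block's own language.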

Output

Every block gets a (range, language, confidence) tuple. Metric dispatch reads the language tag to decide which pipeline runs.
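
As an illustration of the output shape, the tuple and the dispatch step might look like the sketch below. The type and function names (`Language`, `BlockTag`, `Pipeline`, `dispatch`) are hypothetical, the range is assumed to be a byte range into the source document, and skipping blocks tagged other or none is an assumption about how unsupported languages are handled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    En,
    Ja,
    Other,
    None,
}

/// One (range, language, confidence) tuple per block.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockTag {
    /// Assumption: byte offsets of the block within the document.
    pub range: std::ops::Range<usize>,
    pub language: Language,
    /// Assumption: a score in the interval [0.0, 1.0].
    pub confidence: f32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pipeline {
    English,
    Japanese,
    Skip,
}

/// Chooses a prose-metric pipeline from the language tag alone.
pub fn dispatch(tag: &BlockTag) -> Pipeline {
    match tag.language {
        Language::En => Pipeline::English,
        Language::Ja => Pipeline::Japanese,
        // Assumption: no locale pipeline runs for other or non-prose blocks.
        Language::Other | Language::None => Pipeline::Skip,
    }
}
```

In this sketch the confidence value travels with the tag for downstream consumers, but it does not influence routing, matching the statement that dispatch reads the language tag.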

References

  • Cavnar, W. B. & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proc. SDAIR-94. Describes the trigram language-identification approach used by whatlang and lingua.
  • Brown, R. D. (2013). Selecting and weighting n-grams to identify 1100 languages. Proc. TSD. Provides modern accuracy benchmarks for trigram LID.
  • Whatlang Rust crate.
  • Lingua Rust crate.
