Block-level language detection

Language identification happens once per Markdown block so prose-metric dispatch can choose the correct locale pipeline.

Tier 0 default

The default is a zero-dependency heuristic over Unicode blocks. For the English/Japanese split, block ratios outperform trigram language models on short inputs, and kana make Japanese easy to separate from Chinese, which uses Han characters but no hiragana or katakana:

let total = non_whitespace_non_punct_chars
let kana  = hiragana_chars + katakana_chars
let cjk   = kana + han_chars
let latin = ascii_letter_chars + fullwidth_latin_letter_chars

if kana / total >= 0.15:                 language = ja
elif cjk / total >= 0.40 and kana == 0:  language = zh  (treated as "other")
elif latin / total >= 0.80:              language = en
else:                                    language = other
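The rules above can be sketched in Rust as follows. This is a minimal illustration, not the project's actual implementation: the names `Language`, `detect_tier0`, `is_kana`, `is_han`, and `is_latin` are hypothetical, `char::is_alphanumeric` is used as an approximation of "non-whitespace, non-punctuation", the Han check covers only the main CJK Unified Ideographs block and Extension A, and a block with no countable characters is assumed to fall through to `Other`.

```rust
/// Block-level language tag; `zh` is folded into `Other`, as in the rules above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    En,
    Ja,
    Other,
}

fn is_kana(c: char) -> bool {
    // Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) blocks.
    matches!(c, '\u{3040}'..='\u{30FF}')
}

fn is_han(c: char) -> bool {
    // CJK Unified Ideographs plus Extension A; other extensions omitted for brevity.
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

fn is_latin(c: char) -> bool {
    // ASCII letters plus fullwidth A-Z (U+FF21..U+FF3A) and a-z (U+FF41..U+FF5A).
    c.is_ascii_alphabetic() || matches!(c, '\u{FF21}'..='\u{FF3A}' | '\u{FF41}'..='\u{FF5A}')
}

pub fn detect_tier0(text: &str) -> Language {
    let (mut total, mut kana, mut han, mut latin) = (0usize, 0usize, 0usize, 0usize);
    // Assumption: `is_alphanumeric` approximates "non-whitespace, non-punctuation".
    for c in text.chars().filter(|c| c.is_alphanumeric()) {
        total += 1;
        if is_kana(c) {
            kana += 1;
        } else if is_han(c) {
            han += 1;
        } else if is_latin(c) {
            latin += 1;
        }
    }
    // Assumption: a block with no countable characters is tagged `Other`.
    if total == 0 {
        return Language::Other;
    }
    let cjk = kana + han;
    // Integer comparisons equivalent to the ratio thresholds 0.15, 0.40, and 0.80.
    if kana * 100 >= total * 15 {
        Language::Ja
    } else if cjk * 100 >= total * 40 && kana == 0 {
        Language::Other // zh, treated as "other"
    } else if latin * 100 >= total * 80 {
        Language::En
    } else {
        Language::Other
    }
}
```

Comparing scaled integers avoids floating-point division and makes the thresholds exact; the empty-block guard sidesteps the division by zero that the ratio form would otherwise hit.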

Opt-in trigram classifiers

Statistical classifiers are available as opt-in Cargo features:

Feature	Library	Notes
`whatlang`	whatlang-rs	Pure Rust, 70 languages, MIT, reliable above ~120 characters.
`lingua`	lingua-rs	Highest accuracy in published benchmarks; restricted to `[English, Japanese]` for binary size.

Tagging rules

A block inherits its parent heading’s language when its own signal is inconclusive.
Code fences, inline code, link targets, image targets, front matter, and HTML are tagged `none` and excluded from prose metrics.
A document with both English and Japanese blocks is labelled `mixed` at the document level, but each block keeps its own tag for metric routing.
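A rough sketch of how these rules might compose is shown below. All names (`Tag`, `DocLabel`, `resolve`, `document_label`) are hypothetical. The text does not define "inconclusive", so it is modelled here as an `Option` that is empty when the block's own signal is too weak; the fallback to `Other` when no heading is available, and the document labels for single-language documents, are likewise assumptions.

```rust
/// Per-block tag. `Excluded` stands for the `none` tag (code, links, front matter, HTML).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    En,
    Ja,
    Other,
    Excluded,
}

/// Document-level label derived from the block tags.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocLabel {
    En,
    Ja,
    Mixed,
    Other,
}

/// Use the block's own tag when it is conclusive, else inherit from the parent heading.
/// Assumption: with no heading to inherit from, the block is tagged `Other`.
pub fn resolve(own: Option<Tag>, parent_heading: Option<Tag>) -> Tag {
    own.or(parent_heading).unwrap_or(Tag::Other)
}

/// Label the document `Mixed` when both English and Japanese blocks are present.
/// Block tags themselves are left untouched, so metric routing stays per block.
pub fn document_label(blocks: &[Tag]) -> DocLabel {
    let has_en = blocks.contains(&Tag::En);
    let has_ja = blocks.contains(&Tag::Ja);
    match (has_en, has_ja) {
        (true, true) => DocLabel::Mixed,
        (true, false) => DocLabel::En,
        (false, true) => DocLabel::Ja,
        (false, false) => DocLabel::Other,
    }
}
```

Excluded blocks never influence the document label in this sketch, which matches their exclusion from prose metrics.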

Output

Every block gets a (range, language, confidence) tuple. Metric dispatch reads the language tag to decide which pipeline runs.
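One possible shape for this tuple and the dispatch step, sketched under stated assumptions: the type and function names (`BlockTag`, `Pipeline`, `pipeline_for`) are hypothetical, the range is assumed to be a byte range into the source document, the confidence is assumed to lie in 0.0..=1.0, and skipping blocks tagged `other` is an assumption rather than something the text specifies.

```rust
use std::ops::Range;

/// Per-block language tag. `Excluded` stands for the `none` tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    En,
    Ja,
    Other,
    Excluded,
}

/// The (range, language, confidence) tuple as a named struct.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockTag {
    /// Assumption: byte offsets into the source document.
    pub range: Range<usize>,
    pub language: Language,
    /// Assumption: a score in 0.0..=1.0.
    pub confidence: f32,
}

/// Which locale pipeline should process a block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pipeline {
    English,
    Japanese,
    Skip,
}

/// Route a block to a pipeline based only on its language tag.
pub fn pipeline_for(tag: &BlockTag) -> Pipeline {
    match tag.language {
        Language::En => Pipeline::English,
        Language::Ja => Pipeline::Japanese,
        // Assumption: blocks with no supported locale run no prose metrics.
        Language::Other | Language::Excluded => Pipeline::Skip,
    }
}
```

Dispatch here ignores the confidence value; a real router might also apply a confidence floor before choosing a pipeline.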

References

Cavnar, W. B. & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proc. SDAIR-94. The n-gram approach to language identification that whatlang and lingua build on.
Brown, R. D. (2013). Selecting and weighting n-grams to identify 1100 languages. Proc. TSD. Modern accuracy benchmarks for n-gram language identification.
Whatlang Rust crate.
Lingua Rust crate.
