It really works. Very cool. I've been looking for this kind of service for a long time since I started learning Japanese, and I've rarely been satisfied with the available services.
I built a context-aware furigana converter for Japanese text, files, and web pages.
The main problem I wanted to solve was that simple dictionary-based furigana works well for common cases, but breaks on words where the reading depends on context:
* 市場: いちば or しじょう
* 大分: おおいた or だいぶ
* 人気: にんき or ひとけ
* 最中: さいちゅう or もなか or さなか
* 方: かた or ほう
The engine is a hybrid system:
* Sudachi for tokenization, base forms, POS, and candidate readings
* Expanded dictionary coverage for compounds and fixed expressions
* Custom rules for counters, suffixes, rendaku patterns, and phrase overrides
* ModernBERT fallback for 144 especially context-dependent target words
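To make the layering concrete, here is a minimal Python sketch of how the stages could be chained. It is not the actual engine code. `Token`, `PHRASE_OVERRIDES`, `CONTEXT_TARGETS`, `toy_context_model`, and `resolve_reading` are all hypothetical names, the tokens are built by hand instead of coming from Sudachi, and the keyword-based "model" only stands in for a real classifier such as ModernBERT. The ordering (phrase override first, then context model, then first dictionary candidate) is an assumption too.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Token:
    surface: str
    candidates: tuple[str, ...]  # candidate readings, dictionary default first


# Hypothetical fixed-expression overrides keyed by (previous surface, surface).
PHRASE_OVERRIDES = {
    ("魚", "市場"): "いちば",
}

# Hypothetical subset of words that get routed to the context model.
CONTEXT_TARGETS = {"市場", "大分", "人気", "最中", "方"}

ContextModel = Callable[[Sequence[Token], int], Optional[str]]


def toy_context_model(tokens: Sequence[Token], i: int) -> Optional[str]:
    """Stand-in for a real classifier: looks for cue words near token i."""
    window = {t.surface for t in tokens[max(0, i - 3): i + 4]}
    if tokens[i].surface == "市場" and window & {"株式", "金融", "経済"}:
        return "しじょう"
    return None  # no opinion


def resolve_reading(
    tokens: Sequence[Token], i: int, model: ContextModel = toy_context_model
) -> str:
    tok = tokens[i]
    prev = tokens[i - 1].surface if i > 0 else ""
    # 1. Fixed expressions win outright.
    override = PHRASE_OVERRIDES.get((prev, tok.surface))
    if override is not None:
        return override
    # 2. Ambiguous target words go to the context model, if it has an opinion.
    if tok.surface in CONTEXT_TARGETS and len(tok.candidates) > 1:
        guess = model(tokens, i)
        if guess in tok.candidates:
            return guess
    # 3. Otherwise fall back to the first dictionary candidate.
    return tok.candidates[0]


if __name__ == "__main__":
    stock = [Token("株式", ("かぶしき",)), Token("市場", ("いちば", "しじょう"))]
    print(resolve_reading(stock, 1))
```

With these hand-built tokens, 市場 after 株式 resolves through the context step, 市場 after 魚 resolves through the phrase override, and a bare 市場 falls through to its first candidate.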
I have been testing it against an LLM-assisted benchmark of 7,500 Japanese lines. On the current benchmark, it gets about 12 wrong readings per 1,000 tokens. I treat that as a practical regression benchmark rather than a formal academic evaluation, but it has been useful for comparing versions and catching regressions.
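For anyone who wants to track the same kind of number, the metric is just wrong readings scaled to 1,000 tokens. Here is a small Python sketch. The function names, the token counts in the example, and the `tolerance` threshold are made up for illustration and are not the real benchmark figures or tooling.

```python
def errors_per_thousand(wrong: int, total: int) -> float:
    """Wrong readings per 1,000 tokens."""
    if total <= 0:
        raise ValueError("total must be positive")
    if wrong < 0 or wrong > total:
        raise ValueError("wrong must be between 0 and total")
    return 1000 * wrong / total


def is_regression(baseline: float, candidate: float, tolerance: float = 0.5) -> bool:
    """Flag a version whose error rate exceeds the baseline by more than tolerance."""
    return candidate > baseline + tolerance


if __name__ == "__main__":
    # Illustrative counts only: 120 wrong readings out of 10,000 tokens.
    old = errors_per_thousand(120, 10_000)
    new = errors_per_thousand(131, 10_000)
    print(old, new, is_regression(old, new))
```

A rate of 12 per 1,000 corresponds to roughly 1.2% of tokens getting a wrong reading.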
The hardest remaining cases are personal names, place names, rendaku, rare vocabulary, and domain-specific terms.
I would especially appreciate examples where it gets the reading wrong, since those are the most useful for improving the system.
Nice work, just gave a quick pass but seems to work well!
(Also: vouched, your comment was dead FYI)
Thanks, that's great to hear. Thanks for the vouch too, I didn't realize the comment was dead.