I am having trouble creating a hyphenation dictionary for the Khmer language with patgen. I keep getting errors like "Bad representation" and "Bad character", but I am not sure what I am doing wrong. The Khmer text is encoded in UTF-8.
Part of my khmer.dic (I tried UTF-8 on the first line but that didn't help):
ខិត-ខំ
ប្រឹង-ប្រែង
យក-ចិត្ត-ទុក-ដាក់
ព្រះយេស៊ូវ-គ្រីស្ទ
កណ្ឌ-គម្ពីរ
សញ្ញា-ថ្មី
I'm not sure what to use for the translation file, though. I've seen this tutorial as well as read this and this but I still can't figure out what to do. Can anyone give me more specific instructions?
Khmer doesn't have upper or lower case (each letter has only one form), so I am not sure what to put in the translation file (khmer.tra). Should I include the entire Khmer alphabet? Here's what I have now:
2 3
ក
ខ
គ
ឃ
ង
ច
ឆ
ជ
ឈ
ញ
ដ
ឋ
ឌ
ឍ
ណ
ត
ថ
ទ
ធ
ន
ប
ផ
ព
ភ
ម
យ
រ
ល
វ
ឝ
ឞ
ស
ហ
ឡ
អ
ា
ិ
ី
ឹ
ឺ
ុ
ូ
ួ
ើ
ឿ
ៀ
េ
ែ
ៃ
ោ
ៅ
ំ
ា
ះ
ឥ
ឦ
ឧ
ឨ
ឩ
ឪ
ឫ
ឬ
ឭ
ឮ
ឯ
ឰ
ឱ
ឳ
ឲ
្
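One way to sanity-check a letter list like the one above against the dictionary is to compare the two mechanically. The following is only a minimal Python sketch, not part of patgen: the function name check_translation and the sample data are hypothetical, and it assumes the words and letters have already been read into Python strings. It reports characters that occur in the hyphenated words but not in the letter list, and letters that are listed more than once.

```python
from collections import Counter


def check_translation(dic_words, tra_letters):
    """Compare dictionary words with a list of translation-file letters.

    Returns (missing, duplicated):
      missing    - characters used in the words but absent from the list
      duplicated - letters that appear more than once in the list
    The hyphen is skipped because it marks break points, not letters.
    """
    used = {ch for word in dic_words for ch in word if ch != "-"}
    counts = Counter(tra_letters)
    missing = sorted(used - set(tra_letters))
    duplicated = sorted(ch for ch, n in counts.items() if n > 1)
    return missing, duplicated


if __name__ == "__main__":
    # Hypothetical sample data; a real check would read khmer.dic and
    # khmer.tra with open(..., encoding="utf-8") instead.
    words = ["ខិត-ខំ"]
    letters = ["ខ", "ិ", "ត", "ត"]
    missing, duplicated = check_translation(words, letters)
    # Print code points so combining signs are easy to tell apart.
    print("missing:", [f"U+{ord(ch):04X}" for ch in missing])
    print("duplicated:", [f"U+{ord(ch):04X}" for ch in duplicated])
```

Whether a missing or repeated letter is what triggers the errors in this case is not something the sketch can tell; it only narrows down where the two files disagree.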
I am using the following command on Ubuntu:
patgen khmer.dic khmer.pat khmer.log khmer.tra
with these parameters (even though I don't fully understand what they are for):
hyph_start: 1
hyph_finish: 2
pat_start: 2
pat_finish: 4
good weight: 1
bad weight: 1
threshold: 1
i doubt that patgen would be able to support utf-8 without modification. consider this: patgen was created in the early 1980s. utf-8 didn't appear until at least 1990. this mail message may or may not be accurate, but the dates make sense, and the message itself makes for amusing reading: https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt – barbara beeton Oct 08 '14 at 16:00
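To make the encoding point concrete, here is a small Python sketch (Python is just a convenient choice for illustration; the helper name utf8_bytes is hypothetical). It shows that each Khmer character occupies three bytes in UTF-8, so a program that treats every byte as one character would see three unfamiliar 8-bit values where a reader sees one letter. How patgen itself handles those bytes is not shown here.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of hex byte strings."""
    return [f"{b:02X}" for b in text.encode("utf-8")]


if __name__ == "__main__":
    word = "\u1781\u17B7\u178F"  # the three code points of the syllable ខិត
    print(len(word))                  # number of Unicode code points
    print(len(word.encode("utf-8")))  # number of bytes in UTF-8
    for ch in word:
        print(f"U+{ord(ch):04X}", utf8_bytes(ch))
```

The first character, U+1781, for example, is encoded as the byte sequence E1 9E 81.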