I have many plain text files that were encoded in various charsets.
I want to convert them all to UTF-8, but before running iconv I need to know each file's original encoding. Most browsers have an Auto Detect option under their encoding menus, but I can't check these text files one by one because there are too many.
Only after knowing the original encoding can I convert the texts with iconv -f DETECTED_CHARSET -t utf-8.
Is there any utility to detect the encoding of plain text files? It DOES NOT have to be 100% perfect; I don't mind if 100 files out of 1,000,000 are misconverted.
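For context, here is a minimal sketch of the detect-then-convert loop I have in mind, assuming the chardetect tool from python-chardet is installed. The *.txt glob, the utf8/ output directory, and the awk-based parsing of chardetect's output are my own assumptions and may need adjusting for other chardet versions:

    #!/bin/sh
    # Sketch: detect each file's charset with chardetect, then convert to UTF-8.
    # Assumes chardetect prints lines like "file.txt: GB2312 with confidence 0.99";
    # the second field is then the detected encoding. Note that not every name
    # chardet reports is guaranteed to be accepted by iconv.
    mkdir -p utf8
    for f in *.txt; do
        enc=$(chardetect "$f" | awk '{print $2}')
        if [ -n "$enc" ] && [ "$enc" != "None" ]; then
            iconv -f "$enc" -t UTF-8 "$f" > "utf8/$f" || echo "conversion failed: $f" >&2
        else
            echo "could not detect encoding: $f" >&2
        fi
    done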
Comments:

python-chardet is in the Ubuntu universe repo. – Lenik Jun 25 '11 at 06:21
chardet will still give the most correct guess, e.g. ./a.txt: GB2312 (confidence: 0.99), whereas Enca simply failed and reported "Unrecognized encoding". Sadly, though, chardet runs very slowly. – Lenik Jun 25 '11 at 06:48
chardet <(head -c4000 filename.txt) was much faster and equally successful for my use case. (In case it's not clear, this bash syntax sends only the first 4000 bytes to chardet.) – ndemou Dec 26 '15 at 19:32
With chardet==3.0.4, the command line tool's actual executable name is chardetect, not chardet. – Devy Mar 26 '18 at 14:26
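Combining the last two comments, a hedged sketch of the faster variant: run chardetect (the current executable name) on only the first 4000 bytes of each file via bash process substitution. The *.txt glob and byte count are placeholders; since chardetect only sees a /dev/fd path, the real filename is printed separately:

    #!/bin/bash
    # Sketch: detect using only the first 4000 bytes of each file, which is
    # usually much faster on large files and, per the comment above, about
    # as accurate. <( ... ) is bash process substitution.
    for f in *.txt; do
        printf '%s: ' "$f"
        chardetect <(head -c4000 "$f")
    done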