
How can I re-encode a text file (or string) so that non-ASCII characters will be represented using Mathematica's standard names?

For example, if the original contents of the file were

xαy

then the re-encoded file should contain

x\[Alpha]y

Using the CharacterEncoding "Mathematica7" comes close, but it is too aggressive: it also encodes ASCII characters like x and y:

ExportString["xαy", "Text", CharacterEncoding -> "Mathematica7"]
(* "\\170\\[Alpha]\\171" *)

I do not want this. Update: Looking more closely, this is probably not the purpose of the "MathematicaN" character encodings.

I also do not want to change the contents of the file, only change its encoding. This means that re-wrapping lines, eliminating whitespace, changing new lines, etc. is not acceptable.


Why do I need this? I want to take a package file that is UTF-8 encoded and make it platform independent. How such a file gets read by Get depends on $CharacterEncoding, which may differ between computers.

Alexey Popkov
Szabolcs
  • What about saving it as "Package"? – Alexey Popkov Apr 30 '17 at 13:56
  • ExportString["xαy", "Package"] returns a string with x\[Alpha]y as the last line. Isn't that what you need? – Alexey Popkov Apr 30 '17 at 13:58
  • @AlexeyPopkov But it also does other things, such as insert quotation marks, convert \ to \\, convert any newline to \n, reformat for a certain width, etc. I do not want these. – Szabolcs Apr 30 '17 at 13:59
  • @AlexeyPopkov Given my use case, I guess I could simply import the old package wrapped in HoldComplete (with a controlled $CharacterEncoding), then re-export it (both as "Package"). But this also has undesirable side effects: it drops all comments and it destroys code formatting (making the code almost unreadable). I would prefer to only touch non-ASCII characters, and simply replace them by their Mathematica-name. – Szabolcs Apr 30 '17 at 14:01
  • Normally packages contain only ASCII characters. Where do the non-ASCII characters come from? – Alexey Popkov Apr 30 '17 at 14:02
  • According to Docs, "Package" (.m) is "Plain ASCII text format." – Alexey Popkov Apr 30 '17 at 14:03
  • @AlexeyPopkov You can type anything in an .m file. I typed non-ASCII characters. Others have done the same because on both OS X and Linux UTF-8 works just fine with Mathematica. On Windows, sometimes it does and sometimes it doesn't. It depends on the settings. There are .m files around with non-ASCII encodings. I do not want to go through the file and fix everything manually. I want Mathematica to do it for me. – Szabolcs Apr 30 '17 at 14:05
  • Then probably replacement with escaped versions would solve this: ExportString[StringReplace["xαy", "α" -> "\\[Alpha]"], "Text"]? – Alexey Popkov Apr 30 '17 at 14:08
  • Yes, but the list of named characters is long, and it is clear that Mathematica knows them, so why do I have to type in hundreds of replacement rules to make sure that nothing would ever get missed? – Szabolcs Apr 30 '17 at 14:10
  • What may work is opening the .m file using the FE and re-saving it. This can also be automated. – Szabolcs Apr 30 '17 at 14:13
  • Could this be of any help? – Greg Hurst May 01 '17 at 00:47
  • @ChipHurst Absolutely. If you post it as an answer, I'll accept it.

    fromCode[c_Integer] /; c < 160 := FromCharacterCode[c];
    fromCode[c_Integer] := "\\[" <> System`Private`LookupNameByCode[c] <> "]";

    StringJoin@ Map[fromCode]@ ToCharacterCode@ Import["myfile", "Text", CharacterEncoding -> "UTF-8"]

    – Szabolcs May 01 '17 at 08:27

3 Answers


Based off information from this thread, the following should work.

fromCode[c_Integer] /; c < 160 := FromCharacterCode[c];

fromCode[c_Integer] := "\\[" <> System`Private`LookupNameByCode[c] <> "]";

Test:

Export["myfile", "x\[Alpha]y", "Text"];

StringJoin[fromCode /@ ToCharacterCode[
  Import["myfile", "Text", CharacterEncoding -> "UTF-8"]]]
"x\\[Alpha]y"
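One way to sanity-check a result like this is to let Mathematica parse the re-encoded text back as a string literal and compare it with the original. This is only a sketch: it assumes the text contains no double quotes or backslashes of its own (those would need extra escaping before wrapping), and the variable names are made up for illustration.

```mathematica
(* original text, written with a named character so that this snippet
   does not itself depend on the source file's encoding *)
original = "x\[Alpha]y";

(* the re-encoded, ASCII-only text as produced above *)
encoded = "x\\[Alpha]y";

(* wrap in quotes and parse as a string literal; this only works when
   the text has no quotes or backslashes of its own *)
ToExpression["\"" <> encoded <> "\""] === original
```

For this input the comparison should evaluate to True, since the parser turns \[Alpha] back into α.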
Greg Hurst

Here is a robust version of fromCode which uses only well-documented functionality, and correctly handles extended-ASCII and Unicode characters with which the original version fails:

fromCode[c_Integer] /; c <= 127 := FromCharacterCode[c];
fromCode[c_Integer] := 
  StringTake[ToString[FromCharacterCode[c], InputForm, 
    CharacterEncoding -> "PrintableASCII"], {2, -2}];

Notes:

  1. The ASCII character set contains the characters with codes 0 through 127, so the upper bound is set to 127.

  2. When importing as "Text" we don't have to specify CharacterEncoding -> "UTF8" explicitly, since

    Import["file.txt"] reads a text file, taking the character encoding to be "UTF8" by default.

Testing:

Export["myfile", "xαy\nLamé \[LongRightArrow] αβ+", "Text"];

StringJoin[fromCode /@ ToCharacterCode[Import["myfile", "Text"]]]

"x\\[Alpha]y
Lam\\[EAcute] \\[LongRightArrow] \\[Alpha]\\[Beta]+"
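Since the question asks that newlines and whitespace stay untouched, it may be safer to bypass "Text" import/export and work on raw bytes. The following is a minimal sketch, not a tested tool: reencodeFile is a hypothetical helper name, the input is assumed to be valid UTF-8, and a leading byte order mark, if present, is not treated specially.

```mathematica
(* fromCode as defined above, repeated so the sketch is self-contained *)
fromCode[c_Integer] /; c <= 127 := FromCharacterCode[c];
fromCode[c_Integer] := 
  StringTake[ToString[FromCharacterCode[c], InputForm, 
    CharacterEncoding -> "PrintableASCII"], {2, -2}];

(* reencodeFile is a hypothetical name, not a built-in function *)
reencodeFile[in_String, out_String] := 
  Module[{bytes, text, ascii, stream},
   (* read raw bytes, so no newline translation takes place *)
   bytes = BinaryReadList[in];
   (* decode the bytes, assuming the file is UTF-8 *)
   text = FromCharacterCode[bytes, "UTF-8"];
   (* replace every non-ASCII character by its escaped form *)
   ascii = StringJoin[fromCode /@ ToCharacterCode[text]];
   (* write the result byte for byte; all codes are <= 127 here *)
   stream = OpenWrite[out, BinaryFormat -> True];
   BinaryWrite[stream, ToCharacterCode[ascii]];
   Close[stream]];

reencodeFile["myfile", "myfileInASCII"]
```

Because ASCII bytes pass through unchanged, line endings (LF or CR/LF) and indentation in the output should match the input exactly.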

Another approach is to use the ExportAsASCII function defined in this answer (it is not a built-in function), which should be much more efficient:

ExportAsASCII["myfileInASCII", Import["myfile", "Text"]]
Alexey Popkov

For the particular use case I am interested in (i.e. replacing non-ASCII characters in package files), one can simply open the file (.m or .wl) with the Front End and re-save it.

This can also be automated:

NotebookSave@NotebookOpen["mypackage.m"] (* Warning: this overwrites the file! *)

This method does insert (* ::Package:: *) at the beginning of the file, and it requires the system character encoding to be the same as that of the .m file. It may also change the newline style (LF vs. CR/LF). But these are relatively minor inconveniences.

The code formatting (indentation, etc.) is preserved.

I verified that nothing else is changed by diffing the end result with the original input file.
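In addition to diffing, one can check directly that the saved file no longer contains any non-ASCII bytes. This is a small sketch; asciiOnlyQ is a made-up helper name, and the file name is the one used in the example above.

```mathematica
(* asciiOnlyQ is a hypothetical helper, not a built-in function.
   It returns True when every byte of the file is in the ASCII range. *)
asciiOnlyQ[file_String] := AllTrue[BinaryReadList[file], # <= 127 &];

asciiOnlyQ["mypackage.m"]
```

A file that passes this check is read the same way by Get regardless of $CharacterEncoding, since ASCII bytes mean the same thing in all the common encodings.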

Szabolcs