Thanks to @David Carlisle's answer, I managed to extract all the Unicode code points that LaTeX knows how to print, and I wrote a simple Ruby function that keeps only those characters in a string and replaces every other character with a ? character.
### Not the prettiest code in the world, but it works for me.
### First we remove all invalid UTF-8 characters, then we remove
### all unprintable characters, and then we replace every character
### that LaTeX does not know how to render with a '?'.
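def utf8_tex(txt)
  begin
    # to_s turns nil into '' so the rest of the method always sees a String.
    out = txt.to_s.chars.select(&:valid_encoding?).join
    # Drop everything that is not printable (this includes tabs and newlines).
    out2 = out.gsub(/[^[:print:]]/, '')
  rescue
    puts "fail on remove invalid"
    # Fall back to an empty string so the final mapping step cannot hit nil.
    out2 = ''
  end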
tex_valid = [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219,
220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231,
232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243,
244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,
256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267,
268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279,
280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291,
292, 293, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305,
306, 307, 308, 309, 310, 311, 313, 314, 315, 316, 317, 318,
321, 322, 323, 324, 325, 326, 327, 328, 330, 331, 332, 333,
334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345,
346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357,
360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371,
372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 402,
461, 462, 463, 464, 465, 466, 467, 468, 482, 483, 486, 487,
488, 489, 490, 491, 496, 500, 501, 536, 537, 538, 539, 562,
563, 567, 710, 711, 732, 728, 729, 731, 733, 1024, 1025, 1026,
1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036,
1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046,
1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056,
1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066,
1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076,
1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086,
1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096,
1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106,
1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116,
1117, 1118, 1119, 1122, 1123, 1130, 1131, 1138, 1139, 1140,
1141, 1142, 1143, 1164, 1165, 1166, 1167, 1168, 1169, 1170,
1171, 1172, 1173, 1174, 1175, 1176, 1177, 1178, 1179, 1180,
1181, 1182, 1183, 1184, 1185, 1186, 1187, 1188, 1189, 1190,
1191, 1192, 1193, 1194, 1195, 1196, 1197, 1198, 1199, 1200,
1201, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210,
1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220,
1221, 1222, 1223, 1224, 1227, 1228, 1229, 1230, 1232, 1233,
1234, 1235, 1236, 1237, 1238, 1239, 1240, 1241, 1242, 1243,
1244, 1245, 1246, 1247, 1248, 1249, 1250, 1251, 1252, 1253,
1254, 1255, 1256, 1257, 1260, 1261, 1262, 1263, 1264, 1265,
1266, 1267, 1268, 1269, 1270, 1271, 1272, 1273, 1274, 1275,
1276, 1277, 1278, 1279, 3647, 7682, 7683, 7838, 8204, 8208,
8209, 8210, 8211, 8212, 8213, 8214, 8216, 8217, 8218, 8220,
8221, 8222, 8224, 8225, 8226, 8230, 8240, 8241, 8249, 8250,
8251, 8253, 8260, 8270, 8274, 8353, 8356, 8358, 8361, 8363,
8364, 8369, 8451, 8470, 8471, 8478, 8480, 8482, 8486, 8487,
8494, 8592, 8593, 8594, 8595, 9001, 9002, 9250, 9251, 9702,
9711, 9834, 10216, 10217, 7712, 7713, 64256, 64257, 64258,
64259, 64260, 64261, 64262, 65279]
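  # Keep code points below 160 and those in tex_valid; replace the rest with '?' (63).
  # to_s guards against out2 being nil if the cleanup step above failed.
  out2.to_s.unpack('U*').map { |x| (x < 160 || tex_valid.include?(x)) ? x : 63 }.pack('U*')
end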
Credit where credit is due: pack/unpack
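For what it's worth, the same idea can be sketched more compactly. This is only an illustration, not the function above: the names utf8_tex_sketch and TEX_VALID_SAMPLE are hypothetical, the whitelist is a deliberately tiny sample rather than the full list, and it uses String#scrub and a Set lookup instead of pack/unpack.

```ruby
require 'set'

# Hypothetical, deliberately tiny whitelist for illustration only:
# Latin-1 supplement (160..255) plus the euro sign (8364).
# In real use this would hold every code point from tex_valid.
TEX_VALID_SAMPLE = Set.new([*160..255, 8364]).freeze

# Sketch: strip invalid bytes and unprintable characters, then replace
# every character outside the whitelist with '?'.
def utf8_tex_sketch(txt, allowed = TEX_VALID_SAMPLE)
  # nil becomes '', and scrub('') deletes invalid UTF-8 byte sequences.
  cleaned = txt.to_s.scrub('').gsub(/[^[:print:]]/, '')
  cleaned.each_char.map do |ch|
    cp = ch.ord
    (cp < 160 || allowed.include?(cp)) ? ch : '?'
  end.join
end

# The snowman (U+2603) is not in the sample whitelist, so it becomes '?'.
puts utf8_tex_sketch("café 10 € ☃")
```

A Set makes each membership check constant time, which may matter if you run this over long strings; for short strings the plain array lookup is probably fine.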
\usepackage[utf8]{inputenx} as utf8 (following inputenc conventions) is already the default encoding in current latex – David Carlisle Jul 27 '22 at 22:19