11

I'm using LaTeX to generate PDFs programmatically. I know that I can use \DeclareUnicodeCharacter, but that is practical only when you have a single file. I'm generating PDFs from database content that was scraped from the web.

I'd rather have a PDF with a few jumbled characters than have my users get an error when they try to generate one.

How can I force LaTeX to simply discard unknown characters? Since I'm working with scraped content, they are mostly there by mistake.

I've seen this question but the asker there actually wants those characters in the output. I do not. I couldn't care less if a few characters per document are missing. I just want LaTeX to discard those characters.

I'm using pdflatex for compatibility reasons, so I'd like to avoid xetex or luatex as changing the rendering engine will probably require us to retest all the documents.

Minimal error example:


\documentclass[a4paper]{article}
\usepackage[utf8]{inputenx}
\usepackage[croatian]{babel}

\begin{document}

Sometimes crawler gets letters that look the same but throw errors, like: O or о (Cyrillic) instead of Latin O or o

\end{document}

In this example I could probably tell babel that it's Cyrillic text and it would work, but I can't know beforehand what crazy character the crawler will pick up.

ALSO: another option would be to get a list of all the characters that pdflatex DOES know how to work with, so that I can discard all the other characters programmatically.

Levara
  • 213
  • 3
    You could (and probably should) delete \usepackage[utf8]{inputenx}, as utf8 (following inputenc conventions) is already the default encoding in current LaTeX – David Carlisle Jul 27 '22 at 22:19
  • 1
    Actually the answer under this question looks better... maybe reverse the direction. – user202729 Jul 28 '22 at 16:31
  • @user202729 did you mean this? https://tex.stackexchange.com/a/652216/41953 See the comment below. – Levara Jul 29 '22 at 08:18

4 Answers

17

You can redefine the error message:

\documentclass[a4paper]{article}
\usepackage[croatian]{babel}

\makeatletter
\def\UTFviii@undefined@err#1{??UPS??}
\makeatother

\begin{document}

Sometimes crawler gets letters that look the same but throw errors, like: O or о (cyrillic) instead of latin O or o

\end{document}


Since 2018, utf8 is the default input encoding, so loading inputenc or inputenx is not needed, though it does no harm. If inputenc is loaded in the document, it must be loaded before the redefinition of the error message so that it doesn't overwrite the redefinition.

Ulrike Fischer
  • 327,261
  • I tried this but I don't get anything sensible. It's ignored. It seems that it does indeed compile, but it still throws an error. And since my system expects a success return code, I still have the problem. Is there a way to remap that error into a warning? – Levara Jul 29 '22 at 08:15
  • It works fine for me with a current LaTeX and doesn't error. It should also work in older LaTeX; I tried back to texlive 2014. But you need to remove the inputenx package, or at least ensure that the definition comes after the package has been loaded and that you don't use anything that changes the command again. And naturally ensure that the errors you get now are really from unknown utf8 chars and not something else. – Ulrike Fischer Jul 29 '22 at 08:23
  • so I should completely remove inputenc package? Why? Babel should be left in? My version is the one available in ubuntu 20.04 repos: pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019/Debian) (preloaded format=pdflatex 2020.10.19) – Levara Jul 29 '22 at 08:33
  • utf8 is the default input encoding in LaTeX since 2018, so you no longer need the inputenc or inputenx package. Try it out. – Ulrike Fischer Jul 29 '22 at 08:36
  • Ok, this works without inputenc package. Many thanks. Maybe you could update the answer explicitly stating that inputenc is not necessary and it breaks your answer.

    I'll mark this answer as accepted, since it more elegantly solves the problem, even though I implemented David's answer as well. Sorry David :)

    – Levara Jul 29 '22 at 09:03
11

I'd use Ulrike's answer, but to answer your ALSO question, you can get a list of Unicode code points known to the base LaTeX distribution from

/usr/local/texlive/2022/texmf-dist/tex/latex/base/utf8enc.dfu

(or whatever path kpsewhich utf8enc.dfu gives on your system)
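A minimal Ruby sketch of that extraction, assuming each code point in the file is declared on a line of the form \DeclareUnicodeCharacter{00A0}{...}. The helper name dfu_code_points is made up for illustration, and the commented-out usage assumes kpsewhich is on your PATH:

```ruby
# Hypothetical helper: collect the code points declared in the text of a .dfu file.
# Assumes declarations look like \DeclareUnicodeCharacter{00A0}{\nobreakspace}.
def dfu_code_points(dfu_text)
  dfu_text.each_line.flat_map do |line|
    code = line[/\A[^%]*/] # keep only the part before the first TeX comment sign
    code.scan(/\\DeclareUnicodeCharacter\{(\h+)\}/).flatten.map { |hex| hex.to_i(16) }
  end.uniq.sort
end

# Usage (not run here, needs a TeX installation):
#   path = `kpsewhich utf8enc.dfu`.strip
#   p dfu_code_points(File.read(path))
```

The result is a sorted array of integers that can be pasted into, or loaded by, whatever filter you run before calling pdflatex.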

David Carlisle
  • 757,742
0

Pass your scraped data through a filter that removes all poorly-behaved Unicode characters. For example, use a regex search-and-replace that replaces every character outside the allowed set with the empty string.

One possible example, in Perl:

perl -CSD -pe "s/[^\p{ASCII}]//gu"

This is just one example, and you would want to change the regex to include all the characters you actually use (as in Levara’s answer). You might also want to perform character-set conversion first.
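As a rough illustration of such a tweak, here is the same idea in Ruby with the allowed set widened to ASCII plus the Croatian letters outside ASCII. The constant and method names are made up for this sketch, and the letter list is only an assumption about what your documents need:

```ruby
# Hypothetical allow-list: ASCII plus the Croatian letters outside ASCII.
# Everything that matches this negated class is deleted.
DISALLOWED = /[^\p{ASCII}čćžšđČĆŽŠĐ]/

# Remove every character that is not in the allowed set.
def keep_allowed(text)
  text.gsub(DISALLOWED, '')
end

puts keep_allowed("Sometimes the crawler picks up о instead of o, but č stays.")
```

Extending the bracket expression (or building it from a list of code points) is all it takes to allow more characters.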

Davislor
  • 44,045
  • This will remove all characters like č,ć,ž, or german letters with accents? I'm not looking to convert the text to ascii, I'm looking to ignore unicode errors. – Levara Jul 29 '22 at 08:14
  • @Levara That was a simple example I happened to have on hand. You can, of course, tweak the regex to eliminate the right characters. – Davislor Jul 29 '22 at 17:18
0

Thanks to @David Carlisle's answer, I've managed to extract all the Unicode code points that LaTeX knows how to print, and I've written a simple function in Ruby that keeps only those characters in the string and replaces all the others with a ? character.

### Not the prettiest code in the world, but it works for me.
### First we drop all invalid UTF-8 characters, then all unprintable
### characters, and finally we replace every character that LaTeX
### does not know how to render with "?".
def utf8_tex(txt)
  begin
    out = txt&.chars&.select(&:valid_encoding?)&.join
    out2 = out&.gsub(/[^[:print:]]/, '')
  rescue
    puts "fail on remove invalid"
  end
tex_valid = [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
             172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
             184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
             196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
             208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219,
             220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231,
             232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243,
             244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,
             256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267,
             268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279,
             280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291,
             292, 293, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305,
             306, 307, 308, 309, 310, 311, 313, 314, 315, 316, 317, 318,
             321, 322, 323, 324, 325, 326, 327, 328, 330, 331, 332, 333,
             334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345,
             346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357,
             360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371,
             372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 402,
             461, 462, 463, 464, 465, 466, 467, 468, 482, 483, 486, 487,
             488, 489, 490, 491, 496, 500, 501, 536, 537, 538, 539, 562,
             563, 567, 710, 711, 732, 728, 729, 731, 733, 1024, 1025, 1026,
             1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036,
             1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046,
             1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056,
             1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066,
             1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076,
             1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086,
             1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096,
             1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106,
             1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116,
             1117, 1118, 1119, 1122, 1123, 1130, 1131, 1138, 1139, 1140,
             1141, 1142, 1143, 1164, 1165, 1166, 1167, 1168, 1169, 1170,
             1171, 1172, 1173, 1174, 1175, 1176, 1177, 1178, 1179, 1180,
             1181, 1182, 1183, 1184, 1185, 1186, 1187, 1188, 1189, 1190,
             1191, 1192, 1193, 1194, 1195, 1196, 1197, 1198, 1199, 1200,
             1201, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210,
             1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220,
             1221, 1222, 1223, 1224, 1227, 1228, 1229, 1230, 1232, 1233,
             1234, 1235, 1236, 1237, 1238, 1239, 1240, 1241, 1242, 1243,
             1244, 1245, 1246, 1247, 1248, 1249, 1250, 1251, 1252, 1253,
             1254, 1255, 1256, 1257, 1260, 1261, 1262, 1263, 1264, 1265,
             1266, 1267, 1268, 1269, 1270, 1271, 1272, 1273, 1274, 1275,
             1276, 1277, 1278, 1279, 3647, 7682, 7683, 7838, 8204, 8208,
             8209, 8210, 8211, 8212, 8213, 8214, 8216, 8217, 8218, 8220,
             8221, 8222, 8224, 8225, 8226, 8230, 8240, 8241, 8249, 8250,
             8251, 8253, 8260, 8270, 8274, 8353, 8356, 8358, 8361, 8363,
             8364, 8369, 8451, 8470, 8471, 8478, 8480, 8482, 8486, 8487,
             8494, 8592, 8593, 8594, 8595, 9001, 9002, 9250, 9251, 9702,
             9711, 9834, 10216, 10217, 7712, 7713, 64256, 64257, 64258,
             64259, 64260, 64261, 64262, 65279]

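  # Keep code points below 160 and those in tex_valid; map everything else to "?" (63).
  # to_s guards against out2 being nil (nil input or a failure above).
  clean = out2.to_s.unpack('U*').map { |x| (x < 160 || tex_valid.include?(x)) ? x : 63 }.pack('U*')
end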

Credit where credit is due: pack/unpack

Levara
  • 213
  • If it would be interesting, I could expand the answer with the details of how I extracted all the chars from the file David Carlisle suggested. – Levara Jul 29 '22 at 08:52
  • the list of chars LaTeX handles can change, for example if you load font encodings which have an accompanying .dfu, or if some package or document declares more chars. – Ulrike Fischer Jul 29 '22 at 09:40