
As an example, I would like to sort the following characters by some sort of complexity measure:

Join[CharacterRange["b", "z"], CharacterRange["β", "ω"], CharacterRange["ㄱ", "ㅣ"]]

Out:

(* {b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,
    y,z,β,γ,δ,ε,ζ,η,θ,ι,κ,λ,μ,ν,ξ,ο,π,ρ,ς,σ,τ,υ,φ,χ,
    ψ,ω,ㄱ,ㄲ,ㄳ,ㄴ,ㄵ,ㄶ,ㄷ,ㄸ,ㄹ,ㄺ,ㄻ,ㄼ,ㄽ,ㄾ,ㄿ,ㅀ,ㅁ,ㅂ,
    ㅃ,ㅄ,ㅅ,ㅆ,ㅇ,ㅈ,ㅉ,ㅊ,ㅋ,ㅌ,ㅍ,ㅎ,ㅏ,ㅐ,ㅑ,ㅒ,ㅓ,ㅔ,ㅕ,ㅖ,
    ㅗ,ㅘ,ㅙ,ㅚ,ㅛ,ㅜ,ㅝ,ㅞ,ㅟ,ㅠ,ㅡ,ㅢ,ㅣ} *)

One may also consider other symbols such as a list of emojis.

Any measure that roughly sorts the characters/symbols in terms of their complexity would do.


This post was originally a question, but it has since taken the form of a question and answer. The purpose of this post is to share ideas, so please do not hesitate to comment or to post answers with other methods or improvements to my answer below.

userrandrand
  • As a starting point, have a look at these articles: https://community.wolfram.com/groups/-/m/t/2509775 and https://community.wolfram.com/groups/-/m/t/2516662?p_p_auth=oCPa5OZW – demm Oct 17 '22 at 13:47
  • @demm that looks quite cool and it does seem relevant to the question, thank you! – userrandrand Oct 17 '22 at 13:55

1 Answer


Algorithms

Here are some possible complexity measures. Interestingly, each one is rooted in a different area of mathematics.

1) Memory cost of JPEG conversion (cosine transform: analysis)

Note: This method can be misleading when the characters are not all the same size, since the size of the character itself contributes to the memory cost. See the last paragraph of this answer.

(* complexity = byte count of the character rasterized and exported as JPEG *)
imagecomplexity = 
  ByteCount@ExportByteArray[#, "JPEG", ImageResolution -> 500] &
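
For example, applying it to a few characters (the exact byte counts depend on the fonts and the rasterization, so treat the numbers as illustrative only):

imagecomplexity /@ {"i", "o", "g"}
(* larger byte counts should correspond to visually busier glyphs *)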

2) Image entropy (entropy: information theory)

Note: I do not know how this is implemented in Mathematica, but this Stack Exchange discussion seems to suggest that it is better to calculate the entropy of derivatives of the image. With an x-derivative I did not notice a big difference.

(* complexity = entropy of the rasterized character image *)
imagecomplexity2[x_] := 
  ImageMeasurements[Rasterize[x, ImageResolution -> 500], "Entropy"]
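
As a rough illustration of the derivative idea mentioned in the note (this is not the code used for the results below, and imagecomplexity2x is a hypothetical name), one could measure the entropy of the x-derivative of the image instead:

(* hedged sketch: entropy of the x-derivative image, rescaled to [0, 1] first *)
imagecomplexity2x[x_] := 
  ImageMeasurements[
    ImageAdjust@DerivativeFilter[Rasterize[x, ImageResolution -> 500], {0, 1}], 
    "Entropy"]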

3) Number of dominant singular values (singular values: linear algebra)

(* complexity = number of singular values above 2% of their total sum *)
imagecomplexity3[x_] := 
  Module[{cl}, 
    cl = Rasterize[x, ImageResolution -> 500] // Binarize // ImageData // 
      N // SingularValueList; 
    Select[cl, # > 0.02*Total[cl] &] // Length]

Note: Select[cl, # > 0.02*Total[cl] &] // Length in the above Module can be replaced by cl // N // Threshold[#, 0.02*Total@#] & // Pick[#, Thread[# != 0]] & // Length, which was about 14 times faster in my tests.
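
Packaged as a function, that faster variant might look like this (imagecomplexity3fast is a hypothetical name; it should return the same counts as imagecomplexity3):

(* hedged sketch: the faster thresholding variant of the singular-value count *)
imagecomplexity3fast[x_] := 
  Module[{cl}, 
    cl = Rasterize[x, ImageResolution -> 500] // Binarize // ImageData // 
      N // SingularValueList; 
    cl // Threshold[#, 0.02*Total@#] & // Pick[#, Thread[# != 0]] & // Length]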

4) Maximum intersections of the character/symbol with a randomly moving and randomly oriented line (geometric intersections: geometry)

Note: Depending on your setup, you may want to change ParallelTable to Table in the code below. I personally find that this method gives nice results, but it can take some time to obtain a result for many characters.

(* maximum number of times a random infinite line crosses the character *)
randomIntersections[char_, numpts_] :=
  Module[{im, boundingrectangle, il, randomPoint, randomDirection, line},
    (* turn the character into a polygon region *)
    im = char // Rasterize[#, ImageResolution -> 500] & // ColorNegate // 
      Binarize // ImageMesh // CanonicalizePolygon;
    boundingrectangle = BoundingRegion@im;
    il = ParallelTable[
      randomPoint = RandomPoint@boundingrectangle;
      randomDirection = RandomPoint@Circle[];
      line = InfiniteLine[randomPoint, randomDirection];
      (* each stroke crossing meets the boundary twice, hence the /2 *)
      Length@Flatten[List @@ RegionIntersection[RegionBoundary@im, line], 1]/2,
      numpts, Method -> "CoarsestGrained"];
    Max@il]
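
Example usage (the result depends on the random lines drawn and on how the character is rasterized on your system, so it may vary slightly between runs):

randomIntersections["w", 400]
(* the score is the maximum number of stroke crossings found among the 400 random lines *)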

Tests

Note: The symbols below look a bit different than they do in Mathematica. Also, the results might be affected by how the characters are transformed into images. Perhaps a preprocessing step that simplifies the images into simpler lines is needed.

Personal favorites

My personal favorite in terms of accuracy is the method that searches for the maximum number of intersections with a randomly moving infinite line (number 4 below). However, it is a bit slow. My second favorite is the JPEG method, which is faster. I do not trust that the entropy method is aware of 2D spatial structure, but I do not know how it is implemented.

Form of outputs below

The outputs below are associations whose elements follow the pattern complexity -> characters, where the characters are grouped by their common complexity value.
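
The list test used below is not defined explicitly in this post; presumably it is the character list from the question:

(* presumably the character list from the question *)
test = Join[CharacterRange["b", "z"], CharacterRange["β", "ω"], 
   CharacterRange["ㄱ", "ㅣ"]];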

1) JPEG

Table[{char, imagecomplexity@char}, {char, test}] // 
GroupBy[#, Last -> First] & // KeySort

<|2792->{i,ι,ㅡ},2920->{r,ㄴ,ㅣ},3048->{l,ㄱ,ㅏ},3176->{f,t,τ,ㅓ,ㅜ},3304->{j,u,ㄷ,ㅗ,ㅛ,ㅠ},3432->{n,z,ν,ㄲ,ㅋ,ㅑ,ㅕ},3560->{h,v,ε,ς,υ,ㅁ,ㅢ},3688->{c,k,λ,π,ㅂ,ㅅ,ㅌ,ㅍ,ㅐ},3816->{e,m,s,σ,ㄹ,ㅒ,ㅚ,ㅟ},3944->{d,o,q,x,η,κ,μ,ο,ㄸ,ㄺ,ㅔ},4072->{b,p,γ,ρ,ㄳ,ㅈ,ㅖ,ㅘ,ㅝ},4200->{y,ω,ㄾ,ㅃ},4456->{w,δ,ξ,χ,ㄵ,ㄻ,ㄼ,ㄽ,ㄿ,ㅄ,ㅆ,ㅇ,ㅊ,ㅎ,ㅙ},4712->{ζ,θ,φ,ψ,ㄶ,ㅉ},4968->{β,ㅀ,ㅞ},5224->{g}|>

2) Singular values

Table[{char, imagecomplexity3@char}, {char, test}] // 
GroupBy[#, Last -> First] & // KeySort

<|2->{ㅡ,ㅣ},3->{ㄱ,ㄲ,ㄴ,ㄷ,ㅁ,ㅂ,ㅃ,ㅌ,ㅏ,ㅑ,ㅓ,ㅕ,ㅗ,ㅛ,ㅜ,ㅠ},4->{i,ㄸ,ㄹ,ㅋ,ㅍ,ㅐ,ㅒ,ㅔ,ㅖ,ㅢ},5->{f,j,l,m,n,r,u,x,ㄳ,ㄾ,ㅄ,ㅅ,ㅈ,ㅊ,ㅚ,ㅟ},6->{b,c,d,e,h,o,p,q,s,t,v,ι,ㄵ,ㄶ,ㄺ,ㄻ,ㄼ,ㄽ,ㄿ,ㅆ,ㅇ,ㅉ,ㅎ,ㅘ,ㅝ},7->{g,k,w,z,ε,ς,τ,υ,ㅀ,ㅙ,ㅞ},8->{y,γ,η,θ,λ,μ,ν,ξ,ο,π,σ,φ,χ,ψ,ω},9->{β,δ,ζ,κ,ρ}|>

3) Entropy

Table[{char, imagecomplexity2@char}, {char, test}] // 
GroupBy[#, Last -> First] & // KeySort

<|0.235058->{ㅡ},0.241284->{ι},0.271079->{ㅣ},0.274011->{ㄱ},0.276856->{ㄴ},0.304594->{τ},0.317419->{ㅏ},0.319768->{ㅜ},0.325027->{ㅓ},0.327208->{ㅗ},0.32967->{ν},0.332074->{ㅅ},0.364863->{r},0.365673->{ㅑ},0.366511->{ㅕ},0.368454->{υ},0.370989->{ㄷ},0.371001->{ς},0.374665->{i},0.382829->{ε},0.383106->{ㄲ},0.394432->{π},0.400003->{ㅛ},0.401677->{ㅋ},0.407224->{κ},0.413434->{ㅈ},0.414058->{ㅠ},0.41415->{ㅢ},0.415045->{λ},0.426538->{ㄳ},0.42797->{ο},0.428693->{μ},0.432186->{ㅇ},0.432606->{γ},0.434306->{σ},0.435502->{η},0.436982->{ㅊ},0.438818->{ㅁ},0.44237->{ㅍ},0.443258->{l},0.445108->{ㅂ},0.447132->{ㅆ},0.451915->{v},0.458128->{ω},0.458387->{ㅌ},0.458857->{ㄸ},0.458868->{ㄹ},0.461452->{ㅚ},0.461714->{ㄵ},0.463049->{ρ},0.466197->{χ},0.468201->{ㅎ},0.473909->{ㅐ},0.479558->{c},0.480259->{ㄺ},0.480597->{ㅔ},0.487572->{ㅟ},0.487999->{t},0.492021->{n},0.492295->{f},0.493094->{z},0.493322->{ㅒ},0.493751->{ㅘ},0.49423->{ㄶ},0.494514->{x},0.498986->{ㄽ},0.499355->{u},0.508466->{ㅄ},0.509855->{ㅖ},0.516118->{ξ},0.516527->{j},0.520078->{δ},0.524228->{ㅝ},0.529499->{φ},0.537595->{ㄾ},0.538682->{ψ},0.540099->{ㅉ},0.541803->{s},0.552829->{h},0.553162->{y},0.555202->{ㅃ},0.555449->{θ},0.558212->{ㄻ},0.560454->{ζ},0.562477->{o},0.565689->{ㄿ},0.565989->{ㅀ},0.567774->{ㄼ},0.568545->{k},0.574654->{β},0.585289->{e},0.601765->{ㅙ},0.619914->{m},0.623035->{ㅞ},0.626322->{p},0.628793->{b},0.630582->{q},0.631354->{d},0.667572->{w},0.767331->{g}|>

4) Intersections with random lines

Table[{char, randomIntersections[char, 400]}, {char, test}] // 
GroupBy[#, Last -> First] & // KeySort

<|1->{ㅡ,ㅣ},2->{c,i,o,r,v,x,ι,ο,ㄱ,ㄴ,ㄷ,ㅁ,ㅇ,ㅏ,ㅓ,ㅗ,ㅜ,ㅢ},3->{b,d,e,f,h,j,k,l,n,p,q,s,t,u,y,z,γ,θ,ν,π,ρ,ς,σ,τ,υ,φ,ㄲ,ㄵ,ㄸ,ㄹ,ㅂ,ㅅ,ㅈ,ㅊ,ㅋ,ㅌ,ㅐ,ㅑ,ㅔ,ㅕ,ㅖ,ㅘ,ㅚ,ㅛ,ㅟ,ㅠ},4->{m,w,β,δ,ε,ζ,η,κ,λ,μ,χ,ψ,ω,ㄳ,ㄶ,ㄺ,ㄻ,ㄼ,ㄽ,ㄾ,ㅀ,ㅄ,ㅆ,ㅉ,ㅍ,ㅎ,ㅒ,ㅝ},5->{g,ξ,ㄿ,ㅃ,ㅙ,ㅞ}|>

[ Edit: Answering the comment about why g has 5 intersections. The letters here do not necessarily appear the same way in a Mathematica notebook. The image below shows how g is rendered in my notebook, together with the 10 intersections of an infinite line with its boundary (10/2 = 5 crossings, the value the code reports, since the strokes are not infinitely thin).

[image: the glyph "g" as rendered in my notebook, with an infinite line crossing its boundary at 10 points]

]
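
For anyone who wants to reproduce a picture like this, here is a rough sketch along the lines of the code above (showCrossings is a hypothetical name; it is not the exact code used for the image):

(* hedged sketch: draw the glyph region, one random line, and its boundary crossings *)
showCrossings[char_] := 
  Module[{im, pt, dir, line, pts}, 
    im = char // Rasterize[#, ImageResolution -> 500] & // ColorNegate // 
      Binarize // ImageMesh // CanonicalizePolygon;
    pt = RandomPoint@BoundingRegion@im;
    dir = RandomPoint@Circle[];
    line = InfiniteLine[pt, dir];
    pts = Flatten[List @@ RegionIntersection[RegionBoundary@im, line], 1];
    Graphics[{LightGray, im, Red, line, PointSize[Large], Point[pts]}]]

showCrossings["g"]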

For fun/curiosity

I used the entropy and JPEG methods to sort almost all the characters in Mathematica by image complexity (10 to 20 minutes each on an i7 laptop). I removed the characters that failed to convert to expressions, and I removed uppercase letters. The entropy and JPEG methods give similar results for the cases below, except for the top 10, which are quite different.

Note: I am no longer sure which resolution I chose for the results below. Perhaps it was ImageResolution->200.
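
I did not keep the exact code for that run, but a minimal sketch of the idea, on a small sample of code points and without the filtering steps mentioned above, could look like this:

(* hedged sketch: rank a small sample of code points by the entropy measure *)
sample = FromCharacterCode /@ Range[16^^3131, 16^^3163]; (* Hangul compatibility jamo *)
ranked = SortBy[sample, imagecomplexity2]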

Entropy

The letter "a" is at position 7680 among the 53917 characters in the list I considered. The 10 symbols before and after are (the \[FormalF] does not render on stack exchange):

{ொ, ꭠ, ꕎ, ḥ, ☭, ፓ, ㆀ, ﶔ, \[FormalF], ભ, a, 군, သ, ᕗ, 츠, ⰸ, ഡ, ﮀ, ウ, ᾨ, ꋭ}

  • The first 10 among the top 75% most complex:

{ᘜ, ຢ, 㲹, 갖, ⱷ, ꩷, 껸, 氿, ᩀ, 昈, 訌}

  • The first 10 among the top 50%:

{夆, 采, ⾡, 쐣, 씑, 묆, 펳, 뒱, 뢤, 閃, 荝}

  • The first 10 among the top 25%:

{旋, 㸆, 傿, 餁, 嶞, 㑚, 鞝, 䅈, 誃, 镍, 䯮}

  • The last 10:

{䨻, 䨺, 鿩, 鿧, ⛓, ☸, 龘, 鿡, ⛲, ▓}

JPEG

The letter "a" is at position 5419 among the 53917 characters in the list I considered. The 10 symbols before and after are (\:03d9 does not render on stack exchange) :

{ᴑ, ჺ, ⁕, ✰, m, ɜ, q, ɘ, s, \:03d9, a, e, ç, ↺, қ, ҡ, ʊ, ṿ, щ, ų, x}

Some peculiar elements in the top 10 are:

  • a triangle symbol
  • a circulation (contour) integral symbol
  • a double circulation integral symbol

I suppose this is because of their unusual size (to see this, copy and paste {a,\:f06e, \:f04c, \:f04a} into a notebook), which may bloat the memory size of the JPEG compression. The entropy method seems less susceptible to variations in character size.
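
One possible mitigation, which I have not tested systematically, would be to crop and resize each rasterized glyph to a fixed frame before measuring, so that the nominal character size matters less (imagecomplexityNormalized is a hypothetical name):

(* hedged sketch: normalize glyph size before the JPEG byte-count measure *)
imagecomplexityNormalized[x_] := 
  ByteCount@ExportByteArray[
    ImageResize[ImageCrop@Rasterize[x, ImageResolution -> 500], {128, 128}], 
    "JPEG"]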

userrandrand
  • Nice! Also, out of curiosity, could you illustrate the random line crossings for "g" that give it a score of 5? I'm having visualisation problems with that as shown! – Julian Moore Oct 20 '22 at 08:07
  • @JulianMoore Hi, thanks. The "g" is not the same in my notebook. I added a picture in the post showing the intersections with that "g". That "g" looks a bit tedious to draw, so I guess it's not too surprising that it has high complexity scores with a lot of the methods. – userrandrand Oct 20 '22 at 13:13
  • Now it makes sense:) Thanks! – Julian Moore Oct 20 '22 at 17:50