List of alphabet and other character data per language as structured data?

Question

I know the Unicode dataset has information such as "this character has numerical value X", and it sometimes also has a sort order for the characters. But I don't think it has information on which characters are used in which alphabets/writing systems, or on which characters serve as numbers in a given system.

Has anyone collected this information and structured it in some way in CSV or JSON? I know you can find this information on Wikipedia in random unstructured tables, but I would like to find it already aggregated if possible.

Things like the Finnish alphabet, which uses the Latin Unicode block but has different letters and a different order than the English alphabet. Or the Hebrew alphabet, where certain characters are given numerical meaning. Does any of this data exist structured somewhere? In Tibetan, each character has a pronunciation associated with it, and characters sometimes also have names in the native language. That sort of stuff.

I saw https://character-table.netlify.app/ but it looks like simple Unicode mappings, so I'm not sure if anything else exists. Omniglot has charts, which are spreadsheets covering many alphabets, but they would have to be heavily reworked, as the xls format they are in is not easily convertible to JSON.

Nicolas Raoul · Answer 1 · 2023-09-08T05:06:01.780

1

If Wikipedia has the unstructured data, chances are Wikidata has the structured data. :-)

After finding the "natural writing system" item in Wikidata, I wrote this query to get you started:

SELECT ?naturalWritingSystem ?naturalWritingSystemLabel ?unicodeRange
WHERE 
{
  ?naturalWritingSystem wdt:P31 wd:Q29517555. # Must be an instance of "natural writing system"
  OPTIONAL{
    ?naturalWritingSystem wdt:P5949 ?unicodeRange.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Result sample: [screenshot of the query result table]

You can run the query here: https://w.wiki/7QfQ and select JSON as the output. It will be table-shaped JSON though, with the left columns repeated where needed.
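
To deal with the repeated left columns, the rows can be grouped per writing system after download. Here is a minimal sketch, assuming the JSON follows the standard SPARQL results layout (`results.bindings`, with each cell holding a `value`). The function name `group_bindings` and the sample rows are made up for illustration; the URIs, labels and range strings are placeholders, not real Wikidata output.

```python
def group_bindings(sparql_json):
    """Collapse repeated rows into one record per writing system."""
    grouped = {}
    for row in sparql_json["results"]["bindings"]:
        uri = row["naturalWritingSystem"]["value"]
        record = grouped.setdefault(uri, {
            "uri": uri,
            "label": row.get("naturalWritingSystemLabel", {}).get("value"),
            "unicodeRanges": [],
        })
        # ?unicodeRange is OPTIONAL in the query, so the key may be missing.
        if "unicodeRange" in row:
            value = row["unicodeRange"]["value"]
            if value not in record["unicodeRanges"]:
                record["unicodeRanges"].append(value)
    return list(grouped.values())


# Placeholder data shaped like SPARQL JSON results (not real query output).
sample = {
    "results": {
        "bindings": [
            {
                "naturalWritingSystem": {"value": "http://example.org/entity/A"},
                "naturalWritingSystemLabel": {"value": "Example script A"},
                "unicodeRange": {"value": "RANGE-1"},
            },
            {
                "naturalWritingSystem": {"value": "http://example.org/entity/A"},
                "naturalWritingSystemLabel": {"value": "Example script A"},
                "unicodeRange": {"value": "RANGE-2"},
            },
            {
                "naturalWritingSystem": {"value": "http://example.org/entity/B"},
                "naturalWritingSystemLabel": {"value": "Example script B"},
            },
        ]
    }
}

records = group_bindings(sample)
for record in records:
    print(record["label"], record["unicodeRanges"])
```

The output is one nested record per writing system, which is probably closer to the JSON shape you are after than the flat table.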

To answer your question in more detail, you would need to use more properties and cross-reference with the character or grapheme items (possibly starting from the Unicode ranges in my screenshot above).
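
Going from a Unicode range to individual characters can be sketched with Python's standard `unicodedata` module. This is only an illustration, assuming the range has already been parsed into two integers; the helper name `describe_range` is hypothetical, and the Hebrew block (U+0590 to U+05FF) is used purely as an example. It only exposes what the Unicode Character Database itself stores, so language-specific data such as native letter names or pronunciations would still have to come from elsewhere.

```python
import unicodedata


def describe_range(start, end):
    """List the named characters in an inclusive code point range."""
    rows = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, None)
        if name is None:
            continue  # skip code points without a name (e.g. unassigned)
        rows.append({
            "codepoint": f"U+{cp:04X}",
            "char": ch,
            "name": name,
            "category": unicodedata.category(ch),
            # None when Unicode defines no numeric value for the character
            "numeric": unicodedata.numeric(ch, None),
        })
    return rows


# Example: the Hebrew block.
hebrew = describe_range(0x0590, 0x05FF)
for row in hebrew[:5]:
    print(row)
```

Each row could then be joined against the Wikidata results by code point.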

edited Sep 08 '23 at 05:06

answered Sep 07 '23 at 01:48

Nicolas Raoul


Interesting! Will this give the Hebrew name of the glyph in Hebrew text, as well as the associations to other characters, as per all the various tables here? Wikidata SPARQL is hard to work with / time consuming, so before investing all the effort I would like to know it's probably there :) – Lance Sep 07 '23 at 19:12
I just tried now, it seems that not all of the information is in Wikidata yet unfortunately :-/ – Nicolas Raoul Sep 08 '23 at 04:59
