4

Say one needs to parse text for any mention of nationality or ethnicity, or otherwise use these terms in the course of an NLP project. Is there a comprehensive resource of English words describing these? The interest is in both adjectives and nouns, e.g. "Turkish" and "Turk."

Maître Peseur
  • @aeroNotAuto The same thought crossed my mind, and I can see reasons for it fitting in either. In the end I put it here because the focus was on a tool to be used in analysis, irrespective of whether that tool is an open dataset. Anyways, I leave it up to the mods. :) –  Jun 18 '15 at 21:39

2 Answers

6

Both are available in the CIA World Factbook. Nationality names, both noun and adjective, for world countries are available here; ethnicities, listed by country, here. (This may not be historically comprehensive.)

The tabular data promised in the 'Technical' section of the FAQ doesn't seem to exist for this dataset, so some light scraping is required.
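Once the noun and adjective forms have been scraped into a lookup table, finding mentions in text could look roughly like the sketch below. The `DEMONYMS` entries are a hand-written illustrative sample, not data taken from the Factbook, and the names `build_pattern` and `find_mentions` are hypothetical helpers.

```python
import re

# Illustrative sample only: a real table would be built from the scraped pages.
# Each surface form (noun or adjective) maps to a country name.
DEMONYMS = {
    "Turk": "Turkey",
    "Turks": "Turkey",
    "Turkish": "Turkey",
    "Brazilian": "Brazil",
    "Brazilians": "Brazil",
    "Japanese": "Japan",
}


def build_pattern(terms):
    """Compile one regex that matches any of the given terms as whole words."""
    # Longer forms are listed first so they are tried before their prefixes.
    alternatives = sorted(terms, key=len, reverse=True)
    # The \b word boundaries keep "Turk" from matching inside "Turkey".
    return re.compile(r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b")


def find_mentions(text, table=DEMONYMS):
    """Return (matched form, country) pairs in order of appearance."""
    pattern = build_pattern(table)
    return [(m.group(1), table[m.group(1)]) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(find_mentions("A Turk and two Brazilians discussed Turkish coffee."))
```

Matching is case-sensitive here, which suits capitalized English demonyms but would miss lowercased text; whether to normalize case depends on the corpus.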

5

WordNet is a free semantic database of the English language. You could, for example, query the complete hyponym tree of the term "inhabitant". Depending on your application, it could be a disadvantage that WordNet also contains historical ethnicities and nationalities.
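A minimal sketch of such a hyponym query, assuming WordNet is accessed through NLTK (the answer does not name an interface). The helper names `collect_lemmas` and `wordnet_inhabitants` are hypothetical, and taking the first noun sense of "inhabitant" is an assumption worth checking by hand.

```python
def collect_lemmas(root):
    """Return the sorted lemma names of every synset below `root`.

    `root` may be any object exposing hyponyms(), instance_hyponyms()
    and lemma_names(), which is the interface of an NLTK Synset.
    """
    seen = set()
    lemmas = set()
    stack = [root]
    while stack:
        synset = stack.pop()
        if synset in seen:
            continue  # a synset can be reachable through several parents
        seen.add(synset)
        if synset is not root:
            # WordNet joins multi-word lemmas with underscores.
            lemmas.update(name.replace("_", " ") for name in synset.lemma_names())
        stack.extend(synset.hyponyms())
        stack.extend(synset.instance_hyponyms())
    return sorted(lemmas)


def wordnet_inhabitants():
    """Collect all terms below the first noun sense of "inhabitant"."""
    # Imported lazily: requires the nltk package and nltk.download("wordnet").
    from nltk.corpus import wordnet as wn

    root = wn.synsets("inhabitant", pos=wn.NOUN)[0]
    return collect_lemmas(root)
```

Calling `wordnet_inhabitants()` returns a flat list of terms. The result will include non-nationality entries as well as the historical ones mentioned above, so some manual filtering is likely needed.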

For language names, an excellent database is Glottolog.

I once compiled a dictionary of 200 language names as a CSV file in this GitHub repository.

Suzana