10

I am looking for datasets or huge lists of human forenames. There are plenty of websites that curate lists of names, but none of them seems to let you export the raw data or list more than a few dozen names per page.

My criteria for the data are:

  • each entry needs to be unique
  • each entry needs to be human readable (not just a random collection of letters such as: quwertzpl)
  • entries need to number 10,000+ (ten thousand or more)
Patrick Hoefler
dot_Sp0T

6 Answers

16

The best source of international human given (first) names comes from a German computer magazine. The text file has nearly 50k names, classified by likely gender and by popularity in each country. It's carefully curated and has a friendly license (GNU Free Documentation License 1.2).

The file can be downloaded here: ftp://ftp.heise.de/pub/ct/listings/0717-182.zip (nam_dict.txt contains the data).

Archive Link: https://web.archive.org/web/20200414235453/ftp://ftp.heise.de/pub/ct/listings/0717-182.zip

Instead of parsing this file, you can use the Python port SexMachine (really), which is available as a package and as a GitHub repo. I'm sure other languages have their own ports. There is also a Windows executable (details).

(my reference)


For US baby names, you can use the Social Security Administration's download (overview) and link to data. The data is available at the national or state level and goes back to the late 19th century.

The SSA notes: "To safeguard privacy, we restrict our list of names to those with at least 5 occurrences."

You'll also find ports of this data to various languages.
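If you would rather work with the raw SSA download, here is a minimal Python sketch for collecting unique names. It assumes the national files are plain text with one `name,sex,count` record per line (for example `Mary,F,7065`) and are named like `yob1880.txt`; check the readme in the download before relying on that. The function names and the directory argument are hypothetical, chosen for illustration.

```python
import csv
import io
from pathlib import Path


def unique_names(lines):
    """Collect unique names from lines shaped like 'Mary,F,7065'."""
    names = set()
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        name = row[0].strip()
        if name.isalpha():  # crude "human readable" filter: letters only
            names.add(name)
    return names


def unique_names_in_dir(directory):
    """Merge names from every yob*.txt file in a directory (assumed layout)."""
    names = set()
    for path in sorted(Path(directory).glob("yob*.txt")):
        with path.open(newline="", encoding="utf-8") as handle:
            names |= unique_names(handle)
    return names


# Tiny in-memory sample standing in for one yearly file.
sample = io.StringIO("Mary,F,7065\nAnna,F,2604\nJohn,M,9655\nMary,M,27\n")
sample_names = sorted(unique_names(sample))
print(sample_names)
```

Using a set means a name recorded for both sexes or in several years is counted once, which covers the uniqueness requirement from the question. Whether the merged result passes 10,000 entries can then be checked with `len(...)`.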

philshem
5

If you are an R user, you can load the SSA baby names dataset directly in R via the package 'babynames' (from the great Hadley Wickham), which is available on CRAN.

jeborsel
2

Here are CSV files containing first names and surnames on a cool data-sharing platform: https://data.world/len/us-first-names-database

Noah
1

The best source of international human given (first) names is the official statistics published by individual countries. Damegender has compiled an open data collection from multiple countries (Austria, Australia, Belgium, Canada, Germany, Denmark, Spain, Finland, Great Britain, Ireland, Iceland, Mexico, New Zealand, Portugal, Uruguay, Slovenia, United States of America, ...).

More information:

user949842
680593 names, 347363 for females, 333230 for males. 16 countries. The claimed number of names is impressive, but only for 16 countries? Seems suspect to me :/ – dot_Sp0T Apr 23 '21 at 09:20
1

For structured data of that size with associated metadata, you probably want something like census.name (requires a paid license) or philipperemy/name-dataset (free as in beer).

You can also try querying Wikidata, per this answer. I have personally had good luck getting useful data out of Wikidata, but it typically involves many hours of iterative query building for complex queries, and I usually end up eschewing the built-in SPARQL system for BigQuery before I'm done, which can get expensive. That said, the scope of the data is unreal, and it's free as in speech and free as in beer.
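As a starting point for the Wikidata route, here is a rough Python sketch against the public SPARQL endpoint at `https://query.wikidata.org/sparql`. It rests on assumptions you should verify: that `Q202444` is the item for "given name", and that matching on instance-of (`P31`) plus subclass-of (`P279`) picks up the names you want. The function names and the User-Agent string are hypothetical, and a broad query like this may well time out, which is exactly where the iterative query building comes in.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"
GIVEN_NAME_CLASS = "Q202444"  # assumption: Wikidata item for "given name"


def build_query(limit=1000):
    """Build a SPARQL query for English labels of given-name items."""
    return (
        "SELECT DISTINCT ?label WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{GIVEN_NAME_CLASS} ;\n"
        "        rdfs:label ?label .\n"
        '  FILTER(LANG(?label) = "en")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def fetch_names(limit=1000):
    """Run the query over HTTP and return the labels as a list of strings."""
    params = urllib.parse.urlencode({"query": build_query(limit), "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": "name-list-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        data = json.load(response)
    return [row["label"]["value"] for row in data["results"]["bindings"]]


if __name__ == "__main__":
    print(build_query(limit=10))
```

Calling `fetch_names(100)` would issue the request. Start with a small limit and refine the query from there.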

There are smaller datasets like sigpwned/names-by-country-dataset (free as in speech, free as in beer) that are more in the 1Ks than 10Ks, but might at least help you get started. (Disclaimer: I am the author of this dataset.)

sigpwned
0

For information on the SSA dataset, see:

http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data

Hyon Kim