There is a ton of good data buried in the Wolfram data stores; however, accessing it in bulk tends to be excruciatingly painful, since the access constructs are designed for fetching a single nugget at a time.
My current issue is that I'm pulling down the NY Times COVID-19 data from
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
and I'd like to integrate it with demographic data.
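The eventual merge should be the easy part; a minimal JoinAcross sketch, assuming both sides have been massaged into lists of associations (all names here are illustrative, not my actual code):
(* Hypothetical sketch: join NYT case rows with a demographics table
   on the {county, state} pair; covidRows and demographicRows are
   illustrative names for lists of associations *)
combined = JoinAcross[covidRows, demographicRows, {"county", "state"}];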
The demographic pull, however, winds up being pig-dog slow, with each set of characteristics taking about 5 seconds per county. To illustrate, let's pull some data for Texas:
focusState = "Texas";
adminDataFields = {"Population", "PopulationDensity",
"MedianHouseholdIncome", "AverageCommute", "PersonsPerHousehold",
"MedianAge"};
These are the Texas counties that have had a COVID-19 report according to the NY Times. The list grows over time (I'm actually interested in the national level), so I have to refresh it as more counties get added; a sketch of that refresh follows the list.
focusStateCounties = {"Harris", "Dallas", "Tarrant", "Bexar",
"Travis", "Fort Bend", "Denton", "Collin", "Galveston", "El Paso",
"Lubbock", "Montgomery", "Brazoria", "Webb", "Hidalgo", "Cameron",
"Brazos", "Williamson", "Jefferson", "Bell", "Smith", "Hays",
"Nueces", "McLennan", "Victoria", "Potter", "Randall", "Wichita",
"Matagorda", "Guadalupe", "Ellis", "Hardin", "Gregg", "Bowie",
"Ector", "Taylor", "Orange", "Midland", "Johnson", "Washington",
"Comal", "Tom Green", "Nacogdoches", "Shelby", "Chambers",
"Kaufman", "Wharton", "Walker", "Bastrop", "Fayette", "Coryell",
"Moore", "Liberty", "Hunt", "Grayson", "Angelina", "Rusk",
"Rockwall", "Donley", "Calhoun", "Val Verde", "Hood", "Hockley",
"Harrison", "Gray", "Waller", "Navarro", "Erath", "Castro",
"Andrews", "Wilson", "Medina", "Kendall", "Hale", "Van Zandt",
"Polk", "Lamar", "DeWitt", "Brown", "Starr", "San Patricio",
"San Augustine", "Panola", "Milam", "Maverick", "Limestone",
"Deaf Smith", "Austin", "Uvalde", "Upshur", "Terry", "Parker",
"Hill", "Henderson", "Colorado", "Cherokee", "Willacy", "Jasper",
"Grimes", "Dawson", "Cass", "Caldwell", "Burnet", "Burleson",
"Atascosa", "Wood", "Tyler", "Titus", "Palo Pinto", "Lavaca",
"Jackson", "Hopkins", "Fannin", "Blanco", "Young", "Wise",
"Trinity", "San Jacinto", "Oldham", "Lynn", "Llano", "Live Oak",
"Goliad", "Eastland", "Comanche", "Clay", "Camp", "Swisher",
"Robertson", "Morris", "Martin", "Leon", "Lee", "Lampasas",
"Kleberg", "Kerr", "Karnes", "Jim Wells", "Hutchinson", "Gonzales",
"Gillespie", "Crane", "Aransas", "Anderson", "Zapata", "Scurry",
"Sabine", "Pecos", "Newton", "Montague", "Mitchell", "McCulloch",
"Mason", "Madison", "Lamb", "Knox", "Jones", "Jack", "Hemphill",
"Hansford", "Hamilton", "Gaines", "Frio", "Franklin", "Floyd",
"Falls", "Dickens", "Delta", "Dallam", "Crosby", "Callahan", "Bee",
"Bandera", "Yoakum", "Hall", "Cooke"};
As an aside, Wolfram is twitchy about its lookups, so if you want to use the NY Times names, some conditioning is required along the lines of:
(* condition NY Times names into forms Wolfram will accept: drop
   "city"/"borough" and punctuation, capitalize every word, then
   squeeze out the whitespace; the {2} levelspec applies this to each
   string in each {county, state} doublet *)
mmaLocations = Map[
  StringDelete[
    Capitalize[
      StringDelete[#, "city" | "borough" | "-" | "'" | "."],
      "AllWords"],
    Whitespace] &,
  countyAndStateDoublets,
  {2}]
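For instance, with a made-up doublet list just to illustrate the effect:
countyAndStateDoublets = {{"St. Mary's", "Louisiana"}, {"Fort Bend", "Texas"}};
(* evaluating the Map above then yields *)
(* {{"StMarys", "Louisiana"}, {"FortBend", "Texas"}} *)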
With that introduction, executing the line below takes over five seconds for me from my office here in Michigan. Multiply 5 seconds by the roughly 3,000 counties in the US and you're looking at over four hours of lookups.
(* time a single county's lookup; the county name is fed in via &@ *)
AbsoluteTiming[
  QuantityMagnitude[
    AdministrativeDivisionData[
      {StringDelete[#1, Whitespace | "." | "-"],
       StringDelete[focusState, Whitespace | "."],
       "UnitedStates"},
      adminDataFields]] &@"Brazos"]
Obviously, once I let this grind for the excruciatingly long time required, I archive it to disk and only refresh my demographics every few days.
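The archiving is nothing fancy; a sketch along these lines (file and variable names are illustrative):
(* save the hard-won demographics to disk ... *)
Export["countyDemographics.mx", countyDemographics];
(* ... and reload it next session instead of re-querying *)
countyDemographics = Import["countyDemographics.mx"];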
What I'm hoping is that I'm missing something obvious here, since I run into this sort of situation fairly regularly: the Wolfram data access seems to be designed for interactive rather than programmatic (Wolfram Language) use.
Any suggestions?
Thanks,
Mark.