
Does anyone know of anyone else who has compiled a machine-readable set of the IRS's SOI Tax Stats - Historical Table 2, i.e., these?

The .xls files available from the IRS are cross-tabulated and contain multiple subcategories, which makes them difficult to translate into something database-friendly for analysis.

Patrick Hoefler
JMcClure
  • 1
    I'd say these already are open data. Importing them into a single file would take maybe an hour of coding in R. – Anthony Damico Apr 24 '14 at 00:35
  • 2
    I think you guys may be underestimating the nuances in the file structure across years. I've been at it with Python for a while now. Of course, it's equally likely you're overestimating my programming ability, but both are sort of off-topic. – JMcClure Apr 24 '14 at 12:14
  • 1
    @Mac: could you tell us what questions you're trying to answer from the data? That might affect recommendations on how the data needs to be converted. – Joe May 07 '14 at 12:25

2 Answers

2

Been slightly inattentive here, but for posterity's sake, I wanted to post the results of my cleaning effort: machine-readable SOI data.

Like most intensive data cleaning I've been a part of, it wasn't the result of a single programming effort. For example, a fair bit of consideration went into how to reconcile the individual annual series. The coverage.csv in the repo shows the series coverage for 1997-2011. Enjoy!
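For readers wondering what a coverage table of this kind involves, here is a minimal sketch. It is not the code from the repo, and the actual layout of coverage.csv may differ. It assumes you already have (year, series name) pairs pulled from each annual file; `normalize_name`, `build_coverage`, and `coverage_rows` are hypothetical helper names, and the series labels in the usage note are made up.

```python
from collections import defaultdict


def normalize_name(name):
    # Assumption: labels for the same series differ across years only by
    # case and stray whitespace. Real files may need a manual mapping too.
    return " ".join(name.lower().split())


def build_coverage(records):
    """Map each normalized series name to the sorted list of years it appears in.

    `records` is an iterable of (year, series_name) pairs.
    """
    coverage = defaultdict(set)
    for year, series in records:
        coverage[normalize_name(series)].add(year)
    return {series: sorted(years) for series, years in coverage.items()}


def coverage_rows(coverage, years):
    """Return a header row plus one 0/1 row per series, ready for csv.writer."""
    header = ["series"] + list(years)
    body = [
        [series] + [int(year in coverage[series]) for year in years]
        for series in sorted(coverage)
    ]
    return [header] + body
```

Feeding it pairs such as `(1997, "Number of returns")` and `(1998, "Number of  Returns")` yields a single series row marked present in both years, which is the sort of reconciliation that has to happen before the annual files can be stacked.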

JMcClure

Using ScraperWiki, I can parse the .xls just fine. Place the URL in the input field; once it has finished uploading, select "download as spreadsheet" and you'll get .csv/.xlsx.

Not an answer to the question(s) you posed directly, but an answer nonetheless. Hope it's useful.
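A CSV exported this way is still crosstabbed, so it needs unpivoting before it is database-friendly. Below is a rough sketch of that step using only the Python standard library. It assumes a hypothetical layout with two header rows (a category row whose merged cells export as blanks, then a measure row) and the state name in the first column; the real SOI files vary by year and will need per-year adjustments. The function names and the sample values are made up for illustration.

```python
import csv
import io


def forward_fill(cells):
    """Repeat the last non-empty value; merged header cells export as blanks."""
    filled, last = [], ""
    for cell in cells:
        cell = cell.strip()
        if cell:
            last = cell
        filled.append(last)
    return filled


def unpivot(csv_text, year):
    """Turn one crosstabbed sheet into long records, one per (state, category, measure)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    categories = forward_fill(rows[0][1:])        # first header row (assumed)
    measures = [c.strip() for c in rows[1][1:]]   # second header row (assumed)
    records = []
    for row in rows[2:]:
        if not row or not row[0].strip():
            continue  # skip blank or spacer rows
        state = row[0].strip()
        for category, measure, value in zip(categories, measures, row[1:]):
            if not value.strip():
                continue  # leave missing cells out rather than guess a value
            records.append({
                "year": year,
                "state": state,
                "category": category,
                "measure": measure,
                "value": float(value.replace(",", "")),
            })
    return records


# Made-up sample in the assumed layout; the numbers are placeholders, not IRS data.
SAMPLE = """\
State,Salaries and wages,,Taxable interest,
,Number,Amount,Number,Amount
Alabama,"1,000",5000,200,300
Alaska,100,900,,40
"""
```

Calling `unpivot(SAMPLE, 2011)` returns one dictionary per non-empty cell, which can then be appended across years and states into a single long table.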

albert
  • 2
    Can you please expand on the answer? – Jeanne Holm May 05 '14 at 03:53
  • sure, but not sure on what part. can you be more specific? – albert May 05 '14 at 13:50
  • um ... explain how to convert crosstabbed .xls files to something for analysis? .. so it's then an answer to the question being asked. If not, this should be a comment and not an answer. – Joe May 06 '14 at 10:20
  • totally lost me on that one @joe....sorry. change this to a comment? i think it does what was asked...but if you disagree, i'll edit – albert May 06 '14 at 23:39
  • @albert : My take on the question was that the second paragraph was the significant part -- not conversion of .xls to .xlsx or .csv – Joe May 07 '14 at 12:25
  • i thought i got that too...csv conversion helps there. i should note you get each table in its own csv, as well as another csv containing all the data....if that's off, sorry, lmk, i'll delete the whole thang – albert May 07 '14 at 16:47
  • The question is not about parsing a single XLS, but normalizing many XLS with possibly subtle changes per state and year and how to cope with that. – ojdo Jun 23 '14 at 11:29