
I'm developing an ebook for a publishing company on Data Science. I'm hunting for a dataset that would be appropriate for this. I've seen many tutorials use iris, but I don't want to - I want to use a larger dataset that allows the audience to have some experience with something that's more realistic.

I welcome suggestions for good datasets that would illustrate various aspects of data science - data collection, data munging & cleaning, visualisation, and modelling.

Here are some of the criteria I'd ideally love to see in the dataset:

  1. The dataset needs to be available for private and for-profit use (this is crucial). I'm happy to give credit to those who curated it, though.
  2. The dataset should be long enough that examining it by hand is not an option (which, in my opinion, it is with iris). This also allows me to demonstrate sampling and dealing with at least moderately sized data. I'm thinking about something in the 100k row range.
  3. The dataset can (and should!) be slightly unclean - I'd like to give the audience some sense of which rows should be removed for analysis
  4. The data should be wide enough - have a reasonable number of attributes (10-20, maybe?) to allow for some feature selection, feature elimination, and feature engineering.
  5. A mix of numeric, string, dates, categorical data to demonstrate operations and challenges dealing with each of them
  6. Multiple predicted variables (some classification, some regression) so that I can demonstrate different techniques. One example could be loan data with a 'Defaulted (Yes/No)' as well as 'Income (in USD, EUR, INR, etc.)'
  7. Multiple tables that can be joined or merged on some Key variable

I recognise that a dataset with all of these (very specific) expectations might be difficult to come by. However, anything meeting at least some criteria would be greatly appreciated.
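
To make the measurable criteria above concrete, here is a minimal pandas sketch of how a candidate table could be screened. The helper name `screen_dataset` and its default thresholds are assumptions for illustration, taken from the rough numbers in the list (about 100k rows, at least 10 attributes); licensing and the multi-table requirement still have to be checked by hand.

```python
import pandas as pd
from pandas.api import types as pdt


def screen_dataset(df: pd.DataFrame, min_rows: int = 100_000, min_cols: int = 10) -> dict:
    """Report how one table fares against the measurable criteria.

    Hypothetical helper: the name and default thresholds are illustrative only.
    """
    cols = [df[name] for name in df.columns]
    return {
        # Criterion 2: too long to inspect by hand.
        "long_enough": len(df) >= min_rows,
        # Criterion 4: enough attributes for feature selection.
        "wide_enough": df.shape[1] >= min_cols,
        # Criterion 5: a mix of column types.
        "has_numeric": any(pdt.is_numeric_dtype(c) for c in cols),
        "has_dates": any(pdt.is_datetime64_any_dtype(c) for c in cols),
        "has_text_or_categorical": any(
            pdt.is_string_dtype(c) or isinstance(c.dtype, pd.CategoricalDtype)
            for c in cols
        ),
        # Criterion 3: "slightly unclean" approximated as "has missing values".
        "has_missing_values": bool(df.isna().any().any()),
    }
```

Missing values are only one rough proxy for "unclean"; duplicates, inconsistent category labels, and impossible values would need their own checks.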

1 Answer


I suggest Adventure Works Cycles by Microsoft. It is a fabricated dataset created for educational purposes and is freely available. It comes in both an OLTP database version and a data warehouse version. Since Microsoft uses this dataset for teaching data mining in SSAS, you will not have trouble building scenarios for different algorithms.

Hamideh