
Suppose I have the iris data in the Dataset format.

How do I split the iris data into training and testing sets for machine learning?

For example, how do I transform the data into the form Classify expects?

PhilChang

2 Answers


Let's use the Titanic dataset

tit = ExampleData[{"Dataset", "Titanic"}];

Let's see its columns

tit[Union, Keys]
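Since all rows share the same keys, this should return a one-element list of the column names; from memory (an assumption to verify against your version):

(* {{"class", "age", "sex", "survived"}} *)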

Let's choose an objective

obj = "survived";

Let's add an id to each row

tit = tit[AssociationThread[Range@Length@#, #] &];

Let's create a dataset that separates the features from the objective, to keep this general.

titSplit = 
  tit[All, <|"Features" -> Values@*KeyDrop[obj], 
    "Objective" -> Key[obj]|>];

Let's split the rows randomly to get a test set and a training set.

numTest = 200;
ids = Range@tit[Length];
testIds = ids~RandomSample~numTest;
trainIds = ids~Complement~testIds;

titUnclass = titSplit[<|"Test" -> testIds, "Train" -> trainIds|>];

Let's train the classifier

cfun = titUnclass["Train", Classify@*Values, #Features -> #Objective &];
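As a quick sanity check: since the features were built with Values, the resulting ClassifierFunction takes a bare list of feature values in column order. A hypothetical example, assuming the columns class, age, sex:

cfun[{"1st", 29, "female"}]
(* a predicted value of "survived", True or False *)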

Let's create a new dataset appending the classifications as a column in the Test dataset.

titClass = 
  titUnclass[{"Test" -> 
     Query[All, Append[#, <|"Classified as" -> cfun@#Features|>] &]}];

Or perhaps just have the results in a separate dataset

results = titUnclass["Test", All, cfun@#Features &];

Let's ask for performance measures

cfm = titUnclass["Test", 
   ClassifierMeasurements[cfun, #] & @* Values, 
   #Features -> #Objective &];

cfm["Accuracy"]

(* 0.75 *)

Edit

This is based on the OP @PhilChang's nice suggestion in the comments.

RandomSample works on Datasets

rtit = RandomSample@ExampleData[{"Dataset", "Titanic"}];

Split

obj = "survived"; numTrain = 200;
ctit = rtit[<|"Train" -> (;; numTrain), "Test" -> (numTrain + 1 ;;)|>];
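A quick check of the split sizes; the 1109 below assumes the Titanic example dataset's 1309 rows:

ctit["Train", Length]
(* 200 *)
ctit["Test", Length]
(* 1109 *)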

Train

cfun = ctit["Train", GroupBy[Key[obj] -> KeyDrop[obj]], Values][
   Classify];

Measure

cfm = ctit["Test", GroupBy[Key[obj] -> KeyDrop[obj]], Values][
   ClassifierMeasurements[cfun, #] &];
cfm["Accuracy"]
Rojo
  • I get an Accuracy = 0.63. Naive splitting like this is often biased by the dataset layout, e.g., look at the distribution of "class": trainTit[Map[First] /* Counts, "Features"] --> "1st" -> 200 versus testTit[Map[First]...] --> "1st" -> 123, "2nd" -> 277, "3rd" -> 709. Typically, bootstrap or other resampling methods are used to minimize the effects of clustering. Certainly you'd want to randomize the partition. – alancalvitti Aug 12 '14 at 19:49
  • @alancalvitti I struggled with posting it RandomSampled or not, but I'll edit now. I'll also put both trainTit and testTit inside another dataset. – Rojo Aug 12 '14 at 19:58
  • We need to make this easier, probably with a new function, especially making sure you preserve the frequencies of whatever classes when you do the split. – Taliesin Beynon Aug 12 '14 at 20:30
  • Thank you all, I learned a lot from it. For the moment the best way may be: 1. Randomize the dataset: randSample = RandomSample@titanic; 2. Get train and test data: train = randSample[;; 3000]; test = randSample[3001 ;;]; 3. Transform the dataset for Classify: train = titanic[GroupBy[Key[obj]], All, Values@Normal@KeyDrop[obj]]; – PhilChang Aug 13 '14 at 06:31
  • If only Classify accepted a Dataset as its argument: Classify[mydataset, y, {x1, x2, ..., xp}] – PhilChang Aug 13 '14 at 06:36
  • @PhilChang nice, I hadn't realised RandomSample worked with Dataset arguments. I also like your GroupBy obj approach better, since Classify can also take that format as an argument! I'll edit – Rojo Aug 13 '14 at 07:07
  • Functions for working with Datasets to build test sets for statistics/sampling in general would be very useful. This is quite complex for most cases where you have pre-labeled data. –  Nov 09 '14 at 03:19
  • @TaliesinBeynon given Etienne's comments about Classify performing some internal cross-validation in the answer below, isn't there a risk that by performing the splitting (and potentially folding) prior to using Classify we just do a bad version of something that is already "built in"? http://mathematica.stackexchange.com/questions/67715/how-to-know-the-internal-algorithms-of-functions-like-predict-or-classify – Gordon Coale Feb 04 '15 at 09:45
  • @GordonCoale You can use the ValidationSet option to Classify and Predict to override our internal cross-validation if you have your own test set. Also, @Rojo, note that in 10.0.2 you can use the Classify[data -> out] shorthand to indicate that the column name or number is the one being predicted, so you don't have to split off the features from the output yourself (see the sketch after these comments). – Taliesin Beynon Feb 09 '15 at 18:50
  • @TaliesinBeynon that's neat, thanks! Nice to see you back around – Rojo Feb 09 '15 at 20:16
  • The code above does not work for me as of Mathematica 10.0.2 (Windows 64-bit); has something changed? The line titSplit = tit[All, <|"Features" -> Values@*KeyDrop[obj], "Objective" -> Key[obj]|>] will cause the errors "The argument Identity is not a valid Association or a list of rules." and "The argument Values[(KeyDrop["survived"])[Identity]] is not a valid Association or a list of rules." Also the EDIT section does not work, as ctit = rtit[<|"Train" -> (;; numTrain), "Test" -> (numTrain + 1 ;;)|>] will cause a Transpose error. – gwr Mar 18 '15 at 11:46
  • How come after all these years, there's no built-in function for such a fundamental machine learning operation? – stathisk May 15 '22 at 20:14
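A minimal sketch of the Classify[data -> out] shorthand mentioned by @TaliesinBeynon, assuming Mathematica 10.0.2 or later; the association-style query input is also an assumption:

rtit = RandomSample@ExampleData[{"Dataset", "Titanic"}];
cfun = Classify[rtit[;; 200] -> "survived"]; (* "survived" is the predicted column *)
cfun[<|"class" -> "1st", "age" -> 29, "sex" -> "female"|>]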

I think there is a resource function now...

https://resources.wolframcloud.com/FunctionRepository/resources/TrainTestSplit/
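A minimal usage sketch, assuming the function returns a {train, test} pair with a default split ratio (check the resource page for the exact signature and options):

{train, test} = ResourceFunction["TrainTestSplit"][
   RandomSample@ExampleData[{"Dataset", "Titanic"}]];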

Pat Brooks