I have a raw dataset of the form:
{{label1,sentence1},{label2,sentence2}.....}
A sample is available here (and also here).
I am trying to apply TF-IDF, that is, to weight the number of times a word appears in a sentence (term frequency) against the number of sentences in which that word appears (document frequency). In other words, I want to count how often a word occurs in a given sentence, but reduce its importance if it also occurs in many other sentences.
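For reference, the weighting I have in mind is roughly the common textbook form below. This is only my assumption of the definition; FeatureExtraction may well use a different variant or normalization. Here $t$ is a word, $s$ a sentence, $N$ the total number of sentences, $\mathrm{tf}(t,s)$ the number of times $t$ occurs in $s$, and $\mathrm{df}(t)$ the number of sentences containing $t$:

$$\mathrm{tfidf}(t,s) = \mathrm{tf}(t,s)\cdot\log\frac{N}{\mathrm{df}(t)}$$

For example, with the natural logarithm, a word that occurs twice in a sentence and appears in 10 out of 1000 sentences would get $2\cdot\log(1000/10)\approx 9.21$, whereas a word that appears in every sentence would get $0$.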
So far, I clean my data with:
Data = Import["/Users/tom/Desktop/Train_reduced.csv"];
label = Data[[All, 1]];
sentences = Flatten[Data[[All, 2]]];
cleanSentences = StringReplace[sentences,
   {",", ".", "?", ":", "!", "-", "_", "|", "(", ")", "\"", "{", "}",
     ";", "$", "@", "#", "%", "^", "&", "/", "*", "[", "]", "=",
     "'", DigitCharacter} -> " "];
cleanSentences2 = ToLowerCase[DeleteStopwords[RemoveDiacritics[cleanSentences]]];
rule = AssociationThread[label, cleanSentences2];
The last line is meant to associate each label with its cleaned sentence.
I now want to apply TF-IDF to this variable rule, so I use this code:
TFIDF = FeatureExtraction[ Join[First /@ Keys@rule[[All]], Last /@ Keys@rule[[All]]], "TFIDF"]
but the output complains about a nonatomic expression. I wanted to apply TF-IDF to each sentence, with the document frequencies taken over the whole set of sentences. Any ideas on how to solve this, or how to do it better?

ruleData. Nobody wants to clean your dataset when you have done it already. – Henrik Schumacher Apr 08 '19 at 18:14