I have a raw dataset of the following form:

{{label1, sentence1}, {label2, sentence2}, ...}

A sample is available here.

I am trying to apply TF-IDF, which relates the number of times a word appears in a sentence (term frequency) to the number of sentences in which that word appears (document frequency). In other words, it counts how often a word appears in a given sentence but reduces its importance if the word appears in many other sentences.
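For reference, one common textbook form of this weighting is written out below. This is an assumption about the variant: the exact normalization and logarithm used by FeatureExtraction may differ.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

Here $\mathrm{tf}(t, d)$ is the number of times word $t$ appears in sentence $d$, $\mathrm{df}(t)$ is the number of sentences containing $t$, and $N$ is the total number of sentences. For example, with the natural logarithm, a word that appears twice in a sentence and occurs in 10 of 1000 sentences gets the weight $2 \cdot \ln(100) \approx 9.21$.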

So far, I clean my data with:

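(* Load the CSV; each row is {label, sentence} *)
Data = Import["/Users/tom/Desktop/Train_reduced.csv"];

label = Data[[All, 1]];

sentences = Flatten[Data[[All, 2]]];

(* Replace punctuation and digits with spaces *)
cleanSentences = StringReplace[sentences,
   {",", ".", "?", ":", "!", "-", "_", "|", "(", ")", "\"", "{", "}",
     ";", "$", "@", "#", "%", "^", "&", "/", "*", "[", "]", "=",
     "'", DigitCharacter} -> " "];

(* Strip diacritics and stopwords, then lowercase *)
cleanSentences2 = ToLowerCase[DeleteStopwords[RemoveDiacritics[cleanSentences]]];

(* Map each label to its cleaned sentence. Association keys are unique, *)
(* so a repeated label keeps only its last sentence. *)
rule = AssociationThread[label, cleanSentences2];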

The association rule links each label to its cleaned sentence.

I now want to apply TF-IDF to the variable rule, so I use this code:

TFIDF = FeatureExtraction[ Join[First /@ Keys@rule[[All]], Last /@ Keys@rule[[All]]], "TFIDF"]

However, the output complains about a nonatomic expression. What I want is the TF-IDF of each sentence, computed against the whole set of sentences. Any ideas on how to solve this, or on a better way to do it?
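One likely source of the message: Keys@rule returns the labels, which are atomic, so mapping First or Last over them fails. For comparison, here is a minimal sketch that passes the sentence strings themselves to FeatureExtraction. The toy docs list is a hypothetical stand-in for cleanSentences2, and the names extractor and vectors are illustrative only.

```wolfram
(* Hypothetical stand-in for cleanSentences2 *)
docs = {"cat sat mat", "dog sat log", "cat chased dog"};

(* Learn the TF-IDF vocabulary and weights from the plain list of strings *)
extractor = FeatureExtraction[docs, "TFIDF"];

(* Apply the extractor to get one feature vector per sentence *)
vectors = extractor[docs];
```

With the real data, Values[rule] or cleanSentences2 would take the place of docs. Using cleanSentences2 directly keeps every sentence even when labels repeat.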

  • Tom, I am afraid we cannot help you if you do not provide us with an exact definition of ruleData. Nobody wants to clean your dataset when you have done it already. – Henrik Schumacher Apr 08 '19 at 18:14
  • Thank you Henrik for your advice. I have uploaded the current code for cleaning my dataset. – Tom Peterson Apr 08 '19 at 18:31
