I am looking to weight the words of a random text by their occurrence (TF-IDF) and to display the result as a matrix. I saw there is a project on this, but I would like to know whether it is possible to change the visualization: https://demonstrations.wolfram.com/TermWeightingWithTFIDF/
3 Answers
"I am looking to weight the words (TF-IDF) of a random text by his occurrence and showing that on a matrix."
Weighting words by frequency of occurrence is the same as normalizing the columns of the matrix. To do this, let data be your term/document matrix, and sum the columns using Total (replacing zero sums by 1 to avoid division by zero). The final line divides each column by the appropriate sum.
data = RandomInteger[{0, 1}, {5, 15}];  (* random 5x15 term/document matrix *)
norm = Total[data] /. {0 -> 1};         (* column sums; 0 -> 1 avoids division by zero *)
tab = Transpose[Table[Transpose[data][[i]]/norm[[i]], {i, Length[norm]}]];
TableForm[tab]
You'll have to decide what you want it to look like.
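For reference, the last step can be written more compactly, since Divide threads the rows of a matrix over a vector of the same length:
tab = Transpose[Transpose[data]/norm]  (* divides each column of data by its sum *)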
This is the code I have come up with so far:
TFIDF = FeatureExtraction[Join[First /@ Keys@ruleData[[All]], Last /@ Keys@ruleData[[All]]], "TFIDF"]
but the output tells me there is a nonatomic expression. My variable ruleData above is an association between labels and sentences (1 -> sentence 1, 2 -> sentence 2, 3 -> sentence 3, ...), as I wanted to apply TF-IDF to each sentence and to the collection of sentences as a whole.
– Tom Peterson Apr 08 '19 at 13:32
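The "nonatomic expression" message is most likely raised by First /@ Keys@ruleData: if the keys of the association are integers, First[1] has nothing to take the first element of. A minimal sketch of the apparent intent, assuming a hypothetical ruleData association of integer labels to sentence strings, is to pass the sentence strings themselves to FeatureExtraction:
ruleData = <|1 -> "The cat sat.", 2 -> "The dog ran.", 3 -> "A cat ran."|>;  (* hypothetical toy data *)
sentences = Values[ruleData];                (* just the sentence strings *)
extractor = FeatureExtraction[sentences, "TFIDF"];
extractor[sentences]                         (* one TF-IDF vector per sentence *)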
From the comments of the question (Tom Peterson, Apr 07 '19 at 13:26):
So far I have only cleaned the data by removing stopwords and punctuation. The next steps I would take are to:
1. separate my dataset (one big text) into subsets representing {sentence 1}, {sentence 2}, ..., maybe by attributing an ID to each sentence;
2. take a unique list of all the words in the text;
3. take each sentence and count, for each word, whether the word appears in the sentence;
4. put the result in a table like the one above.
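A minimal sketch of these four steps with built-in functions only (the sample text is hypothetical):
text = "The cat sat on the mat. The dog chased the cat. A bird watched them.";
sentences = TextSentences[text];                        (* 1. split into sentences *)
words = DeleteStopwords[TextWords[ToLowerCase[#]]] & /@ sentences;
vocab = Union @@ words;                                 (* 2. unique word list *)
counts = Table[Count[w, v], {w, words}, {v, vocab}];    (* 3. word counts per sentence *)
TableForm[counts, TableHeadings -> {Range[Length@sentences], vocab}]  (* 4. tabulate *)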
This process is more or less followed in the blog post: "The Great conversation in USA presidential speeches".
The crucial step is making the document-word contingency matrix. Generally speaking, for this you can use the function CrossTabulate from the package CrossTabulate.m. For more details see the blog post "Contingency tables creation examples". The TF-IDF and related measures can be computed with the package DocumentTermMatrixConstruction.m.
(These packages are used in the blog post "The Great conversation in USA presidential speeches".)
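For illustration only (this is not the API of those packages), TF-IDF can also be computed from such a document-term matrix with built-ins alone; here is a sketch of one common variant, raw term frequency times log inverse document frequency, reusing counts and vocab from the sketch above:
dtm = counts;                      (* rows: sentences, columns: vocabulary terms *)
n = Length[dtm];                   (* number of documents *)
df = Total[Unitize[dtm]];          (* document frequency of each term *)
idf = Log[n/df];                   (* inverse document frequency *)
tfidf = # idf & /@ dtm;            (* scale each row's term counts by idf *)
TableForm[N[tfidf], TableHeadings -> {Range[n], vocab}]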

