I would like to compare two human-language, user-supplied texts. Call them Text A and Text B. Each is the length of a fat novel, i.e., around 150,000 words.

Is there a tool that can help me find all exact phrase matches, anywhere in the two texts, that are $n$ words long? (For example: Text A starts with "It was the Best of Times, it was the worst of Times"; Text B has the phrase "worst of Times" buried somewhere. That comes up as a hit with the specification $n=3$, or $2≤n≤6$.)

Hey, thanks in advance to anyone who can give me pointers! I'm totally new to this, so sorry if this is elementary.

du_toit
  • Hi, I am just leaving you some references that you might find helpful. I am not familiar with this area and do not know which methods you could use for the last two points, but at the bottom of the documentation page for Nearest there are a lot of guides on topics like Text Analysis, Natural Language Processing, Linguistic Data, etc. The home page of the documentation also lists different categories, some of which might interest you. – userrandrand Oct 18 '22 at 03:06
  • Welcome to the Mathematica Stack Exchange. Please visit SequenceAlignment documentation pages. The question, as posed, is asking many questions and so the likelihood of getting an answer reduces. Thanks. – Syed Oct 18 '22 at 06:57
  • Thank you, Syed. I have edited the question to narrow it down to the first item. And I'm off to learn about SequenceAlignment. Any other tips are appreciated. – du_toit Oct 18 '22 at 09:46
  • If I was going to attempt this I'd probably start with TextStructure, using the optional second argument "ConstituentStrings", to make a list of the constituent phrases in each text. Then I'd sort them, compare the two sorted lists. Then I'd do some serious thinking ... – High Performance Mark Oct 18 '22 at 10:31
  • If you just want to find matches for a string in another string you can use StringCases like StringCases["hat cat dog hat cat", "hat cat"] – userrandrand Nov 29 '22 at 01:02
  • If you want something more sophisticated maybe the answer here would help – userrandrand Nov 29 '22 at 01:05
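
To make the suggestions above concrete, here is a minimal sketch of one way to list the phrases of exactly $n$ words that occur in both texts. It is not a tested solution for novel-length input. The helper names `phrases` and `sharedPhrases` and the two short sample strings are hypothetical, made up for illustration. It assumes that `TextWords` splits the texts into words in a way that suits your material (it is expected to drop punctuation), and it compares words case-sensitively; wrap each text in `ToLowerCase` if case should be ignored.

```mathematica
(* Hypothetical helper: every run of n consecutive words, joined back into a string *)
phrases[text_String, n_Integer?Positive] :=
  StringRiffle /@ Partition[TextWords[text], n, 1]

(* Hypothetical helper: phrases of exactly n words that occur in both texts *)
sharedPhrases[a_String, b_String, n_Integer?Positive] :=
  Intersection[phrases[a, n], phrases[b, n]]

(* Short stand-ins for Text A and Text B *)
textA = "It was the Best of Times, it was the worst of Times";
textB = "Buried somewhere in this text is the worst of Times";

(* Shared phrases for n = 3 *)
sharedPhrases[textA, textB, 3]

(* Shared phrases for every n with 2 <= n <= 6, keyed by n *)
AssociationMap[sharedPhrases[textA, textB, #] &, Range[2, 6]]
```

`Partition[list, n, 1]` slides a window of $n$ words along the text one word at a time, and `Intersection` returns the sorted, de-duplicated phrases common to both lists. This reports which phrases are shared, not where they occur; something like `StringPosition` on the original texts could locate them afterwards.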

0 Answers