I have the following problem.

I want to extract a short snippet of text around each number I am interested in:

Sentence:

{"My phone number is 1234567 and my social security number is 98765433."}

where I know that I want the following snippets for {1234567, 98765433}:

{"phone number 1234567","social security number 98765433"}

(optimally, "is" would not be included)

Unfortunately, I have no clue how to tackle this problem... Could anyone help me with this?

Thank you very much!

Somos
Coffee_09
  • Lots of ways to approach the problem, depending on what you know about the input data. You should read the help on string manipulation: https://reference.wolfram.com/language/guide/StringOperations.html – dionys Apr 24 '19 at 11:46

1 Answer

Define

text="My phone number is 1234567 and my social security number is 98765433.";

A possible one-liner for this:

StringRiffle[DeleteStopwords[#]]&/@Fold[TextCases,text,{"Clause","NounPhrase"}]

{" phone number 1234567"," social security number 98765433"}
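
The one-liner can be unpacked into separate steps, which may make it easier to adapt. This is only a sketch of the same computation: Fold[f, x, {a, b}] evaluates to f[f[x, a], b], so the Fold call is equivalent to two nested TextCases calls. The intermediate names clauses and phrases are introduced here purely for illustration.

```mathematica
text = "My phone number is 1234567 and my social security number is 98765433.";

(* Step 1: split the sentence into clauses *)
clauses = TextCases[text, "Clause"];

(* Step 2: extract the noun phrases inside each clause;
   together with step 1 this equals Fold[TextCases, text, {"Clause", "NounPhrase"}] *)
phrases = TextCases[clauses, "NounPhrase"];

(* Step 3: drop stopwords from each group of phrases and join the rest with spaces *)
StringRiffle[DeleteStopwords[#]] & /@ phrases
```

Inspecting clauses and phrases on your own text shows where the grammatical parse deviates from what you expect.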

There are many ways to approach this, depending on the structure of your text corpus and how general the solution needs to be. To get started, consult the documentation for the functions used here.

TextCases gives very flexible results, but you will need to read its documentation to build a general application that suits your purpose. Here are some simple starters. Use TextStructure to see which elements you want from a grammatical point of view:

TextStructure[text]

It looks like "NounPhrase" is the relevant element, so you can extract it and clean up a bit afterwards:

cases=DeleteStopwords[TextCases[text, "NounPhrase"]]

{" phone number","1234567"," social security number","98765433"}

For formatting there are also many choices:

TextGrid[Partition[cases, 2]]

or what you wanted exactly:

StringRiffle /@ Partition[cases, 2]

{"phone number 1234567","social security number 98765433"}

A more elegant solution, however, is the one-liner at the top, which applies TextCases to "Clause" as well as "NounPhrase".
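
If the labels are known in advance, a plain string pattern may be enough and avoids the grammatical parse altogether. The following is a minimal sketch under two assumptions that are not part of the answer above: the labels of interest are exactly "phone number" and "social security number", and each label is separated from its number by the literal text " is ". The pattern names lbl and n are illustrative.

```mathematica
text = "My phone number is 1234567 and my social security number is 98765433.";

(* Match a known label, the literal " is ", then a run of digits,
   and rebuild the snippet without "is" *)
StringCases[text,
 (lbl : ("phone number" | "social security number")) ~~ " is " ~~
   (n : (DigitCharacter ..)) :> lbl <> " " <> n]

(* {"phone number 1234567", "social security number 98765433"} *)
```

This is less general than the TextCases approach: it breaks as soon as the wording between label and number changes.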

Vitaliy Kaurov