4

I have the following file:

Fulltext = {"Apple Hospitality REIT, Inc. (APLE) Market Cap: $4.04B", "Arena Pharmaceuticals, Inc. (ARNA) Market Cap: $749.57M ", "Argo Group International Holdings, Ltd. (AGII) Market Cap:$1.81B ", "Armstrong Flooring, Inc. (AFI) Market Cap: $475.81M ", "Atlas Financial Holdings, Inc. (AFH) Market Cap: $183.47M ", "Avis Budget Group, Inc. (CAR) Market Cap: $2.6B "};

I just want to extract the following:

WhatIwant = {APLE, ARNA, AGII, AFI, AFH, CAR};

Although this example has only six extractions, I am looking for a more generic function that can be used to extract any number of strings that are within parentheses. Any help is greatly appreciated.

For clarity: I am importing a text file from a website using the following code:

importeddata = 
  Import["http://www.nasdaq.com/earnings/earnings-calendar.aspx?date=\
2017-Aug-07", "Data"];
calanderdate = Flatten[importeddata[[2, 1]], 2];
calanderdate2 = calanderdate[[4 ;; Length[calanderdate] - 1]];

calanderdate2 contains many company names, each followed by its ticker symbol in parentheses. I want to extract only the ticker symbols. Thanks

ramesh
  • Try something like StringCases[str, "("~~s:Except["("|")"]..~~")":>s] – b3m2a1 Aug 06 '17 at 05:37
  • b3m2a1, I think it serves my need. If this is not a dumb question or a duplicate, can you please post it as an answer? Thank you for the prompt reply. – ramesh Aug 06 '17 at 05:46
  • To be sure you should also use Shortest, and if I remember correctly, the most efficient way uses RegularExpression. – JEM_Mosig Aug 06 '17 at 05:50
  • @JEM_Mosig Shortest isn't actually necessary because of the two Except cases, and a raw Shortest fails. I'm not surprised RegularExpression wins though. Post an answer with that and I'll upvote it. – b3m2a1 Aug 06 '17 at 05:51

4 Answers

6

Here is a timing comparison of equivalent solutions (Mathematica 11.1.1 on Windows 7 x64):

str = StringRepeat["asdasd (((( ) asd) aasdasd)  (a ) asd) (asd) )asdasd )", 1000000];

StringCases[str, "(" ~~ s : RegularExpression["[^()]*"] ~~ ")" :> s]; // AbsoluteTiming

(* => {2.51031, Null} *)

StringCases[str, "(" ~~ s : Except["(" | ")"] ... ~~ ")" :> s]; // AbsoluteTiming

(* => {2.70701, Null} *)

StringCases[str, "(" ~~ s : RegularExpression["[^()]"] ... ~~ ")" :> s]; // AbsoluteTiming

(* => {3.08826, Null} *)

StringCases[str, RegularExpression["\\(([^()]*)\\)"] -> "$1"]; // AbsoluteTiming

(* => {5.29573, Null} *)

StringCases[str, RegularExpression["(?<=\\()[^()]*(?=\\))"]]; // AbsoluteTiming

(* => {7.71658, Null} *)

It is surprising that substring extraction via a numbered capturing group (RegularExpression["\\(([^()]*)\\)"] -> "$1") is so much slower than the equivalent substring extraction via a named capturing group ("(" ~~ s : RegularExpression["[^()]*"] ~~ ")" :> s).

If you don't want to get an empty string "" from "()", replace * with + in the regexes:

StringCases["()", "(" ~~ s : RegularExpression["[^()]+"] ~~ ")" :> s]

(* => {} *)
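
As a hedged sketch, the fastest pattern from the comparison above could be applied to the data from the question as follows. The list Fulltext is copied from the question; passing the whole list to StringCases and flattening the result is an assumption about the desired output shape, not something that was timed above.

```mathematica
Fulltext = {"Apple Hospitality REIT, Inc. (APLE) Market Cap: $4.04B",
   "Arena Pharmaceuticals, Inc. (ARNA) Market Cap: $749.57M ",
   "Argo Group International Holdings, Ltd. (AGII) Market Cap:$1.81B ",
   "Armstrong Flooring, Inc. (AFI) Market Cap: $475.81M ",
   "Atlas Financial Holdings, Inc. (AFH) Market Cap: $183.47M ",
   "Avis Budget Group, Inc. (CAR) Market Cap: $2.6B "};

(* StringCases on a list returns one list of matches per string, *)
(* so Flatten merges them; + instead of * skips an empty "()" *)
Flatten[StringCases[Fulltext,
  "(" ~~ s : RegularExpression["[^()]+"] ~~ ")" :> s]]

(* => {"APLE", "ARNA", "AGII", "AFI", "AFH", "CAR"} *)
```

Note that a company name containing its own parenthesized part would contribute an extra match, so the result may need filtering for real data.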
Alexey Popkov
4

Here's another option:

StringCases["asdasd (((( ) asd) aasdasd)  (a ) asd) (asd) )asdasd )", 
 "(" ~~ s : Except["(" | ")"] .. ~~ ")" :> s]

{" ", "a ", "asd"}

Note why both of the Alternatives in the Except are needed:

StringCases["asdasd (((( ) asd) aasdasd)  (a ) asd) (asd) )asdasd )", 
 "(" ~~ s : Except["("] .. ~~ ")" :> s]

{" ) asd) aasdasd", "a ) asd", "asd) )asdasd "}

StringCases["asdasd (((( ) asd) aasdasd)  (a ) asd) (asd) )asdasd )", 
 "(" ~~ s : Except[")"] .. ~~ ")" :> s]

{"((( ", "a ", "asd"}
b3m2a1
  • Maybe you find Shortest opaque because of this. Once you have understood that Shortest is "oriented" (i.e. not balanced, not symmetrical), you feel far better. – andre314 Aug 06 '17 at 06:23
  • @andre it's mostly that a basic Shortest[__] does not always return what one expects. – b3m2a1 Aug 06 '17 at 06:25
  • andre is right: Shortest is translated into the corresponding lazy quantifier (*? or +?), which explains the observed "unexpected" behaviour. There was a dedicated question on the difference in behaviour of Shortest in usual patterns and string patterns. – Alexey Popkov Aug 06 '17 at 07:23
3

One possibility is to use Shortest:

StringCases["asdasd (((( ) asd) aasdasd) (a ) asd) (asd) )asdasd )",Shortest[ "("~~s:Except["("]..~~")"]:>s]

{" ", "a ", "asd"}

Though I'm not aware of any advantage to using this form (with Shortest) rather than @b3m2a1's solution (with Except[Alternatives[...]]).
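
According to the comments under the previous answer, Shortest in a string pattern is translated into a lazy quantifier. Assuming that is the case, the pattern above should behave roughly like the following regular expression, where +? is the lazy counterpart of "..". This is only a sketch of the correspondence, not the exact expression Mathematica generates internally.

```mathematica
str = "asdasd (((( ) asd) aasdasd) (a ) asd) (asd) )asdasd )";

(* [^(]+? : one or more characters other than "(", as few as possible, *)
(* so each match stops at the first ")" that follows *)
StringCases[str, RegularExpression["\\(([^(]+?)\\)"] -> "$1"]

(* => {" ", "a ", "asd"} *)
```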

andre314
3

I had a similar problem when looking for citations in text and found the linked thread "Shortest and string patterns" useful. I use the following function to do the job:

midString[text_String, {leftDelimiter_, rightDelimiter_}] :=
 Union @ Flatten @ StringCases[text,
   (* :> instead of -> so that a global value of x cannot leak into the result *)
   leftDelimiter ~~ Shortest[x__] ~~ rightDelimiter :> x, Overlaps -> False]

It can also be used to extract a filename from a path that also contains the directory name. In your case:

midString[#, {"(", ")"}] & /@ Fulltext // Flatten

(* {"APLE","ARNA","AGII","AFI","AFH","CAR"} *)

To overcome the problem mentioned in the linked thread you can use a more sophisticated version:

citations[text_String, {leftDelimiter_, rightDelimiter_}] :=
 StringCases[
  text, (leftDelimiter ~~ mid___ ~~ rightDelimiter) /; 
   StringFreeQ[mid, {leftDelimiter, rightDelimiter}]]
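
A brief usage sketch, under the assumption that the definition above is evaluated as written: because the pattern has no replacement rule, the matches should include the delimiters themselves. The variant innerText below is a hypothetical helper, not part of the original answer, that returns only the text between the delimiters.

```mathematica
citations[text_String, {leftDelimiter_, rightDelimiter_}] :=
 StringCases[
  text, (leftDelimiter ~~ mid___ ~~ rightDelimiter) /;
   StringFreeQ[mid, {leftDelimiter, rightDelimiter}]]

citations["Avis Budget Group, Inc. (CAR) Market Cap: $2.6B", {"(", ")"}]
(* => {"(CAR)"}  (delimiters are kept) *)

(* hypothetical variant: same condition, but return only the inner text *)
innerText[text_String, {leftDelimiter_, rightDelimiter_}] :=
 StringCases[
  text, ((leftDelimiter ~~ mid___ ~~ rightDelimiter) /;
     StringFreeQ[mid, {leftDelimiter, rightDelimiter}]) :> mid]

innerText["Avis Budget Group, Inc. (CAR) Market Cap: $2.6B", {"(", ")"}]
(* => {"CAR"} *)
```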
mgamer