Counting the number of instances of one sub-string within a given string within a lower- and upper-bound gap of a second sub-string

Question

Consider a string whose characters are drawn from a fixed alphabet:

 alphabet = {0,1,2};
 numElements = 10^3;
 bigString = StringJoin[Map[ToString, RandomChoice[alphabet, numElements]]];

I provide two strings, string1 and string2. I'd like to count the instances where string2 occurs within a lower- and upper-bound "distance" of string1. By "distance" I mean the number of characters in the gap between the two strings: counting from immediately after the last character of string1 to immediately before the first character of string2 if string1 occurs first, and vice versa if string2 occurs first. There may be multiple instances of string1 and string2; to avoid overcounting, each instance of string2 should count as at most one "hit" (if it's within the lower- and upper-bound cutoff distance of some instance of string1).

Is there a built-in function, or an easy way, to do this?

As Leonid Shifrin requests, let's construct a small case example:

string1 = "1111";
string2 = "1221";

lowerboundDistance = 3;
upperboundDistance = 10;

bigString = "000000111100012210111100000122100000001111000000000001221";

Now, in the above string, there are three instances of string2, so at most we can have an output count of 3. From left-to-right, here are all possible instances where string1 and string2 are separated by a gap of at least 3 characters and at most 10 characters:

[1] "11110001221"
[2] "1111000001221"
[3] "122100000001111"

Notice however that instances [2] and [3] correspond to the same instance of string2, so we only increase the count by 1 after seeing both of these instances. The final count is therefore 2.
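
The tally above can be reproduced by tabulating the gap for every pair of occurrences. The following is a minimal sketch rather than a full solution; the helper name gap is an assumption introduced here for illustration.

```mathematica
(* gap between two {start, end} spans as returned by StringPosition;
   hypothetical helper, negative when the spans overlap *)
gap[{a1_, a2_}, {b1_, b2_}] := Max[b1 - a2, a1 - b2] - 1

bigString = "000000111100012210111100000122100000001111000000000001221";

(* rows: occurrences of string2 ("1221"); columns: occurrences of string1 ("1111") *)
gaps = Outer[gap, StringPosition[bigString, "1221"], StringPosition[bigString, "1111"], 1]
(* {{3, 1, 21}, {17, 5, 7}, {43, 31, 11}} *)

(* a string2 occurrence is a hit if any gap in its row lies in [3, 10] *)
Count[gaps, {___, g_ /; 3 <= g <= 10, ___}]
(* 2 *)
```

The second row contains two in-range gaps (5 and 7), which are instances [2] and [3]; because the count is taken per row, they contribute only once.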

To clarify a particular point, note that:

string1 = "1111";
string2 = "1221";

lowerboundDistance = 3;
upperboundDistance = 10;

bigString = "11110001221001221";

should give an output count of 2, since "11110001221" and "11110001221001221" (abstracted as "1111.........1221") are both instances of string1 and string2 within the lower- and upper-bound gap specifications.

Please provide a small specific self-contained test case, like "these are my input strings, and this is what I want to get". Without it, you make us construct such a case, and most folks here just don't have the time for that. Make sure your test case is small enough to be easily grasped, and non-trivial enough to catch the cases of interest to you. This will significantly increase your chances to get good answers. — Leonid Shifrin, Apr 14 '14 at 17:43
The problem may be clear in your mind, but I find it completely confusing. What do string1 and string2 have to do with your example code? How does bigString come into play? You never mention it in your text description of the problem. — m_goldberg, Apr 14 '14 at 17:59
@m_goldberg You have my apologies, and also the just now posted example. Does this make it any clearer? — S22, Apr 14 '14 at 18:03
So string2 is special in the sense that if the values of string1 and string2 are exchanged, the resulting value of the count may be different? — m_goldberg, Apr 14 '14 at 18:13
In other words, it is not clear why you don't count the [2] and [3] as 2 counts in your first example (same instance of string2), but do count "11110001221" and "11110001221001221" as two separate counts later (same instance of string1). This looks inconsistent to me. — Leonid Shifrin, Apr 14 '14 at 18:15
@m_goldberg Yes string2 is special in the sense you suggest. Ultimately string2 will be the less common of the two substring types. The idea is to get a handle on the number of pairs {string1,string2} that are possible in bigString. — S22, Apr 14 '14 at 18:16
As I see it now, string2 is the target, the item of interest, while string1 is a landmark that only serves to define the gap. Is this correct? — m_goldberg, Apr 14 '14 at 18:21
@m_goldberg Yes, string1 is the landmark for string2. When we're done, there will be some "count" (hopefully) for the number of instances of string2 within the gap size of any landmark. — S22, Apr 14 '14 at 18:23
OK, I think it's clear enough now that @LeonidShifrin can solve it for us :) — m_goldberg, Apr 14 '14 at 18:26
@m_goldberg Well, if he does so he certainly has my gratitude. — S22, Apr 14 '14 at 18:31

score 4 · Accepted Answer · answered Apr 15 '14 at 05:05

The code below identifies "hits" using the requested rules. It works by computing the intersection of two sets of positions: 1) the positions of "string2" and 2) the allowable positions determined by leading and trailing "windows" relative to each occurrence of "string1". The count function returns the number of hits, and positions gives the positions of the hits, if desired.

count[test_, floater_, anchor_, min_, max_] :=
    Length @ positions[test, floater, anchor, min, max]

positions[test_, floater_, anchor_, min_, max_] :=
  anchorPositions[test, anchor] ~Intersection~
  allowedPositions[test, floater, anchor, min, max]

anchorPositions[test_, anchor_] := StringPosition[test, anchor][[All, 1]]

allowedPositions[testString_, floater_, anchor_, min_, max_] :=
  window[#, StringLength@anchor, min, max]& /@ StringPosition[testString, floater] //
  Union @@ # &

window[floaterPosition:{_, _}, anchorLength_, min_, max_] :=
  Flatten @ Apply[
    Range
  , floaterPosition + {{-max, -min} - anchorLength, 1 + {min, max}}
  , {1}
  ]

Explanation

For purposes of discussion, let's call the string that we wish to count the "anchor" string. We will use the term "floater" string to refer to the other string that must appear within a stipulated distance from an anchor. We will use the following test data from the question:

$floater = "1111";
$anchor = "1221";

$min = 3;
$max = 10;

$testString = "000000111100012210111100000122100000001111000000000001221";

We will start by defining a function that returns the start positions of the anchors within a test string.

anchorPositions[test_, anchor_] := StringPosition[test, anchor][[All, 1]]

For example:

anchorPositions[$testString, $anchor]
(* {14, 28, 54} *)

Next, we define a function that will return the allowable positions for an anchor given that a particular floater is in a known position:

window[floaterPosition:{_, _}, anchorLength_, min_, max_] :=
  Flatten @ Apply[
    Range
  , floaterPosition + {{-max, -min} - anchorLength, 1 + {min, max}}
  , {1}
  ]

For example, if we know the position of a floater in a test string, then the allowable start positions for an anchor lie in two windows. One window is before the floater and the other after. If a floater extends between positions 20-23 and the distance band is 3-10, then an anchor of length 4 must fall in either a leading window covering positions 6-13 or a trailing window covering positions 27-34 (taking into account the min/max spacing and the length of the anchor string).

window[{20, 23}, 4, 3, 10]
(* {6, 7, 8, 9, 10, 11, 12, 13, 27, 28, 29, 30, 31, 32, 33, 34} *)
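
To make the arithmetic inside window easier to follow, the two ranges can be written out separately. This is only an illustrative sketch; leadingWindow and trailingWindow are hypothetical names, not part of the code above.

```mathematica
(* start positions for an anchor that ends min..max characters
   before the floater starts at s *)
leadingWindow[{s_, e_}, anchorLength_, min_, max_] :=
  Range[s - max - anchorLength, s - min - anchorLength]

(* start positions for an anchor that begins min..max characters
   after the floater ends at e *)
trailingWindow[{s_, e_}, anchorLength_, min_, max_] :=
  Range[e + 1 + min, e + 1 + max]

leadingWindow[{20, 23}, 4, 3, 10]
(* {6, 7, 8, 9, 10, 11, 12, 13} *)

trailingWindow[{20, 23}, 4, 3, 10]
(* {27, 28, 29, 30, 31, 32, 33, 34} *)
```

Joining the two results gives the same list as window[{20, 23}, 4, 3, 10].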

This helper function can be used to determine all allowable positions for anchors in a test string, being the union of the windows associated with all floater strings:

allowedPositions[testString_, floater_, anchor_, min_, max_] :=
  window[#, StringLength@anchor, min, max]& /@ StringPosition[testString, floater] //
  Union @@ # &

Example:

allowedPositions[$testString, $floater, $anchor, $min, $max]
(* {-7, -6, -5, -4, -3, -2, -1, 0, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15,
    16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 46, 47,
    48, 49, 50, 51, 52, 53} *)
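
The non-positive entries arise because the leading window of the first floater extends past the start of the string. They are harmless here, since they can never coincide with a real anchor position, but they can be trimmed if a cleaner list is wanted. A minimal sketch, with clipPositions as a hypothetical helper:

```mathematica
(* keep only start positions at which an anchor of the given length
   fits inside a string of length testLength *)
clipPositions[allowed_List, testLength_Integer, anchorLength_Integer] :=
  Select[allowed, 1 <= # <= testLength - anchorLength + 1 &]

(* first two windows from the output above; the test string has 57 characters *)
clipPositions[Join[Range[-7, 0], Range[5, 12]], 57, 4]
(* {5, 6, 7, 8, 9, 10, 11, 12} *)
```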

Now we have everything we need to determine the positions of the allowed anchors within a test string. They are simply the intersection of two position sets: the set of actual anchor positions and the set of allowable anchor positions:

positions[test_, floater_, anchor_, min_, max_] :=
  anchorPositions[test, anchor] ~Intersection~
  allowedPositions[test, floater, anchor, min, max]

Example:

positions[$testString, $floater, $anchor, $min, $max]
(* {14, 28} *)

Finally, the anchor count is the length of this position list:

count[test_, floater_, anchor_, min_, max_] :=
    Length @ positions[test, floater, anchor, min, max]

Example:

count[$testString, $floater, $anchor, $min, $max]
(* 2 *)

Here are the test cases from the question:

count[
  "000000111100012210111100000122100000001111000000000001221"
, "1111", "1221", 3, 10
]
(* 2 *)

count[
  "11110001221001221"
, "1111", "1221", 3, 10
]
(* 2 *)

Nice answer and explanation, and pretty quick. +1 — ciao, Apr 15 '14 at 06:41

Leonid Shifrin · Answer 2 · 2014-04-14T19:57:20.197

This might be somewhat inefficient, but seems to work:

ClearAll[count];
count[        
    big_String,
    fstr_String,
    secstr_String,
    lb_Integer?Positive,
    ub_Integer?Positive
]:=
    Module[{pos,pairQ},            
        pairQ=
            Function[                    
                {fst,sec},
                Boole[                        
                    lb<=First[sec]-Last[fst]-1<=ub
                    ||
                    lb<=First[fst]-Last[sec]-1<=ub
                ]
            ];
        pos=(StringPosition[big,#1]&)/@{secstr,fstr};
        Total[Unitize[Total[Outer[pairQ,Sequence@@pos,1],{2}]]]
    ]

The basic idea is that a single occurrence of string2 may form valid pairs with several occurrences of string1, but it should contribute at most 1 to the count. So we just add 1 for every occurrence of string2 that has at least one valid pair.
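
To see how the last line of count collapses pairs into hits, here is the pair matrix for the first example from the question, written out by hand (an illustrative sketch; the variable name pairs is introduced only for this demonstration):

```mathematica
(* rows: the three "1221" occurrences; columns: the three "1111" occurrences;
   1 marks a pair whose gap lies in [3, 10] *)
pairs = {{1, 0, 0}, {0, 1, 1}, {0, 0, 0}};

Total[pairs, {2}]
(* {1, 2, 0}: number of valid partners per string2 occurrence *)

Unitize[Total[pairs, {2}]]
(* {1, 1, 0}: each string2 occurrence counted at most once *)

Total[Unitize[Total[pairs, {2}]]]
(* 2 *)
```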

For example:

count[bigString, string1, string2, 3, 10]

(* 2 *)

count["11110001221001221", string1, string2, 3, 10]

(* 2 *)

The obvious place for optimization is the Outer call, since not all pairs of string1 and string2 occurrences in the main string need to be considered.
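
One possible way to avoid the full pairwise comparison is to merge the allowed start positions around every string1 occurrence into a single Interval and test each string2 start against it. This is a sketch under the assumption that lb <= ub, not a benchmarked replacement; windowsFor and countViaIntervals are hypothetical names.

```mathematica
(* allowed {lo, hi} ranges of start positions for a string of length len
   placed lb..ub characters before or after the span {s, e} *)
windowsFor[{s_, e_}, len_, lb_, ub_] :=
  {{s - ub - len, s - lb - len}, {e + 1 + lb, e + 1 + ub}}

countViaIntervals[big_String, s1_String, s2_String, lb_Integer, ub_Integer] :=
  Module[{len = StringLength[s2], allowed},
    (* Interval merges overlapping ranges automatically *)
    allowed = Interval @@ Join @@
      (windowsFor[#, len, lb, ub] & /@ StringPosition[big, s1]);
    (* count string2 occurrences whose start lies in an allowed range *)
    Count[StringPosition[big, s2][[All, 1]],
      _?(IntervalMemberQ[allowed, #] &)]
  ]

countViaIntervals[
  "000000111100012210111100000122100000001111000000000001221",
  "1111", "1221", 3, 10]
(* 2 *)

countViaIntervals["11110001221001221", "1111", "1221", 3, 10]
(* 2 *)
```

Swapping the roles of the two strings in the first example gives 3 rather than 2, which illustrates the asymmetry discussed in the comments on the question.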

This is quite nice, thank you for this. — S22, Apr 15 '14 at 09:20
