
I'm looking for a way to get the longest common substring for multiple strings such as

{home/dir1/dir2/jmoasd.txt,home/dir1/dir2/ivbnoxcihv.txt,home/dir1/dir2/siuhgiuchv.txt}

should yield

home/dir1/dir2/

Is there a different way to do this other than the character-by-character comparison until unmatched or LongestCommonSubsequence[] (which only takes 2 elements)?

Thanks!

nqduy
  • For your example the common sequences are the initial sequences, which is quite a lot easier to handle than the general question. Is common initial sequences all you need? – george2079 May 12 '16 at 20:08
  • In general, there is no single longest common substring, because there can be several of them. For example, for "AABB" and "BBAA", both "AA" and "BB" are longest common substrings. This also explains why, with $>2$ strings, you cannot just fold over the list of strings, finding the LCS of only 2 strings at each step. – Vladimir Reshetnikov May 13 '16 at 21:46

1 Answer


You can first compare two of the strings to get their longest common substring, then compare that result with the third string, and keep going until the last string in the list. For input like your example, where the shared part sits in the same place in every string, this gives the longest common substring of all the strings. This can be done with Fold, for example:

ls = {"home/dir1/dir2/jmoasd.txt", "home/dir1/dir2/ivbnoxcihv.txt", 
   "home/dir1/dir2/siuhgiuchv.txt"};
Fold[LongestCommonSubsequence, First@ls, Rest@ls]
(* "home/dir1/dir2/" *)
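
To see what the fold is doing at each step, it may help to swap Fold for FoldList, which keeps the intermediate results. This is just an illustrative trace, not a different method; the second list, bad, is a hypothetical test input taken from the comments to show how a pairwise step can discard the substring that all strings share.

```mathematica
ls = {"home/dir1/dir2/jmoasd.txt", "home/dir1/dir2/ivbnoxcihv.txt",
   "home/dir1/dir2/siuhgiuchv.txt"};

(* FoldList returns the starting value followed by every intermediate result *)
FoldList[LongestCommonSubsequence, First@ls, Rest@ls]
(* {"home/dir1/dir2/jmoasd.txt", "home/dir1/dir2/", "home/dir1/dir2/"} *)

(* hypothetical test input: the first pair shares "bbb", which is longer
   than "xx", so "xx" is dropped before the third string is ever seen *)
bad = {"aaaxxbbb", "bbbxxccc", "cccxxaaa"};
FoldList[LongestCommonSubsequence, First@bad, Rest@bad]
(* {"aaaxxbbb", "bbb", ""} *)
```

So the fold is only safe when each pairwise result still contains the substring common to all strings, as it does for paths with a shared prefix.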

Edit

As JasonB pointed out, this method can fail when the common sequence does not line up at the beginning of each string, because a pairwise result may no longer contain the substring shared by all the strings. In that case, one can use the method by Dr. belisarius:

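(* Convert each string to character codes, then let the pattern matcher find
   the longest run y that occurs in the first list and in every other list *)
longest[ls_] := 
 FromCharacterCode[(ToCharacterCode /@ 
     ls) /. {{___, Longest[y__], ___}, {___, y__, ___} ...} -> {y}]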

ls // longest
(* "home/dir1/dir2/" *)

It also works for cases like:

{"aaaxxbbb", "bbbxxccc", "cccxxaaa"} // longest
(* "xx" *)
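
If the pattern-based version gets slow on longer strings, a more explicit alternative is to try every substring of the shortest string, longest first, and stop at the first length where something occurs in all strings. This is a minimal sketch, not part of the original methods above: the name longestCommonSubstrings is made up for illustration, it assumes version 10.1 or later (for StringPartition and StringContainsQ), and it returns a list because there can be several longest common substrings.

```mathematica
(* hypothetical helper: all longest common substrings of a list of strings *)
longestCommonSubstrings[strs : {__String}] :=
 Module[{shortest, found = {}},
  (* any common substring must fit inside the shortest string *)
  shortest = First@MinimalBy[strs, StringLength];
  Do[
   (* all distinct substrings of the current length, kept if in every string *)
   found = Select[
     DeleteDuplicates@StringPartition[shortest, len, 1],
     Function[s, AllTrue[strs, StringContainsQ[#, s] &]]];
   If[found =!= {}, Break[]],
   {len, StringLength[shortest], 1, -1}];
  found]

longestCommonSubstrings[{"aaaxxbbb", "bbbxxccc", "cccxxaaa"}]
(* {"xx"} *)

longestCommonSubstrings[{"AABB", "BBAA"}]
(* {"AA", "BB"} *)
```

It returns an empty list when the strings have nothing in common. The worst case is still roughly quadratic in the length of the shortest string times the cost of the containment checks, so for very large inputs a suffix-tree style approach would be the better tool.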
xslittlegrass
  • But what if the longest common sequence overall isn't the longest sequence between any two individual elements? – Jason B. May 12 '16 at 16:48
  • @JasonB I'm taking the result of the longest common string and comparing it to the next string. I think at the end, it should be the common longest string for all the elements. – xslittlegrass May 12 '16 at 16:59
  • I had this answer posted but deleted it. It works for OP's example, but it wouldn't work for {"aaaxxbbb", "bbbxxccc", "cccxxaaa"}. Because the longest common substring (LCS) of the three is not contained in the LCS of any pair. – Jason B. May 12 '16 at 17:47
  • @JasonB You are right. I didn't think about that. I have updated my answer. – xslittlegrass May 12 '16 at 19:09
  • Great! I was trying to read the answers here, http://stackoverflow.com/q/5057243/4712538, but it's hard to read pseudocode after the workday is done and the kids are screaming lol. +1 – Jason B. May 12 '16 at 19:11
  • @JasonB wait!?! screaming kids doesn't enhance your ability to concentrate? :) – rcollyer May 12 '16 at 20:14
  • @rcollyer Other people tell how they can work from home.... I go to the office for peace and quiet – Jason B. May 12 '16 at 20:34
  • @JasonB I hear you. I telecommute and there's a toddler running around my house who occasionally comes to my door demanding attention. I suppose it could be worse. :) – rcollyer May 12 '16 at 20:41
  • Fold seems like the way to go. The majority of the data should have the common substring at the beginning. I can just deal with the anomalies afterwards. Thanks, everyone. – nqduy May 13 '16 at 01:34
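
For the case raised in the comments where only the common initial part is needed, a direct prefix check avoids LongestCommonSubsequence altogether. The following is a rough sketch under that assumption; commonPrefix is a hypothetical helper name, not a built-in.

```mathematica
(* hypothetical helper: longest prefix shared by all strings *)
commonPrefix[strs : {__String}] :=
 Module[{n = Min[StringLength /@ strs], k = 0},
  (* advance while character k + 1 is the same in every string *)
  While[k < n && SameQ @@ (StringTake[#, {k + 1}] & /@ strs), k++];
  StringTake[First@strs, k]]

commonPrefix[{"home/dir1/dir2/jmoasd.txt", "home/dir1/dir2/ivbnoxcihv.txt",
  "home/dir1/dir2/siuhgiuchv.txt"}]
(* "home/dir1/dir2/" *)
```

Unlike the fold, this can never return a match from the middle of a string, so entries that do not share the prefix simply shorten the result instead of producing an unrelated substring.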