ansaurus

Question

Find the longest common starting substring in a set of strings

Answer 1

+2 A:

Python 2.6 (r26:66714, Oct  4 2008, 02:48:43) 

>>> a = ['interspecies', 'interstelar', 'interstate']

>>> print a[0][:max(
        [i for i in range(min(map(len, a))) 
            if len(set(map(lambda e: e[i], a))) == 1]
        ) + 1]

inters

i for i in range(min(map(len, a))), number of maximum lookups can't be greater than the length of the shortest string; in this example this would evaluate to [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
len(set(map(lambda e: e[i], a))), 1) create an array of the i-thcharacter for each string in the list; 2) make a set out of it; 3) determine the size of the set
[i for i in range(min(map(len, a))) if len(set(map(lambda e: e[i], a))) == 1], include just the characters, for which the size of the set is 1 (all characters at that position were the same ..); here it would evaluate to [0, 1, 2, 3, 4, 5]
finally take the max, add one, and get the substring ...

Note: the above does not work for a = ['intersyate', 'intersxate', 'interstate', 'intersrate'], but this would:

 >>> index = len(
         filter(lambda l: l[0] == l[1], 
             [ x for x in enumerate(
                 [i for i in range(min(map(len, a))) 
                     if len(set(map(lambda e: e[i], a))) == 1]
         )]))
 >>> a[0][:index]
 inters

The MYYN 2009-12-16 17:43:18

Neat. Checking the sizes of vertical sets is smart

mislav 2009-12-16 17:55:30

Answer 2

+1 A:

Here's a solution using regular expressions in Ruby:

def build_regex(string)
  arr = []
  arr << string.dup while string.chop!
  Regexp.new("^(#{arr.join("|")})")
end

def substring(first, *strings)
  strings.inject(first) do |accum, string|
    build_regex(accum).match(string)[0]
  end
end

Yehuda Katz 2009-12-16 17:47:39

I'm interested in how it performs. I can't make up my mind if it would be faster or slower than the reference solution(s). But I know you're already writing benchmarks ;) *waits patiently*

mislav 2009-12-16 17:56:55

You don't need regular expressions, because all you need is linear string comparison. This implementation is particularly inefficient because you build a new regex in each iteration.

Svante 2009-12-16 19:52:05

Answer 3

+1 A:

This is probably not the most concise solution (depends if you already have a library for this), but one elegant method is to use a trie. I use tries for implementing tab completion in my Scheme interpreter:

http://github.com/jcoglan/heist/blob/master/lib/trie.rb

For example:

tree = Trie.new
%w[interspecies interstelar interstate].each { |s| tree[s] = true }
tree.longest_prefix('')
#=> "inters"

I also use them for matching channel names with wildcards for the Bayeux protocol; see these:

http://github.com/jcoglan/faye/blob/master/client/channel.js

http://github.com/jcoglan/faye/blob/master/lib/faye/channel.rb

jcoglan 2009-12-16 17:52:02

You don't need a trie. As soon as any branch would occur, you're done anyway.

Svante 2009-12-16 19:03:20

Really interesting—thanks for sharing!

mislav 2009-12-17 00:00:08

Answer 4

+2 A:

Doesn't seem that complicated if you're not too concerned about ultimate performance:

def common_substring(data)
  data.inject { |m, s| s[0,(0..m.length).find { |i| m[i] != s[i] }.to_i] }
end

One of the useful features of inject is the ability to pre-seed with the first element of the array being interated over. This avoids the nil memo check.

puts common_substring(%w[ interspecies interstelar interstate ]).inspect
# => "inters"
puts common_substring(%w[ feet feel feeble ]).inspect
# => "fee"
puts common_substring(%w[ fine firkin fail ]).inspect
# => "f"
puts common_substring(%w[ alpha bravo charlie ]).inspect
# => ""
puts common_substring(%w[ fork ]).inspect
# => "fork"
puts common_substring(%w[ fork forks ]).inspect
# => "fork"

Update: If golf is the game here, then 67 characters:

def f(d)d.inject{|m,s|s[0,(0..m.size).find{|i|m[i]!=s[i]}.to_i]}end

tadman 2009-12-16 17:58:22

No, this isn't golf. I actually had to look the phrase up to know what it means :) By "elegant code" I meant code that is readable, understandable and clever. Removing whitespace and shortening method/variable names is opposite from readable and understandable, and does not make the solution any more clever.Thanks for the `inject` tip! Wasn't aware of that.

mislav 2009-12-16 23:38:41

I know what you're saying about golfing, but couldn't let Jordan's effort go unchallenged.

tadman 2009-12-17 17:40:49

Answer 5

A:

This is by no means elegant, but if you want concise:

Ruby, 71 chars

def f(a)b=a[0];b[0,(0..b.size).find{|n|a.any?{|i|i[0,n]!=b[0,n]}}-1]end

If you want that unrolled it looks like this:

def f(words)
  first_word = words[0];
  first_word[0, (0..(first_word.size)).find { |num_chars|
    words.any? { |word| word[0, num_chars] != first_word[0, num_chars] }
  } - 1]
end

Jordan 2009-12-16 17:59:28

Apparently this doesn't work on single-element arrays, but otherwise is pretty compact.

tadman 2009-12-16 18:43:24

Add the line `return first_word if words.length==1` as the second line of the function and it should handle single-element arrays as well.

bta 2009-12-16 18:48:07

Answer 6

+7 A:

Ruby one-liner:

l=strings.inject{|l,s| l=l.chop while l!=s[0...l.length];l}

AShelly 2009-12-16 18:00:44

This traverses each string once for each character not in the result. Not very efficient.

Svante 2009-12-16 20:26:26

It might need more iterations than other solutions, but I like it. It's what I consider elegant.

mislav 2009-12-16 23:32:13

Answer 7

+3 A:

You just need to traverse all strings until they differ, then take the substring up to this point.

Pseudocode:

loop for i upfrom 0
     while all strings[i] are equal
     finally return substring[0..i]

Common Lisp:

(defun longest-common-starting-substring (&rest strings)
  (loop for i from 0 below (apply #'min (mapcar #'length strings))
     while (apply #'char=
                  (mapcar (lambda (string) (aref string i))
                          strings))
     finally (return (subseq (first strings) 0 i))))

Svante 2009-12-16 18:02:20

+1, this is the optimal solution

MAK 2009-12-16 21:00:14

Answer 8

+1 A:

I would do the following:

Take the first string of the array as the initial starting substring.
Take the next string of the array and compare the characters until the end of one of the strings is reached or a mismatch is found. If a mismatch is found, reduce starting substring to the length where the mismatch was found.
Repeat step 2 until all strings have been tested.

Here’s a JavaScript implementation:

var array = ["interspecies", "interstelar", "interstate"],
    prefix = array[0],
    len = prefix.length;
for (i=1; i<array.length; i++) {
    for (j=0, len=Math.min(len,array[j].length); j<len; j++) {
        if (prefix[j] != array[i][j]) {
            len = j;
            prefix = prefix.substr(0, len);
            break;
        }
    }
}

Gumbo 2009-12-16 18:18:08

This is inefficient because you likely make comparisons that are superfluous when a later string further shortens the result.

Svante 2009-12-16 19:48:48

Answer 9

+12 A:

In Python:

>>> from os.path import commonprefix
>>> commonprefix('interspecies interstelar interstate'.split())
'inters'

Roberto Bonvallet 2009-12-16 18:22:48

Can't believe it. Python has this in stdlib?

mislav 2009-12-16 23:27:26

Yes, but it's hidden in the `os.path` module, although it is a string manipulation function useful.

Roberto Bonvallet 2009-12-17 00:08:54

Answer 10

A:

In Python I wouldn't use anything but the existing commonprefix function I showed in another answer, but I couldn't help to reinvent the wheel :P. This is my iterator-based approach:

>>> a = 'interspecies interstelar interstate'.split()
>>>
>>> from itertools import takewhile, chain, izip as zip, imap as map
>>> ''.join(chain(*takewhile(lambda s: len(s) == 1, map(set, zip(*a)))))
'inters'

Edit: Explanation of how this works.

zip generates tuples of elements taking one of each item of a at a time:

In [6]: list(zip(*a))  # here I use list() to expand the iterator
Out[6]:
[('i', 'i', 'i'),
 ('n', 'n', 'n'),
 ('t', 't', 't'),
 ('e', 'e', 'e'),
 ('r', 'r', 'r'),
 ('s', 's', 's'),
 ('p', 't', 't'),
 ('e', 'e', 'a'),
 ('c', 'l', 't'),
 ('i', 'a', 'e')]

By mapping set over these items, I get a series of unique letters:

In [7]: list(map(set, _))  # _ means the result of the last statement above
Out[7]:
[set(['i']),
 set(['n']),
 set(['t']),
 set(['e']),
 set(['r']),
 set(['s']),
 set(['p', 't']),
 set(['a', 'e']),
 set(['c', 'l', 't']),
 set(['a', 'e', 'i'])]

takewhile(predicate, items) takes elements from this while the predicate is True; in this particular case, when the sets have one element, i.e. all the words have the same letter at that position:

In [8]: list(takewhile(lambda s: len(s) == 1, _))
Out[8]:
[set(['i']),
 set(['n']), 
 set(['t']), 
 set(['e']), 
 set(['r']), 
 set(['s'])]

At this point we have an iterable of sets, each containing one letter of the prefix we were looking for. To construct the string, we chain them into a single iterable, from which we get the letters to join into the final string.

The magic of using iterators is that all items are generated on demand, so when takewhile stops asking for items, the zipping stops at that point and no unnecessary work is done. Each function call in my one-liner has a implicit for and an implicit break.

Roberto Bonvallet 2009-12-16 19:02:15

Could you decompose the solution and elaborate a bit on iterators used?

mislav 2009-12-17 00:02:56

Gladly. There it is :)

Roberto Bonvallet 2009-12-17 00:29:13

why "izip" and "imap"? at first thought it could be that they are case-insensitive, but then I realized that doesn't make much sense

mislav 2009-12-17 09:46:38

`izip` and `imap` return iterators, while `zip` and `map` return lists. By using lists, all tuples are generated, even though `takewhile` will never see most of them. In Python 3 `zip` and `map` return iterators, so it's not necessary to import them from `itertools`.

Roberto Bonvallet 2009-12-17 15:16:48

Answer 11

A:

Javascript clone of AShelly's excellent answer.

Requires Array#reduce which is supported only in firefox.

var strings = ["interspecies", "intermediate", "interrogation"]
var sub = strings.reduce(function(l,r) { 
    while(l!=r.slice(0,l.length)) {  
        l = l.slice(0, -1);
    }
    return l;
});

Chetan Sastry 2009-12-16 19:09:30

This traverses each string once for every character not in the result; e.g., a string that is the result plus 10 characters gets traversed 10 times. This is not very efficient.

Svante 2009-12-16 20:13:45

Answer 12

+11 A:

It's a matter of taste, but this is a simple javascript version: It sorts the array, and then looks just at the first and last items.

function sharedStart(A){
    var tem1, tem2, s, A= A.slice(0).sort();
    tem1= A[0];
    s= tem1.length;
    tem2= A.pop();
    while(s && tem2.indexOf(tem1)== -1){
     tem1= tem1.substring(0, --s);

    }
    return tem1;
}

var A= ['interspecies', 'interstelar', 'interstate'];

alert(sharedStart(A))

kennebec 2009-12-16 19:26:12

Wow. This blew me away

mislav 2009-12-16 23:43:18

I had another thought, if you enter a one element array, the return should logically be the whole string- it does match itself. Edited in place, replaced tem1=A.shift() with tem1=A[0].

kennebec 2009-12-17 05:36:01

Note that sorting is done in O(n·log n).

Gumbo 2009-12-19 17:51:40

I would upvote this more than once if I could. Really nice.

twk 2010-03-15 17:28:02

Answer 13

A:

It's not code golf, but you asked for somewhat elegant, and I tend to think recursion is fun. Java.

/** Recursively find the common prefix. */
public String findCommonPrefix(String[] strings) {

    int minLength = findMinLength(strings);

    if (isFirstCharacterSame(strings)) {
     return strings[0].charAt(0) + findCommonPrefix(removeFirstCharacter(strings));
    } else {
     return "";
    }
}

/** Get the minimum length of a string in strings[]. */
private int findMinLength(final String[] strings) {
    int length = strings[0].size();
    for (String string : strings) {
     if (string.size() < length) {
      length = string.size();
     }
    }
    return length;
}

/** Compare the first character of all strings. */
private boolean isFirstCharacterSame(String[] strings) {
    char c = string[0].charAt(0);
    for (String string : strings) {
     if (c != string.charAt(0)) return false;
    }

    return true;
}

/** Remove the first character of each string in the array, 
    and return a new array with the results. */
private String[] removeFirstCharacter(String[] source) {
    String[] result = new String[source.length];
    for (int i=0; i<result.length; i++) {
     result[i] = source[i].substring(1); 
    }
    return result;
}

Dean J 2009-12-16 19:31:00

I will not condemn recursion, but you are building a new string for concatenating each character of the result. Otherwise, the approach is ok.

Svante 2009-12-16 20:47:10

Answer 14

+6 A:

My Haskell one-liner:

import Data.List

commonPre :: [String] -> String
commonPre = map head . takeWhile (\(x:xs)-> all (==x) xs) . transpose

EDIT: barkmadley gave a good explanation of the code below. I'd also add that haskell uses lazy evaluation, so we can be lazy about our use of transpose; it will only transpose our lists as far as necessary to find the end of the common prefix.

jberryman 2009-12-16 23:03:26

This looks like very nice code! Can you take a moment to explain it a bit to us who don't know the language? (You don't have to go to large depths)

mislav 2009-12-16 23:47:53

first you transpose the strings. `["abc","abd","aed"] -> ["aaa","bbe","cdd"]`, then you take from this list until something differs `["aaa"]` and then you take the first element of each list in this list (strings are just lists in Haskell) which gets you `"a"`. it's easier to read from right to left, because that is the way that data travels in Haskell's point free style.

barkmadley 2009-12-17 05:49:56

Answer 15

A:

A ruby version based on @Svante's algorithm. Runs ~3x as fast as my first one.

 def common_prefix set 
   i=0
   rest=set[1..-1]
   set[0].each_byte{|c|
     rest.each{|e|return set[0][0...i] if e[i]!=c}
     i+=1
   }
   set
 end

AShelly 2009-12-17 04:19:25

Answer 16

+1 A:

This one is very similar to Roberto Bonvallet's solution, except in ruby.

chars = %w[interspecies interstelar interstate].map {|w| w.split('') }
chars[0].zip(*chars[1..-1]).map { |c| c.uniq }.take_while { |c| c.size == 1 }.join

The first line replaces each word with an array of chars. Next, I use zip to create this data structure:

[["i", "i", "i"], ["n", "n", "n"], ["t", "t", "t"], ...

map and uniq reduce this to [["i"],["n"],["t"], ...

take_while pulls the chars off the array until it finds one where the size isn't one (meaning not all chars were the same). Finally, I join them back together.

Ben Marini 2009-12-17 04:37:26

Didn't know ruby had a take_while—thanks!

mislav 2009-12-17 09:43:08

Answer 17

A:

Just for the fun of it, here's a version written in (SWI-)PROLOG:

common_pre([[C|Cs]|Ss], [C|Res]) :-
  maplist(head_tail(C), [[C|Cs]|Ss], RemSs), !,
  common_pre(RemSs, Res).
common_pre(_, []).

head_tail(H, [H|T], T).

Running:

?- S=["interspecies", "interstelar", "interstate"], common_pre(S, CP), string_to_list(CPString, CP).

Gives:

CP = [105, 110, 116, 101, 114, 115],
CPString = "inters".

Explanation:

(SWI-)PROLOG treats strings as lists of character codes (numbers). All the predicate common_pre/2 does is recursively pattern-match to select the first code (C) from the head of the first list (string, [C|Cs]) in the list of all lists (all strings, [[C|Cs]|Ss]), and appends the matching code C to the result iff it is common to all (remaining) heads of all lists (strings), else it terminates.

Nice, clean, simple and efficient... :)

sharky 2009-12-17 06:13:13

Answer 18

+2 A:

Yet another way to do it: use regex greed.

words = %w(interspecies interstelar interstate)
j = '='
str = ['', *words].join(j)
re = "[^#{j}]*"

str =~ /^
    (?: #{j} ( #{re} ) #{re} )
    (?: #{j}    \1     #{re} )*
$/x

p $1

And the one-liner, courtesy of mislav (50 characters):

p ARGV.join(' ').match(/^(\w*)\w*(?: \1\w*)*$/)[1]

FM 2009-12-17 14:27:14

Nice! A one-line version of this would be `words.join(" ").match(/^(\w*)\w*(?: \1\w*)*$/)[1]`. Note: the regexp would have to be updated slightly to make this work when matching strings that have spaces in them.

mislav 2009-12-18 11:41:18

Answer 19

A:

A javascript version based on @Svante's algorithm:

function commonSubstring(words){
    var iChar, iWord,
        refWord = words[0],
        lRefWord = refWord.length,
        lWords = words.length;
    for (iChar = 0; iChar < lRefWord; iChar += 1) {
        for (iWord = 1; iWord < lWords; iWord += 1) {
            if (refWord[iChar] !== words[iWord][iChar]) {
                return refWord.substring(0, iChar);
            }
        }
    }
    return refWord;
}

Protron 2010-09-14 15:51:27

ansaurus

tags:

views:

answers:

Find the longest common starting substring in a set of strings

Update: my favorite solutions

Ruby, 71 chars

related questions