views:

303

answers:

7

What algorithms could i use to determine common characters in a set of strings?

To make the example simple, I only care about 2+ characters in a row and if it shows up in 2 or more of the sample. For instance:

  1. 0000abcde0000
  2. 0000abcd00000
  3. 000abc0000000
  4. 00abc000de000

I'd like to know:

00 was used in 1,2,3,4
000 was used in 1,2,3,4
0000 was used in 1,2,3
00000 was used in 2,3
ab was used in 1,2,3,4
abc was used in 1,2,3,4
abcd was used in 1,2
bc was used in 1,2,3,4
bcd was used in 1,2
cd was used in 1,2
de was used in 1,4

A: 

you can use an analysis of distance matrix. Any diagonal movement (no cost change) is an exact match.

nlucaroni
Okay... I don't know what you just said, but it sounds good. :)
Jason
+2  A: 

This is most probably an NP-hard problem. It looks similar to multiple sequence alignment, which is. Basically, you could adapt multidimensional Smith-Waterman (= local sequence alignment) for your needs. There might be a more efficient algorithm, though.

Konrad Rudolph
I don't think it is. I think it can be solved in O(N*M) time (N = total char count, M = max length considered)
BCS
+1  A: 

Do you know the "values" you need to search for ahead of time? Or do you need code to parse the strings, and give you stats like you posted?

Using the Boyer-Moore algorithm is a very quick way to tell if substrings exist (and even locate them), if you know what you are looking for ahead of time.

LarryF
In this case, where multiple patterns are searched, Wu-Manber would be *much* more efficient than single pattern searches like Boyer-Moore.
Konrad Rudolph
+2  A: 

Build a tree where the path through the tree is the letter sequence. Have each node contain a "set" that the string references are added to in passing (or just keep a count). Then keep track of N locations in the word where N is the longest sequence you care about (e.g., start a new handle at each char walking all handles down at each step and abort each handle after N steps)

This would work better with a small, finite and dense alphabet (DNA was the first place I thought to use it).

Edit: If you known in advance the pattern you care about, the above can be altered to work by building the tree ahead of time and then only checking to see if you are on the tree rather than extending it.

an example

input

abc
abd
abde
acc
bde

the tree

a : 4
  b : 3
    c : 1
    d : 2
      e : 1
  c : 1
    c : 1
b : 4
  d : 3
    e : 2
  c : 1
c : 3
  c : 1
d : 3
  e : 2
BCS
I'm not sure I follow, but it sounds interesting... is this illustrated somewhere online that you can give me a link to it?
Jason
Added an example to the answer (sorry I've never actually seen this approche described anywhere)
BCS
This looks similar to a trie (http://en.wikipedia.org/wiki/Trie)
Matt J
Thanks... the example helps, but it only starts at the beginning of the string. In your example, you have 3 occurrences of DE in different branches. How is this useful?
Jason
DE does show up in three different places, however if you follow the root->d->e path, the count you find is 2, because there is 2 places where the "de" substring is found. You don't need to go looking for other places in the tree that include "de" as all of them are also counted at root->d->e.
BCS
p.s. clarification: "de" is found twice in the input and three time in the tree.
BCS
+3  A: 

I'm assuming that this is not homework. (If it is, you're one your own re plagiarism! ;-)

Below is a quick-and-dirty solution. The time complexity is O(m**2 * n) where m is the average string length and n is the size of the array of strings.

An instance of Occurrence keeps the set of indices which contain a given string. The commonOccurrences routine scans a string array, calling captureOccurrences for each non-null string. The captureOccurrences routine puts the current index into an Occurrence for each possible substring of the string it is given. Finally, commonOccurrences forms the result set by picking only those Occurrences that have at least two indices.

Note that your example data has many more common substrings than you identified in the question. For example, "00ab" occurs in each of the input strings. An additional filter to select interesting strings based on content (e.g. all digits, all alphabetic, etc.) is -- as they say -- left as an exercise for the reader. ;-)

QUICK AND DIRTY JAVA SOURCE:

package com.stackoverflow.answers;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CommonSubstringFinder {

    public static final int MINIMUM_SUBSTRING_LENGTH = 2;

    public static class Occurrence implements Comparable<Occurrence> {
        private final String value;
        private final Set<Integer> indices;
        public Occurrence(String value) {
            this.value = value == null ? "" : value;
            indices = new TreeSet<Integer>();
        }
        public String getValue() {
            return value;
        }
        public Set<Integer> getIndices() {
            return Collections.unmodifiableSet(indices);
        }
        public void occur(int index) {
            indices.add(index);
        }
        public String toString() {
            StringBuilder result = new StringBuilder();
            result.append('"').append(value).append('"');
            String separator = ": ";
            for (Integer i : indices) {
                result.append(separator).append(i);
                separator = ",";
            }
            return result.toString();
        }
        public int compareTo(Occurrence that) {
            return this.value.compareTo(that.value);
        }
    }

    public static Set<Occurrence> commonOccurrences(String[] strings) {
        Map<String,Occurrence> work = new HashMap<String,Occurrence>();
        if (strings != null) {
            int index = 0;
            for (String string : strings) {
                if (string != null) {
                    captureOccurrences(index, work, string);
                }
                ++index;
            }
        }
        Set<Occurrence> result = new TreeSet<Occurrence>();
        for (Occurrence occurrence : work.values()) {
            if (occurrence.indices.size() > 1) {
                result.add(occurrence);
            }
        }
        return result;
    }

    private static void captureOccurrences(int index, Map<String,Occurrence> work, String string) {
        final int maxLength = string.length();
        for (int i = 0; i < maxLength; ++i) {
            for (int j = i + MINIMUM_SUBSTRING_LENGTH; j < maxLength; ++j) {
                String partial = string.substring(i, j);
                Occurrence current = work.get(partial);
                if (current == null) {
                    current = new Occurrence(partial);
                    work.put(partial, current);
                }
                current.occur(index);
            }
        }
    }

    private static final String[] TEST_DATA = {
        "0000abcde0000",
        "0000abcd00000",
        "000abc0000000",
        "00abc000de000",
    };
    public static void main(String[] args) {
        Set<Occurrence> found = commonOccurrences(TEST_DATA);
        for (Occurrence occurrence : found) {
            System.out.println(occurrence);
        }
    }

}

SAMPLE OUTPUT: (note that there was actually only one Occurrence per line; I can't seem to prevent the blockquote markup from merging lines)

"00": 0,1,2,3 "000": 0,1,2,3
"0000": 0,1,2 "0000a": 0,1
"0000ab": 0,1 "0000abc": 0,1
"0000abcd": 0,1 "000a": 0,1,2
"000ab": 0,1,2 "000abc": 0,1,2
"000abcd": 0,1 "00a": 0,1,2,3
"00ab": 0,1,2,3 "00abc": 0,1,2,3
"00abc0": 2,3 "00abc00": 2,3
"00abc000": 2,3 "00abcd": 0,1
"0a": 0,1,2,3 "0ab": 0,1,2,3
"0abc": 0,1,2,3 "0abc0": 2,3
"0abc00": 2,3 "0abc000": 2,3
"0abcd": 0,1 "ab": 0,1,2,3 "abc": 0,1,2,3 "abc0": 2,3 "abc00": 2,3
"abc000": 2,3 "abcd": 0,1 "bc": 0,1,2,3 "bc0": 2,3 "bc00": 2,3
"bc000": 2,3 "bcd": 0,1 "c0": 2,3 "c00": 2,3 "c000": 2,3 "cd": 0,1
"de": 0,3 "de0": 0,3 "de00": 0,3
"e0": 0,3 "e00": 0,3

joel.neely
Very impressive, thank you. BTW... not for homework. ;)
Jason
+1  A: 

Look up "Suffix Trees" on the web. Or pick up "Algorithms on Strings, Trees and Sequences" by Dan Gusfield. I don't have the book with me to verify, but the wikipedia page on suffix trees says that page 205 contains a solution for your problem: "finding the longest substrings common to at least k strings in a set".

florin
A: 

You may find a suffix array simpler and more efficient than a suffix tree, depending on how frequent common substrings are in your data -- if they're common enough, you'll need the more sophisticated suffix-array construction algorithm. (The naive method is to just use your library sort function.)

Darius Bacon