I'm assuming that this is not homework. (If it is, you're one your own re plagiarism! ;-)
Below is a quick-and-dirty solution. The time complexity is O(m**2 * n)
where m
is the average string length and n
is the size of the array of strings.
An instance of Occurrence
keeps the set of indices which contain a given string. The commonOccurrences
routine scans a string array, calling captureOccurrences
for each non-null string. The captureOccurrences
routine puts the current index into an Occurrence
for each possible substring of the string it is given. Finally, commonOccurrences
forms the result set by picking only those Occurrences
that have at least two indices.
Note that your example data has many more common substrings than you identified in the question. For example, "00ab"
occurs in each of the input strings. An additional filter to select interesting strings based on content (e.g. all digits, all alphabetic, etc.) is -- as they say -- left as an exercise for the reader. ;-)
QUICK AND DIRTY JAVA SOURCE:
package com.stackoverflow.answers;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
public class CommonSubstringFinder {
public static final int MINIMUM_SUBSTRING_LENGTH = 2;
public static class Occurrence implements Comparable<Occurrence> {
private final String value;
private final Set<Integer> indices;
public Occurrence(String value) {
this.value = value == null ? "" : value;
indices = new TreeSet<Integer>();
}
public String getValue() {
return value;
}
public Set<Integer> getIndices() {
return Collections.unmodifiableSet(indices);
}
public void occur(int index) {
indices.add(index);
}
public String toString() {
StringBuilder result = new StringBuilder();
result.append('"').append(value).append('"');
String separator = ": ";
for (Integer i : indices) {
result.append(separator).append(i);
separator = ",";
}
return result.toString();
}
public int compareTo(Occurrence that) {
return this.value.compareTo(that.value);
}
}
public static Set<Occurrence> commonOccurrences(String[] strings) {
Map<String,Occurrence> work = new HashMap<String,Occurrence>();
if (strings != null) {
int index = 0;
for (String string : strings) {
if (string != null) {
captureOccurrences(index, work, string);
}
++index;
}
}
Set<Occurrence> result = new TreeSet<Occurrence>();
for (Occurrence occurrence : work.values()) {
if (occurrence.indices.size() > 1) {
result.add(occurrence);
}
}
return result;
}
private static void captureOccurrences(int index, Map<String,Occurrence> work, String string) {
final int maxLength = string.length();
for (int i = 0; i < maxLength; ++i) {
for (int j = i + MINIMUM_SUBSTRING_LENGTH; j < maxLength; ++j) {
String partial = string.substring(i, j);
Occurrence current = work.get(partial);
if (current == null) {
current = new Occurrence(partial);
work.put(partial, current);
}
current.occur(index);
}
}
}
private static final String[] TEST_DATA = {
"0000abcde0000",
"0000abcd00000",
"000abc0000000",
"00abc000de000",
};
public static void main(String[] args) {
Set<Occurrence> found = commonOccurrences(TEST_DATA);
for (Occurrence occurrence : found) {
System.out.println(occurrence);
}
}
}
SAMPLE OUTPUT: (note that there was actually only one Occurrence per line; I can't seem to prevent the blockquote markup from merging lines)
"00": 0,1,2,3
"000": 0,1,2,3
"0000": 0,1,2
"0000a": 0,1
"0000ab": 0,1
"0000abc": 0,1
"0000abcd": 0,1
"000a": 0,1,2
"000ab": 0,1,2
"000abc": 0,1,2
"000abcd": 0,1
"00a": 0,1,2,3
"00ab": 0,1,2,3
"00abc": 0,1,2,3
"00abc0": 2,3
"00abc00": 2,3
"00abc000": 2,3
"00abcd": 0,1
"0a": 0,1,2,3
"0ab": 0,1,2,3
"0abc": 0,1,2,3
"0abc0": 2,3
"0abc00": 2,3
"0abc000": 2,3
"0abcd": 0,1
"ab": 0,1,2,3
"abc": 0,1,2,3
"abc0": 2,3
"abc00": 2,3
"abc000": 2,3
"abcd": 0,1
"bc": 0,1,2,3
"bc0": 2,3
"bc00": 2,3
"bc000": 2,3
"bcd": 0,1
"c0": 2,3
"c00": 2,3
"c000": 2,3
"cd": 0,1
"de": 0,3
"de0": 0,3
"de00": 0,3
"e0": 0,3
"e00": 0,3