Let's say I have-
String x = "ab";
String y = "xypa";
If I want to see if any subset of x exists in y, what would be the fastest way? Looping is time consuming. In the example above a subset of x is "a" which is found in y.
Let's say I have-
String x = "ab";
String y = "xypa";
If I want to see if any subset of x exists in y, what would be the fastest way? Looping is time consuming. In the example above a subset of x is "a" which is found in y.
Well, I'm not sure it's better than looping, but you could use String#matches
:
if (y.matches(".*[" + x + "]+.*")) ...
You'd need to escape characters that are special in a regex []
construct, though (like ]
, -
, \
, ...).
The above is just an example, if you're doing it more than once, you'll want to use Pattern
, Matcher
, and the other stuff from the java.util.regex
package.
Looping is time-consuming, but there's no way to do what you want other than going over the target string repeatedly.
What you can do is optimize by checking the smallest strings first, and work your way up. For example, if the target string doesn't contain abc
, it can't possibly contain abcdef
.
Other optimizations off the top of my head:
HashSet
.You have to use for loop or use regex which is just as expensive as a for loop, becasue you need to convert one of your strings into chars basically.
Boolean isSubset = false;
for(int i = 0; i < x.length(); i++) {
if(y.contains(x.charAt(i))) {
isSubset = true;
break;
}
}
using a for loop.
It looks like this could be a case of the longest common substring problem.
The answer really depends on many things.
If you just want to find any subset and you're doing this only once, looping is just fine (and the best you can do without using additional storage) and you can stop when you find a single character that matches.
If you have a fixed x
and want to use it for matching several strings y
, you can do some pre-processing to store the characters in x
in a table and use this table to check if each character of y
occurs in x
or not.
If you want to find the largest subset, then you're looking at a different problem: the longest common subsequence problem.
You can generate all subsets of x (e.g. , in your example, ab, a, b) and then generate a regexp that would do the
Pattern p = Pattern.compile("(ab|a|b)");
Matcher m = p.matcher(y);
if(m.find()) {
System.err.println(m.group());
}
If both Strings will only contain [a-z]
. Then fastest would be to make two bitmaps, 26 bits longs. Mark all the bits contained in the String. Take the AND
of the bitmaps, the resulting bits are present in both Strings, the largest common subset. This would be a simple O(n)
with n
the length of the biggest String.
(If you want to cover the whole lot of UTF, bloom filters might be more appropriate. )
SO I think this is a place where the power of a unique set could be handy.
If you put all the value of both strings into two different HashSets, you can then run a containsAll to determine if one is a subset of the other.
What about this:?
package so3935620;
import static org.junit.Assert.*;
import java.util.BitSet;
import org.junit.Test;
public class Main {
public static boolean overlap(String s1, String s2) {
BitSet bs = new BitSet();
for (int i = 0; i < s1.length(); i++) {
bs.set(s1.charAt(i));
}
for (int i = 0; i < s2.length(); i++) {
if (bs.get(s2.charAt(i))) {
return true;
}
}
return false;
}
@Test
public void test() {
assertFalse(overlap("", ""));
assertTrue(overlap("a", "a"));
assertFalse(overlap("abcdefg", "ABCDEFG"));
}
}
And if that version is too slow, you can compute the BitSet depending on s1
, save that in some variable and later only loop over s2
.