I was asked about this question. I can only think of a O(nm) algorithm if n is the length of the 1st string and m is the length of the 2nd string.
views:
73answers:
3Well, you can do it in O(n + m). Just create a reference table showing whether character exists in first string. Something like this (pseudo-code in no particular language)
// fill the table
for (int i = 0; i < a.length; ++i) {
characterExists[a[i]] = true;
}
// iterate over second string
for (int i = 0; i < b.length; ++i) {
if !characterExists[b[i]] {
// remove char (or do whatever else you want)
}
}
Have you checked out the Boyer-Moore String Search Algorithm?
The worst-case to find all occurrences in a text needs approximately 3*N comparisons, hence the complexity is O(n), regardless whether the text contains a match or not. This proof took some years to determine. In the year the algorithm was devised, 1977, the maximum number of comparisons was shown to be no more than 6*N; in 1980 it was shown to be no more than 4*N, until Cole's result in Sep 1991.
C implementation:
#include <limits.h>
#include <string.h>
#define ALPHABET_SIZE (1 << CHAR_BIT)
static void compute_prefix(const char* str, size_t size, int result[size]) {
size_t q;
int k;
result[0] = 0;
k = 0;
for (q = 1; q < size; q++) {
while (k > 0 && str[k] != str[q])
k = result[k-1];
if (str[k] == str[q])
k++;
result[q] = k;
}
}
static void prepare_badcharacter_heuristic(const char *str, size_t size,
int result[ALPHABET_SIZE]) {
size_t i;
for (i = 0; i < ALPHABET_SIZE; i++)
result[i] = -1;
for (i = 0; i < size; i++)
result[(size_t) str[i]] = i;
}
void prepare_goodsuffix_heuristic(const char *normal, size_t size,
int result[size + 1]) {
char *left = (char *) normal;
char *right = left + size;
char reversed[size+1];
char *tmp = reversed + size;
size_t i;
/* reverse string */
*tmp = 0;
while (left < right)
*(--tmp) = *(left++);
int prefix_normal[size];
int prefix_reversed[size];
compute_prefix(normal, size, prefix_normal);
compute_prefix(reversed, size, prefix_reversed);
for (i = 0; i <= size; i++) {
result[i] = size - prefix_normal[size-1];
}
for (i = 0; i < size; i++) {
const int j = size - prefix_reversed[i];
const int k = i - prefix_reversed[i]+1;
if (result[j] > k)
result[j] = k;
}
}
/*
* Boyer-Moore search algorithm
*/
const char *boyermoore_search(const char *haystack, const char *needle) {
/*
* Calc string sizes
*/
size_t needle_len, haystack_len;
needle_len = strlen(needle);
haystack_len = strlen(haystack);
/*
* Simple checks
*/
if(haystack_len == 0)
return NULL;
if(needle_len == 0)
return haystack;
/*
* Initialize heuristics
*/
int badcharacter[ALPHABET_SIZE];
int goodsuffix[needle_len+1];
prepare_badcharacter_heuristic(needle, needle_len, badcharacter);
prepare_goodsuffix_heuristic(needle, needle_len, goodsuffix);
/*
* Boyer-Moore search
*/
size_t s = 0;
while(s <= (haystack_len - needle_len))
{
size_t j = needle_len;
while(j > 0 && needle[j-1] == haystack[s+j-1])
j--;
if(j > 0)
{
int k = badcharacter[(size_t) haystack[s+j-1]];
int m;
if(k < (int)j && (m = j-k-1) > goodsuffix[j])
s+= m;
else
s+= goodsuffix[j];
}
else
{
return haystack + s;
}
}
/* not found */
return NULL;
}
O(n+m) is possible by storing the characters in your first string in a set before iterating the second string, but you will need to be careful about the complexity guarantees of the container(s) you choose. For instance, in C++, the following:
std::set<unsigned char> to_remove(a.begin(), a.end());
std::vector<unsigned char> to_keep;
BOOST_FOREACH (unsigned char x, b)
if (to_remove.find(x) == to_remove.end())
to_keep.push_back(x);
std::string result(to_keep.begin(), to_keep.end());
is algorithmically correct but only achieves O(n log(n) + m log(n)) because creating the to_remove object involves sorting [n log(n)], and the find operation is actually log(n). Worse, if you had used string concatenation operations instead of std::vector's push_back, you might have found your costs rising to O(n log(n) + m log(n) + m^2) depending on your std::string implementation. To achieve O(n + m) you would need to use an unsorted_set.