tags:

views:

91

answers:

2

I have a vector and I need to see if all the strings in that vector are substrings of another given string.

eg

vector<string> v;
v.push_back("str1");
v.push_back("str2");
string s1 = "str1, str2, str3";
string s2 = "str1, str3";

Is there a way to get true from s1 and false from s2 without looping over the vector? Also, note that due to my environment, I can't use boost. I think if I had boost, I could do this.

+2  A: 

Boost isn't magic. It'll just loop over that vector too. It depends on how efficient you need this to be; if it's not time intensive, then just write the simple loop.

(By the way, if you're wondering how to determine if a string contains another as a substring, use mystring.find(mysubstring).)


Edit: I may have been a little snarky with my first response, so I'll elaborate. If you're interested in efficiency, you can do it with one pass of the large string, which will be an improvement if the main string is much longer than the substrings.

The running time of the following is O(km + kn), where there are k substrings of max length m, and we're checking in a string of length n (as opposed to the naive algorithm, which is O(kmn)).

#include <cassert>
#include <list>
#include <map>
#include <string>

// store the substrings in a Trie data structure (http://en.wikipedia.org/wiki/Trie)
class Trie {
public:
  friend class TrieCounter;

  Trie(): m_parent(0) {}
  ~Trie() { for(Leaves::iterator it=m_leaves.begin();it!=m_leaves.end();++it) delete it->second; }

  const Trie *add(const char *str) {
    Leaves::iterator it = m_leaves.find(str[0]);
    if(it == m_leaves.end())
      it = m_leaves.insert(std::make_pair(str[0], new Trie(this))).first;

    if(!str[0])
      return it->second;
    return it->second->add(str + 1);
  }

  const Trie *get(char c) const {
    Leaves::const_iterator it = m_leaves.find(c);
    if(it == m_leaves.end())
      return 0;
    return it->second;
  }

  bool is_leaf() const { return m_leaves.empty(); }
  bool is_end_of_word() const { return m_leaves.find(0) != m_leaves.end(); }

private:
  Trie(Trie *parent): m_parent(parent) {}

private:
  Trie *m_parent;
  typedef std::map<char, Trie*> Leaves;
  Leaves m_leaves;
};

// a little counter helper class
class TrieCounter {
public:
  TrieCounter(const Trie& root): m_wordCount(0) { add(root); }

  unsigned word_count() const { return m_wordCount; }
  void use(const Trie& trie) {
    m_wordCount++;
    decr(trie);
  }

private:
  void add(const Trie& trie) {
    if(trie.is_leaf())
      return;

    m_count[&trie] = trie.m_leaves.size();
    for(Trie::Leaves::const_iterator it=trie.m_leaves.begin();it!=trie.m_leaves.end();++it)
      add(*it->second);
  }

  void decr(const Trie& trie) {
    Count::iterator it = m_count.find(&trie);
    assert(it != m_count.end());
    assert(it->second > 0);
    it->second--;
    if(it->second == 0)
      decr(*it->first->m_parent);
  }

private:
  typedef std::map<const Trie*, unsigned> Count;
  Count m_count;
  unsigned m_wordCount;
};

// and the list of substrings
class WordList {
public:
  WordList() {}
  void add(const std::string& str) { m_wordEnds.push_back(m_trie.add(str.c_str())); }
  unsigned size() const { return m_wordEnds.size(); }
  const Trie& root() const { return m_trie; }

private:
  Trie m_trie;
  typedef std::list<const Trie *> Tries;
  Tries m_wordEnds;
};

unsigned count_in(const WordList& wordList, const std::string& str) {
  typedef std::list<const Trie *> Tries;
  typedef Tries::iterator Iter;

  TrieCounter counter(wordList.root());
  Tries curChecking;
  for(unsigned i=0;i<str.size();i++) {
    const char c = str[i];
    curChecking.push_back(&m_wordList.root());

    Iter it = curChecking.begin();
    while(it != curChecking.end()) {
      Iter next = ++Iter(it);
      if(const Trie *child = (*it)->get(c)) {
        *it = child;
        if(child->is_end_of_word())
          counter.use(*child);
      } else {
        curChecking.erase(it);
      }
      it = next;
    }
  }
  return counter.word_count();
}

First set up your substrings:

WordList wordList;
wordList.add("foo");
wordList.add("lan");
wordList.add("found");
wordList.add("under");
wordList.add("next");
wordList.add("land");
wordList.add("new");
wordList.add("news");
wordList.add("on");
wordList.add("and");

and then

std::cout << count_in(wordList, "newfoundland") << "\n";
Jesse Beder
well, I've found that a lot of answers for C++ questions just wind up invoking boost. And no, that's not what I'm wondering.
devin
Are you sure boost isn't magic?
Brian R. Bondy
@Brian, I suppose you're right :)
Jesse Beder
I'm not familiar with Boost Regexp but I wouldn't be surprised if it could compile the set of strings into a single structure and perform a single pass over the text string.
Potatoswatter
@Potatoswatter, it might be able to compile the set of strings into a single structure, but it wouldn't be able to search for them all (in any order) in a single pass.
Ken Bloom
@Ken: regexp algorithms are by their nature single pass. Related: see http://en.wikipedia.org/wiki/String_searching_algorithm
Potatoswatter
@Potatoswatter, I know that. I challenge you to write the regexp that does what he wants.
Ken Bloom
@Ken: `str1|str2`.
Potatoswatter
@Potatoswatter: he wants to know if *all* of the strings in the array are in the string he's searching. Your regex would match if only one of the strings is in the string he's searching. (Oh, and on many regex implementations, `str1|str2` uses backtracking so the time it takes to match is superlinear in the length of the string.)
Ken Bloom
@Ken: So, use a regex function which returns successive matches within the input and converts the NFA expression to DFA before running. That may be impractical for many applications, but it's very much a textbook way of doing it and shouldn't be hard to find. (Again, I'm not familiar with Boost Regexp.)
Potatoswatter
@Potatoswatter: then you have to count the matches (which is easy) and you have to compare every match with every other match to make sure they're all different (i.e. that you didn't match `str1` twice.)
Ken Bloom
@Ken: Yes. This is easier if the regexp engine can somehow identify the final DFA state of the match. Unfortunately that info is often thrown out. `(str1)|(str2)` may be a workaround, with some implications on complexity and engine requirements.
Potatoswatter
Of course, you're better off just getting the right tool for the job — a search engine — if performance is really critical.
Potatoswatter
+4  A: 

This uses the algorithms in the standard template library. They loop over the string internally, but that may be what you're looking for.

bool in(string string1,string string2){
  return string2.find(string1) != string::npos;
}

if (count_if(v.begin(),v.end(),bind2nd(ptr_fun(in),s1)) == v.size()){
   cout << "all match" << endl;
}

If your compiler supports C++0x lambdas, you might find a lambda to be clearer.

if (count_if(v.begin(),v.end(),
    [&](string substr){return s1.find(substr)!=string::npos;})
   == v.size()){
   cout << "all match" << endl;;
}
Ken Bloom