ansaurus

Question

Finding if all elements in a vector<string> are in a string

Answer 1

+2 A:

Boost isn't magic. It'll just loop over that vector too. It depends on how efficient you need this to be; if it's not time intensive, then just write the simple loop.

(By the way, if you're wondering how to determine if a string contains another as a substring, use mystring.find(mysubstring).)

Edit: I may have been a little snarky with my first response, so I'll elaborate. If you're interested in efficiency, you can do it with one pass of the large string, which will be an improvement if the main string is much longer than the substrings.

The running time of the following is O(km + kn), where there are k substrings of max length m, and we're checking in a string of length n (as opposed to the naive algorithm, which is O(kmn)).

#include <cassert>
#include <list>
#include <map>
#include <string>

// store the substrings in a Trie data structure (http://en.wikipedia.org/wiki/Trie)
class Trie {
public:
  friend class TrieCounter;

  Trie(): m_parent(0) {}
  ~Trie() { for(Leaves::iterator it=m_leaves.begin();it!=m_leaves.end();++it) delete it->second; }

  const Trie *add(const char *str) {
    Leaves::iterator it = m_leaves.find(str[0]);
    if(it == m_leaves.end())
      it = m_leaves.insert(std::make_pair(str[0], new Trie(this))).first;

    if(!str[0])
      return it->second;
    return it->second->add(str + 1);
  }

  const Trie *get(char c) const {
    Leaves::const_iterator it = m_leaves.find(c);
    if(it == m_leaves.end())
      return 0;
    return it->second;
  }

  bool is_leaf() const { return m_leaves.empty(); }
  bool is_end_of_word() const { return m_leaves.find(0) != m_leaves.end(); }

private:
  Trie(Trie *parent): m_parent(parent) {}

private:
  Trie *m_parent;
  typedef std::map<char, Trie*> Leaves;
  Leaves m_leaves;
};

// a little counter helper class
class TrieCounter {
public:
  TrieCounter(const Trie& root): m_wordCount(0) { add(root); }

  unsigned word_count() const { return m_wordCount; }
  void use(const Trie& trie) {
    m_wordCount++;
    decr(trie);
  }

private:
  void add(const Trie& trie) {
    if(trie.is_leaf())
      return;

    m_count[&trie] = trie.m_leaves.size();
    for(Trie::Leaves::const_iterator it=trie.m_leaves.begin();it!=trie.m_leaves.end();++it)
      add(*it->second);
  }

  void decr(const Trie& trie) {
    Count::iterator it = m_count.find(&trie);
    assert(it != m_count.end());
    assert(it->second > 0);
    it->second--;
    if(it->second == 0)
      decr(*it->first->m_parent);
  }

private:
  typedef std::map<const Trie*, unsigned> Count;
  Count m_count;
  unsigned m_wordCount;
};

// and the list of substrings
class WordList {
public:
  WordList() {}
  void add(const std::string& str) { m_wordEnds.push_back(m_trie.add(str.c_str())); }
  unsigned size() const { return m_wordEnds.size(); }
  const Trie& root() const { return m_trie; }

private:
  Trie m_trie;
  typedef std::list<const Trie *> Tries;
  Tries m_wordEnds;
};

unsigned count_in(const WordList& wordList, const std::string& str) {
  typedef std::list<const Trie *> Tries;
  typedef Tries::iterator Iter;

  TrieCounter counter(wordList.root());
  Tries curChecking;
  for(unsigned i=0;i<str.size();i++) {
    const char c = str[i];
    curChecking.push_back(&m_wordList.root());

    Iter it = curChecking.begin();
    while(it != curChecking.end()) {
      Iter next = ++Iter(it);
      if(const Trie *child = (*it)->get(c)) {
        *it = child;
        if(child->is_end_of_word())
          counter.use(*child);
      } else {
        curChecking.erase(it);
      }
      it = next;
    }
  }
  return counter.word_count();
}

First set up your substrings:

WordList wordList;
wordList.add("foo");
wordList.add("lan");
wordList.add("found");
wordList.add("under");
wordList.add("next");
wordList.add("land");
wordList.add("new");
wordList.add("news");
wordList.add("on");
wordList.add("and");

and then

std::cout << count_in(wordList, "newfoundland") << "\n";

Jesse Beder 2010-04-21 03:25:15

well, I've found that a lot of answers for C++ questions just wind up invoking boost. And no, that's not what I'm wondering.

devin 2010-04-21 03:27:40

Are you sure boost isn't magic?

Brian R. Bondy 2010-04-21 03:50:22

@Brian, I suppose you're right :)

Jesse Beder 2010-04-21 03:52:11

I'm not familiar with Boost Regexp but I wouldn't be surprised if it could compile the set of strings into a single structure and perform a single pass over the text string.

Potatoswatter 2010-04-21 05:55:08

@Potatoswatter, it might be able to compile the set of strings into a single structure, but it wouldn't be able to search for them all (in any order) in a single pass.

Ken Bloom 2010-04-21 13:17:08

@Ken: regexp algorithms are by their nature single pass. Related: see http://en.wikipedia.org/wiki/String_searching_algorithm

Potatoswatter 2010-04-21 18:57:05

@Potatoswatter, I know that. I challenge you to write the regexp that does what he wants.

Ken Bloom 2010-04-21 19:14:56

@Ken: `str1|str2`.

Potatoswatter 2010-04-21 19:27:12

@Potatoswatter: he wants to know if *all* of the strings in the array are in the string he's searching. Your regex would match if only one of the strings is in the string he's searching. (Oh, and on many regex implementations, `str1|str2` uses backtracking so the time it takes to match is superlinear in the length of the string.)

Ken Bloom 2010-04-21 23:59:00

@Ken: So, use a regex function which returns successive matches within the input and converts the NFA expression to DFA before running. That may be impractical for many applications, but it's very much a textbook way of doing it and shouldn't be hard to find. (Again, I'm not familiar with Boost Regexp.)

Potatoswatter 2010-04-22 00:33:25

@Potatoswatter: then you have to count the matches (which is easy) and you have to compare every match with every other match to make sure they're all different (i.e. that you didn't match `str1` twice.)

Ken Bloom 2010-04-22 01:43:21

@Ken: Yes. This is easier if the regexp engine can somehow identify the final DFA state of the match. Unfortunately that info is often thrown out. `(str1)|(str2)` may be a workaround, with some implications on complexity and engine requirements.

Potatoswatter 2010-04-22 01:49:52

Of course, you're better off just getting the right tool for the job — a search engine — if performance is really critical.

Potatoswatter 2010-04-22 01:50:40

Answer 2

+4 A:

This uses the algorithms in the standard template library. They loop over the string internally, but that may be what you're looking for.

bool in(string string1,string string2){
  return string2.find(string1) != string::npos;
}

if (count_if(v.begin(),v.end(),bind2nd(ptr_fun(in),s1)) == v.size()){
   cout << "all match" << endl;
}

If your compiler supports C++0x lambdas, you might find a lambda to be clearer.

if (count_if(v.begin(),v.end(),
    [&](string substr){return s1.find(substr)!=string::npos;})
   == v.size()){
   cout << "all match" << endl;;
}

Ken Bloom 2010-04-21 03:29:01

ansaurus

tags:

views:

answers:

Finding if all elements in a vector<string> are in a string

related questions