I need to find occurrences of ~25,000 words within a text. What is the most suitable algorithm/library for this purpose?

The target language is C++.

+11  A: 

Build a hash table with the words, then scan through the text; for each word, look it up in the word table and record the needed info (increment a count, add to a position list, whatever).
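
A minimal C++ sketch of this approach, assuming whitespace-separated words; all names here are illustrative:

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main() {
    // The ~25,000 search words, each mapped to an occurrence count here;
    // the mapped value could just as well be a list of positions.
    std::unordered_map<std::string, std::size_t> words = {
        {"annual", 0}, {"announce", 0}, {"conference", 0}
    };

    std::istringstream text("the annual conference announce was sent");
    std::string token;

    // Scan the text word by word; each lookup is O(1) on average.
    while (text >> token) {
        auto it = words.find(token);
        if (it != words.end())
            ++it->second;  // record the occurrence
    }

    for (auto const& entry : words)
        std::cout << entry.first << ": " << entry.second << '\n';
}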

Javier
Hashing 25k items is not terribly efficient in terms of space.
fatcat1111
You have to store all 25,000 words in memory anyway; the overhead of storing them in a hash table will not be more than a small constant factor. A few hundred kilobytes is no big deal with modern hardware.
Adam Rosenfield
Except that you still have to store the full wordlist somewhere, to handle the false positives from the Bloom filter. Only if the full dictionary doesn't fit in memory, and has to be accessed from slow I/O like disk, and there are plenty of negatives, does a Bloom filter improve performance.
Steve Jessop
I just learned from fatcat1111's link in his post that you can reduce false positives to under 1%; that seems very good if you are very fast 99% of the time.
Daok
Depends on the application - if you are required to count the number of times each word appears exactly, then the false positive rate must be 0%, so any Bloom filter would need to be backed by an accurate lookup. Also depends on hardware - 25K words is nothing on a PC, for example.
Steve Jessop
Also depends what proportion of lookups are negatives - Bloom filters never give false negatives, so they're fantastic for applications where a negative means you "give up". Negative results are nearly free, but positive results still require a slow accurate lookup to confirm, so are no faster.
Steve Jessop
-1 This is going to be very slow. Strings have very inefficient hashes. Much better solutions are possible.
Stephan Eggermont
@stephan: there are string hash functions that are O(1); but even the usual O(n) performs very well on human language words, since most are less than 10 chars long.
Javier
+11  A: 

A Bloom filter may be your best bet. You initialize your filter with your search terms, then while reading your corpus you can quickly check whether each word is in the filter.

It is very space efficient, much better than simply hashing each word: with a 1% false-positive rate it should require only 9.6 bits per element. The false-positive rate is reduced by a factor of 10 for each additional 4.8 bits per element. Contrast this to plain hashing, which usually requires sizeof(int) == 32 bits per element.

I have an implementation in C# here: http://www.codeplex.com/bloomfilter

Here's an example, demonstrating its use with strings:

int capacity = 2000000; // the number of items you expect to add to the filter
Filter<string> filter = new Filter<string>(capacity);
filter.Add("Lorem");
filter.Add("Ipsum");
if (filter.Contains("Lorem"))
    Console.WriteLine("Match!");
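
For the C++ side, the underlying mechanism is small enough to sketch directly. This is a toy illustration of the technique (double hashing into a bit vector), not the linked library; every name in it is made up for the example:

#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter: a bit vector with k probe positions per element.
class ToyBloomFilter {
public:
    ToyBloomFilter(std::size_t bits, std::size_t probes)
        : m_bits(bits), m_probes(probes) {}

    void add(std::string const& s) {
        for (std::size_t i = 0; i < m_probes; ++i)
            m_bits[index(s, i)] = true;
    }

    // May return true for an item never added (a false positive),
    // but never returns false for an added item.
    bool maybeContains(std::string const& s) const {
        for (std::size_t i = 0; i < m_probes; ++i)
            if (!m_bits[index(s, i)])
                return false;
        return true;
    }

private:
    // Derive the k probe positions from two hashes (double hashing).
    std::size_t index(std::string const& s, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>()(s);
        std::size_t h2 = std::hash<std::string>()(s + "#");  // crude second hash
        return (h1 + i * h2) % m_bits.size();
    }

    std::vector<bool> m_bits;
    std::size_t m_probes;
};

int main() {
    ToyBloomFilter filter(240000, 7);  // ~9.6 bits/element for 25,000 words
    filter.add("Lorem");
    return filter.maybeContains("Lorem") ? 0 : 1;
}

If exact counts are required, any positive from the filter still has to be confirmed against the real word list.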
fatcat1111
I have learned something new today! +1, dude!
Daok
A: 

vIceBerg says:

I once used the Boyer-Moore algorithm and it was quite fast.

With Boyer-Moore, aren't you typically searching a block of text for a single string?

For a simple-to-implement solution, go with the hash table approach suggested by Javier. The Bloom filter suggested by fatcat1111 should work too, depending on the goals.

ceretullis
You're right. I'm removing my answer then...
vIceBerg
+13  A: 

I once used the Boyer-Moore algorithm and it was quite fast.

Boyer-Moore isn't well suited to searching for many words efficiently. There is actually a very efficient algorithm for exactly that, called the Wu-Manber algorithm. I'll post a reference implementation. Note, however, that I did this some time ago for educational purposes only, so the implementation isn't really suitable for direct use and could be made more efficient.

It originally used stdext::hash_map from the Dinkumware STL; the code below uses std::unordered_map instead, and the small HashMap wrapper makes it easy to substitute another implementation.

There's an explanation of the algorithm in the notes of a lecture held by Knut Reinert at the Freie Universität Berlin.

The original paper is also online (I just found it again), but I don't particularly like the pseudocode presented there.

#ifndef FINDER_HPP
#define FINDER_HPP

#include <string>

namespace thru { namespace matching {

class Finder {
public:
    virtual bool find() = 0;

    virtual std::size_t position() const = 0;

    virtual ~Finder() = 0;

protected:
    static size_t code_from_chr(char c) {
        return static_cast<size_t>(static_cast<unsigned char>(c));
    }
};

inline Finder::~Finder() { }

} } // namespace thru::matching

#endif // !defined(FINDER_HPP)


#ifndef WUMANBER_HPP
#define WUMANBER_HPP

#include <vector>
#include <unordered_map>

#include "finder.hpp"

namespace thru { namespace matching {

class WuManberFinder : public Finder {
public:

    WuManberFinder(std::string const& text, std::vector<std::string> const& patterns);

    bool find();

    std::size_t position() const;

    std::size_t pattern_index() const;

private:

    template <typename K, typename V>
    struct HashMap {
        typedef std::unordered_map<K, V> Type;
    };

    typedef HashMap<std::string, std::size_t>::Type shift_type;
    typedef HashMap<std::string, std::vector<std::size_t> >::Type hash_type;

    std::string const& m_text;
    std::vector<std::string> const& m_patterns;
    shift_type m_shift;
    hash_type m_hash;
    std::size_t m_pos;
    std::size_t m_find_pos;
    std::size_t m_find_pattern_index;
    std::size_t m_lmin;
    std::size_t m_lmax;
    std::size_t m_B;
};

} } // namespace thru::matching

#endif // !defined(WUMANBER_HPP)


#include <cmath>
#include <iostream>

#include "wumanber.hpp"

using namespace std;

namespace thru { namespace matching {

WuManberFinder::WuManberFinder(string const& text, vector<string> const& patterns)
    : m_text(text)
    , m_patterns(patterns)
    , m_shift()
    , m_hash()
    , m_pos()
    , m_find_pos(0)
    , m_find_pattern_index(0)
    , m_lmin(m_patterns[0].size())
    , m_lmax(m_patterns[0].size())
    , m_B()
{
    // Determine the shortest and longest pattern lengths.
    for (size_t i = 0; i < m_patterns.size(); ++i) {
        if (m_patterns[i].size() < m_lmin)
            m_lmin = m_patterns[i].size();
        else if (m_patterns[i].size() > m_lmax)
            m_lmax = m_patterns[i].size();
    }

    m_pos = m_lmin;
    // Block size B, derived from the alphabet size (256) as in the paper.
    m_B = static_cast<size_t>(ceil(log(2.0 * m_lmin * m_patterns.size()) / log(256.0)));

    // HASH table: map the last B characters of each pattern to the
    // indices of the patterns ending in that B-gram.
    for (size_t i = 0; i < m_patterns.size(); ++i)
        m_hash[m_patterns[i].substr(m_patterns[i].size() - m_B)].push_back(i);

    // SHIFT table: for every B-gram occurring in any pattern, store how
    // far the scan position may safely advance when that B-gram is seen.
    for (size_t i = 0; i < m_patterns.size(); ++i) {
        for (size_t j = 0; j < m_patterns[i].size() - m_B + 1; ++j) {
            string bgram = m_patterns[i].substr(j, m_B);
            size_t pos = m_patterns[i].size() - j - m_B;

            shift_type::iterator old = m_shift.find(bgram);
            if (old == m_shift.end())
                m_shift[bgram] = pos;
            else
                old->second = min(old->second, pos);
        }
    }
}

bool WuManberFinder::find() {
    while (m_pos <= m_text.size()) {
        // Look at the B-gram ending at the current scan position.
        string bgram = m_text.substr(m_pos - m_B, m_B);
        shift_type::iterator i = m_shift.find(bgram);
        if (i == m_shift.end())
            // The B-gram occurs in no pattern: take the maximum safe shift.
            m_pos += m_lmin - m_B + 1;
        else {
            if (i->second == 0) {
                // A shift of 0 means some pattern may end right here.
                vector<size_t>& list = m_hash[bgram];
                // Verify all patterns in list against the text.
                ++m_pos;
                for (size_t j = 0; j < list.size(); ++j) {
                    string const& str = m_patterns[list[j]];
                    m_find_pos = m_pos - str.size() - 1;
                    size_t k = 0;

                    for (; k < str.size(); ++k)
                        if (str[k] != m_text[m_find_pos + k])
                            break;

                    if (k == str.size()) {
                        m_find_pattern_index = list[j];
                        return true;
                    }
                }
            }
            else
                m_pos += i->second; // advance by the precomputed safe shift
        }
    }

    return false;
}

size_t WuManberFinder::position() const {
    return m_find_pos;
}

size_t WuManberFinder::pattern_index() const {
    return m_find_pattern_index;
}

} } // namespace thru::matching

Example of usage:

vector<string> patterns;
patterns.push_back("announce");
patterns.push_back("annual");
patterns.push_back("annually");

WuManberFinder wmf("CPM_annual_conference_announce", patterns);

while (wmf.find())
    cout << "Pattern \"" << patterns[wmf.pattern_index()] <<
        "\" found at position " << wmf.position() << endl;
Konrad Rudolph
+2  A: 

As Javier says, the simplest solution is probably a hash table.

In C++, this can be implemented using an STL set. First add the 25,000 test words to the set, and then scan through each word in the text, using set.find(current_word) to evaluate whether the word is among the 25,000 test words.

set.find is logarithmically fast, so 25,000 test words shouldn't be a problem. The algorithm is obviously linear in the number of words in the text.
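
A minimal sketch of exactly that, with illustrative names:

#include <iostream>
#include <set>
#include <sstream>
#include <string>

int main() {
    std::set<std::string> testWords = {"annual", "announce", "conference"};

    std::istringstream text("the annual conference announce was sent");
    std::string word;

    // Each find() costs O(log n) in the number of test words.
    while (text >> word)
        if (testWords.find(word) != testWords.end())
            std::cout << "hit: " << word << '\n';
}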

Chris Johnson
STL set is a tree map; for this, a hash map (unordered_set in some STL implementations) amortizes to O(1).
Javier
A: 

Maybe store your initial dictionary (the 25,000 words) in a Berkeley DB hash table on disk, which you can probably use directly from C/C++ (I know you can do it from Perl), and for each word in the text, query whether it's present in the database.

stephanea
+5  A: 

If the corpus is that large, try to optimize it this way:

Compute a hash of each word you need to check, assigning each character an integer prime number and multiplying the numbers together; store each number -> word pair in a multimap (you need to allow multiple values per key).

While scanning the word list, compute the hash of each word the same way, then fetch the word(s) associated with the computed key from the map. Using integers as keys gives O(1) retrieval; this also lets you find very quickly whether the processed word has an anagram in the map (since multiplying the character primes is order-independent).

Remember: you stored in the multimap the set of words sharing the same hash, so you now need to find a match within this greatly reduced set. The additional check is needed because the mere existence of the integer key in the map does not imply the existence of the word in the associated set: we use hashing here to shrink the search space of the problem, but this introduces collisions, which must be disambiguated by checking each identified candidate.
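
A rough sketch of the idea, with illustrative names. Note that the product of the character primes can overflow for long words; that is harmless here because every candidate is verified exactly:

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// One prime per letter 'a'..'z'; a word's key is the product of its primes.
const std::uint64_t primes[26] = {
     2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
};

std::uint64_t productHash(std::string const& w) {
    std::uint64_t h = 1;
    for (char c : w)
        h *= primes[c - 'a'];  // wraps around for long words
    return h;
}

int main() {
    // number -> word multimap: several words may share one key.
    std::unordered_multimap<std::uint64_t, std::string> table;
    for (std::string const& w : {"listen", "silent", "apple"})
        table.insert({productHash(w), w});

    // The key narrows the candidates to a tiny set (anagrams and
    // collisions); an exact comparison then disambiguates.
    std::string scanned = "listen";
    auto range = table.equal_range(productHash(scanned));
    for (auto it = range.first; it != range.second; ++it)
        if (it->second == scanned)
            std::cout << "exact match: " << it->second << '\n';
}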

Lorenzo Boccaccia
Note: this is a special case of a Bloom filter where the collisions that generate false positives are stored: instead of just a bit at the key position, you maintain the set of generating words, so false positives can be weeded out in the disambiguation step.
Lorenzo Boccaccia
A: 

You may also sort the text and the word list alphabetically. When you have two sorted arrays, you can easily find the matches in linear time.
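
A sketch of that merge, assuming the text has already been tokenized into words (names are illustrative):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> textWords = {"conference", "annual", "announce"};
    std::vector<std::string> wordList  = {"announce", "annual", "meeting"};

    std::sort(textWords.begin(), textWords.end());
    std::sort(wordList.begin(), wordList.end());

    // Walk both sorted sequences once: linear after the O(n log n) sorts.
    std::size_t i = 0, j = 0;
    while (i < textWords.size() && j < wordList.size()) {
        if (textWords[i] < wordList[j])      ++i;
        else if (wordList[j] < textWords[i]) ++j;
        else { std::cout << "match: " << textWords[i] << '\n'; ++i; ++j; }
    }
}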

A: 

You want a Ternary Search Tree. A good implementation can be found here.
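In case it helps, here is a bare-bones sketch of the structure itself (insert and exact lookup only, cleanup omitted for brevity); all names are illustrative:

#include <iostream>
#include <string>

// Minimal ternary search tree: each node holds one character and three
// children (less-than, equal/next-character, greater-than).
struct Node {
    char ch;
    bool endOfWord = false;
    Node *lo = nullptr, *eq = nullptr, *hi = nullptr;
    explicit Node(char c) : ch(c) {}
};

Node* insert(Node* n, const char* s) {
    if (!n) n = new Node(*s);
    if (*s < n->ch)      n->lo = insert(n->lo, s);
    else if (*s > n->ch) n->hi = insert(n->hi, s);
    else if (s[1])       n->eq = insert(n->eq, s + 1);
    else                 n->endOfWord = true;
    return n;
}

bool contains(const Node* n, const char* s) {
    while (n) {
        if (*s < n->ch)      n = n->lo;
        else if (*s > n->ch) n = n->hi;
        else if (s[1])       { n = n->eq; ++s; }
        else                 return n->endOfWord;
    }
    return false;
}

int main() {
    Node* root = nullptr;
    for (const char* w : {"annual", "announce"})
        root = insert(root, w);
    std::cout << contains(root, "annual") << ' ' << contains(root, "annul") << '\n';
}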

florin
+2  A: 

If the text you're searching is huge, then it might be worth doing some preprocessing: assemble your 25,000 words into a TRIE.

Scan to the start of the first word in the text, and start walking the TRIE as you walk through the letters of the word. If there's no transition in your TRIE, skip to the start of the next word and go back to the root of the TRIE. If you reach the end of the word, and you're at a word-ending node in the TRIE, you've found a match. Repeat for each word in the text.
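
A condensed sketch of that walk (illustrative names; a production TRIE would use a flat per-node child array rather than a map, and would free its nodes):

#include <iostream>
#include <map>
#include <sstream>
#include <string>

struct TrieNode {
    std::map<char, TrieNode*> children;
    bool endOfWord = false;
};

void insert(TrieNode* root, std::string const& w) {
    TrieNode* n = root;
    for (char c : w) {
        TrieNode*& child = n->children[c];
        if (!child)
            child = new TrieNode();  // create the missing transition
        n = child;
    }
    n->endOfWord = true;
}

int main() {
    TrieNode root;
    insert(&root, "annual");
    insert(&root, "announce");

    std::istringstream text("the annual conference announce was sent");
    std::string word;
    while (text >> word) {
        // Walk the TRIE along the word; bail out on a missing transition.
        const TrieNode* n = &root;
        for (char c : word) {
            auto it = n->children.find(c);
            if (it == n->children.end()) { n = nullptr; break; }
            n = it->second;
        }
        if (n && n->endOfWord)
            std::cout << "found: " << word << '\n';
    }
}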

If your text is merely large (rather than huge), then simply looking up each word in a hash table is probably sufficient.

Adrian McCarthy
+3  A: 

Use the Aho-Corasick algorithm. It was made for this application. You'll only need to read each letter in your search text once. I've recently implemented and used it with great results.
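
A compact, unoptimized sketch of the automaton for reference (all names are made up; a serious implementation would use flat transition tables). It builds a trie of the patterns, computes failure links with a BFS, and then reads the text exactly once:

#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct AcNode {
    std::map<char, int> next;   // trie (goto) transitions
    int fail = 0;               // failure link
    std::vector<int> out;       // patterns ending at this state
};

int main() {
    std::vector<std::string> patterns = {"announce", "annual", "annually"};
    std::vector<AcNode> nodes(1);  // state 0 is the root

    // Build the trie of all patterns.
    for (int p = 0; p < (int)patterns.size(); ++p) {
        int cur = 0;
        for (char c : patterns[p]) {
            if (!nodes[cur].next.count(c)) {
                nodes.push_back(AcNode());
                nodes[cur].next[c] = (int)nodes.size() - 1;
            }
            cur = nodes[cur].next[c];
        }
        nodes[cur].out.push_back(p);
    }

    // BFS to compute failure links (longest proper suffix that is also a
    // trie path); output sets are merged along the links.
    std::queue<int> q;
    for (auto const& kv : nodes[0].next)
        q.push(kv.second);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (auto const& kv : nodes[u].next) {
            int f = nodes[u].fail;
            while (f && !nodes[f].next.count(kv.first))
                f = nodes[f].fail;
            auto it = nodes[f].next.find(kv.first);
            int link = (it != nodes[f].next.end()) ? it->second : 0;
            nodes[kv.second].fail = link;
            nodes[kv.second].out.insert(nodes[kv.second].out.end(),
                                        nodes[link].out.begin(),
                                        nodes[link].out.end());
            q.push(kv.second);
        }
    }

    // Scan the text once; on a mismatch, follow failure links.
    std::string text = "CPM_annual_conference_announce";
    int state = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char c = text[i];
        while (state && !nodes[state].next.count(c))
            state = nodes[state].fail;
        auto it = nodes[state].next.find(c);
        state = (it != nodes[state].next.end()) ? it->second : 0;
        for (int p : nodes[state].out)
            std::cout << patterns[p] << " ends at position " << i << '\n';
    }
}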

Imbue
A: 

The Aho-Corasick algorithm is built specifically for that purpose: searching for many words at once.

AndreyT