I need to find occurrences of ~ 25 000 words within a text. What is the most suitable algorithm/library for this purpose?
target language is C++
Build a hash table with the words, then scan through the text: for each word, look it up in the word table and stuff in the needed info (increment a count, add to a position list, whatever).
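For illustration, a minimal sketch of that approach (C++11; the word list and tokenization are placeholders, real code would read both from files):

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main() {
    // The ~25,000 words to look for, each mapped to an occurrence count.
    std::unordered_map<std::string, std::size_t> words = {
        {"lorem", 0}, {"ipsum", 0}, {"dolor", 0}
    };

    std::istringstream text("lorem ipsum dolor sit amet lorem");
    std::string token;
    while (text >> token) {
        auto it = words.find(token);   // O(1) expected per lookup
        if (it != words.end())
            ++it->second;              // or record the position, etc.
    }

    for (auto const& entry : words)
        std::cout << entry.first << ": " << entry.second << '\n';
}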
A Bloom Filter may be your best bet. You initialize your filter with your search terms, then while reading your corpus you can quickly check whether each word is in the filter.
It is very space efficient, much better than simply hashing each word: with a 1% false-positive rate it should require only 9.6 bits per element. The false-positive rate is reduced by a factor of 10 for each additional 4.8 bits per element. (These numbers come from the optimal sizing formula: for a target false-positive rate p, a Bloom filter needs about -ln(p)/(ln 2)^2 ≈ 1.44 · log2(1/p) bits per element, which is 9.6 for p = 0.01 and grows by 1.44 · log2(10) ≈ 4.8 bits for every tenfold reduction of p.) Contrast this with plain hashing, which usually requires sizeof(int) == 32 bits per element.
I have an implementation in C# here: http://www.codeplex.com/bloomfilter
Here's an example, demonstrating its use with strings:
int capacity = 2000000; // the number of items you expect to add to the filter
Filter<string> filter = new Filter<string>(capacity);
filter.Add("Lorem");
filter.Add("Ipsum");
if (filter.Contains("Lorem"))
    Console.WriteLine("Match!");
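Since the target language is C++, here is a bare-bones sketch of the same idea; this is a toy (fixed size, i-th hash derived by double hashing from std::hash, which is C++11), not a tuned implementation:

#include <functional>
#include <string>
#include <vector>

class BloomFilter {
public:
    // bits: size of the bit array; hashes: number of hash functions.
    BloomFilter(std::size_t bits, std::size_t hashes)
        : m_bits(bits), m_hashes(hashes) {}

    void add(std::string const& s) {
        for (std::size_t i = 0; i < m_hashes; ++i)
            m_bits[index(s, i)] = true;
    }

    // May return false positives, never false negatives.
    bool maybe_contains(std::string const& s) const {
        for (std::size_t i = 0; i < m_hashes; ++i)
            if (!m_bits[index(s, i)])
                return false;
        return true;
    }

private:
    // Double hashing: combine two base hashes into the i-th hash.
    std::size_t index(std::string const& s, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>()(s);
        std::size_t h2 = std::hash<std::string>()(s + "#"); // crude second hash
        return (h1 + i * h2) % m_bits.size();
    }

    std::vector<bool> m_bits;
    std::size_t m_hashes;
};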
viceBerg says: "I once used the Boyer-Moore algorithm and it was quite fast."
With Boyer-Moore, aren't you typically searching a block of text for a single string?
For a simple-to-implement solution, go with the hash table approach suggested by Javier. The Bloom Filter suggested by FatCat1111 should work too... depending on the goals.
I once used the Boyer-Moore algorithm and it was quite fast.
Boyer-Moore isn't apt for efficiently searching many words. There is actually a very efficient algorithm for doing just that, called the Wu-Manber algorithm. I'll post a reference implementation. Note, however, that I wrote this some time ago for educational purposes only. Hence, the implementation isn't really suited for direct use and could be made more efficient.
It also uses a hash map: the code below uses std::unordered_map, but via the HashMap typedef you can substitute stdext::hash_map from the Dinkumware STL, std::tr1::unordered_map, or any appropriate implementation.
There's an explanation of the algorithm in lecture notes from a course at the Freie Universität Berlin, held by Knut Reinert.
The original paper is also online (I just found it again), but I don't particularly like the pseudocode presented there.
#ifndef FINDER_HPP
#define FINDER_HPP

#include <cstddef>

namespace thru { namespace matching {

class Finder {
public:
    virtual bool find() = 0;
    virtual std::size_t position() const = 0;
    virtual ~Finder() = 0;

protected:
    // Maps a char to its unsigned code point, avoiding sign-extension issues.
    static std::size_t code_from_chr(char c) {
        return static_cast<std::size_t>(static_cast<unsigned char>(c));
    }
};

inline Finder::~Finder() { }

} } // namespace thru::matching

#endif // !defined(FINDER_HPP)
#ifndef WUMANBER_HPP
#define WUMANBER_HPP

#include <string>
#include <unordered_map>
#include <vector>

#include "finder.hpp"

namespace thru { namespace matching {

class WuManberFinder : public Finder {
public:
    WuManberFinder(std::string const& text, std::vector<std::string> const& patterns);

    bool find();
    std::size_t position() const;
    std::size_t pattern_index() const;

private:
    // Indirection kept so another hash map implementation can be
    // substituted easily.
    template <typename K, typename V>
    struct HashMap {
        typedef std::unordered_map<K, V> Type;
    };

    typedef HashMap<std::string, std::size_t>::Type shift_type;
    typedef HashMap<std::string, std::vector<std::size_t> >::Type hash_type;

    std::string const& m_text;
    std::vector<std::string> const& m_patterns;
    shift_type m_shift;
    hash_type m_hash;
    std::size_t m_pos;                // current window end in the text
    std::size_t m_find_pos;           // start of the last match
    std::size_t m_find_pattern_index; // index of the last matched pattern
    std::size_t m_lmin;               // shortest pattern length
    std::size_t m_lmax;               // longest pattern length
    std::size_t m_B;                  // block (B-gram) size
};

} } // namespace thru::matching

#endif // !defined(WUMANBER_HPP)
#include <cmath>

#include "wumanber.hpp"

using namespace std;

namespace thru { namespace matching {

// Note: `patterns` must be non-empty, and every pattern must be at
// least m_B characters long.
WuManberFinder::WuManberFinder(string const& text, vector<string> const& patterns)
    : m_text(text)
    , m_patterns(patterns)
    , m_shift()
    , m_hash()
    , m_pos()
    , m_find_pos(0)
    , m_find_pattern_index(0)
    , m_lmin(m_patterns[0].size())
    , m_lmax(m_patterns[0].size())
    , m_B()
{
    for (size_t i = 0; i < m_patterns.size(); ++i) {
        if (m_patterns[i].size() < m_lmin)
            m_lmin = m_patterns[i].size();
        else if (m_patterns[i].size() > m_lmax)
            m_lmax = m_patterns[i].size();
    }

    m_pos = m_lmin;
    // Block size B ~= log_256(2 * lmin * k), as suggested in the paper.
    m_B = static_cast<size_t>(ceil(log(2.0 * m_lmin * m_patterns.size()) / log(256.0)));

    // Hash table: key each pattern on the B-gram that ends its first
    // m_lmin characters; only those characters take part in the scan.
    for (size_t i = 0; i < m_patterns.size(); ++i)
        m_hash[m_patterns[i].substr(m_lmin - m_B, m_B)].push_back(i);

    // Shift table, likewise built over the first m_lmin characters of
    // each pattern; shifts derived from the full pattern length could
    // step over occurrences of the shorter patterns.
    for (size_t i = 0; i < m_patterns.size(); ++i) {
        for (size_t j = 0; j + m_B <= m_lmin; ++j) {
            string bgram = m_patterns[i].substr(j, m_B);
            size_t pos = m_lmin - j - m_B;
            shift_type::iterator old = m_shift.find(bgram);
            if (old == m_shift.end())
                m_shift[bgram] = pos;
            else
                old->second = min(old->second, pos);
        }
    }
}
bool WuManberFinder::find() {
    while (m_pos <= m_text.size()) {
        string bgram = m_text.substr(m_pos - m_B, m_B);
        shift_type::iterator i = m_shift.find(bgram);
        if (i == m_shift.end())
            m_pos += m_lmin - m_B + 1;
        else if (i->second == 0) {
            vector<size_t>& list = m_hash[bgram];
            // Candidate window: verify each pattern whose first m_lmin
            // characters end with this B-gram against the text.
            ++m_pos;
            for (size_t j = 0; j < list.size(); ++j) {
                string const& str = m_patterns[list[j]];
                m_find_pos = m_pos - 1 - m_lmin;
                if (m_find_pos + str.size() > m_text.size())
                    continue; // pattern would run past the end of the text
                size_t k = 0;
                for (; k < str.size(); ++k)
                    if (str[k] != m_text[m_find_pos + k])
                        break;
                if (k == str.size()) {
                    m_find_pattern_index = list[j];
                    return true;
                }
            }
        }
        else
            m_pos += i->second;
    }
    return false;
}

size_t WuManberFinder::position() const {
    return m_find_pos;
}

size_t WuManberFinder::pattern_index() const {
    return m_find_pattern_index;
}

} } // namespace thru::matching
#include <iostream>
#include <string>
#include <vector>
#include "wumanber.hpp"

int main() {
    std::vector<std::string> patterns;
    patterns.push_back("announce");
    patterns.push_back("annual");
    patterns.push_back("annually");
    // Keep the text in a named variable: the finder stores a reference to it.
    std::string text("CPM_annual_conference_announce");
    thru::matching::WuManberFinder wmf(text, patterns);
    while (wmf.find())
        std::cout << "Pattern \"" << patterns[wmf.pattern_index()]
                  << "\" found at position " << wmf.position() << '\n';
}
As Javier says, the simplest solution is probably a hash table.
In C++, this can be implemented using an STL set. First add the 25,000 test words to the set, and then scan through each word in the text, using set.find(current_word) to evaluate whether the word is among the 25,000 test words.
set.find is logarithmic in the number of test words, so 25,000 of them is no problem. The algorithm is obviously linear in the number of words in the text.
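A rough sketch of that (splitting on whitespace only, which real code would refine):

#include <iostream>
#include <set>
#include <sstream>
#include <string>

int main() {
    std::set<std::string> test_words;
    test_words.insert("lorem");
    test_words.insert("ipsum");   // ... up to the 25,000 test words

    std::istringstream text("lorem ipsum dolor sit amet");
    std::string word;
    while (text >> word)
        if (test_words.find(word) != test_words.end())  // O(log n) lookup
            std::cout << "found: " << word << '\n';
}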
Maybe store your initial dictionary (the 25,000 words) in an on-disk Berkeley DB hash table, which you can use directly from C/C++ (I know you can do it from Perl), and for each word in the text, query whether it's present in the database.
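A rough sketch of that idea, assuming the classic Berkeley DB C API (error handling omitted; real code should check every return value):

#include <cstring>
#include <iostream>
#include <db.h>   // Berkeley DB C API

int main() {
    DB* db = 0;
    db_create(&db, 0, 0);
    db->open(db, 0, "words.db", 0, DB_HASH, DB_CREATE, 0664);

    // Store one dictionary word with an empty value (repeat for all words).
    DBT key, data;
    std::memset(&key, 0, sizeof key);
    std::memset(&data, 0, sizeof data);
    key.data = const_cast<char*>("lorem");
    key.size = 5;
    db->put(db, 0, &key, &data, 0);

    // Query a word from the text: get() returns 0 if the key exists.
    if (db->get(db, 0, &key, &data, 0) == 0)
        std::cout << "word is in the dictionary\n";

    db->close(db, 0);
}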
If the corpus is very large, try to optimize it in this way:
compute a hash of each word you need to check, assigning each char an integer prime number and then multiplying the numbers together; store each number->word pair in a multimap (you need to allow multiple values on a single key).
While scanning the word list, compute the hash in the same way for each word, then fetch the word(s) associated with the computed key from the map. Using integers as keys, you get O(1) retrieval; this way you can find out very quickly whether the processed word has some anagram (you multiplied the characters, so the product is order-independent) inside the map.
Remember: you stored in the multimap the set of words having that same hash, so you now need to find a match within this greatly reduced set. This additional check is needed because the mere existence of the integer in the map does not equate to the existence of the word in the associated set: we are using hashing here to reduce the computational space of the problem, but this introduces collisions (every anagram collides by construction), which need to be disambiguated by checking each candidate.
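A toy sketch of the idea (C++11); note that the product overflows for longer words, so a real implementation would need a big-integer type or a different encoding:

#include <iostream>
#include <string>
#include <unordered_map>

// First 26 primes, one per lowercase letter. The product of a word's
// primes is order-independent, so all anagrams share one key.
unsigned long long prime_hash(std::string const& word) {
    static const unsigned long long primes[26] = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
        53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };
    unsigned long long h = 1;
    for (std::size_t i = 0; i < word.size(); ++i)
        h *= primes[word[i] - 'a'];   // assumes [a-z] input only
    return h;
}

int main() {
    std::unordered_multimap<unsigned long long, std::string> index;
    index.insert(std::make_pair(prime_hash("listen"), "listen"));

    // "silent" is an anagram of "listen": same product, so we find the
    // bucket, then compare the actual strings to disambiguate.
    auto range = index.equal_range(prime_hash("silent"));
    for (auto it = range.first; it != range.second; ++it)
        std::cout << "candidate in bucket: " << it->second << '\n';
}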
You want a Ternary Search Tree. A good implementation can be found here.
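For reference, a minimal ternary search tree sketch (insert and lookup only; no balancing, deletion, or cleanup):

#include <iostream>

struct Node {
    char c;
    bool end;                 // true if a word terminates here
    Node *lo, *eq, *hi;
    Node(char ch) : c(ch), end(false), lo(0), eq(0), hi(0) {}
};

Node* insert(Node* n, const char* s) {
    if (!n) n = new Node(*s);
    if (*s < n->c)
        n->lo = insert(n->lo, s);
    else if (*s > n->c)
        n->hi = insert(n->hi, s);
    else if (s[1] != '\0')
        n->eq = insert(n->eq, s + 1);
    else
        n->end = true;
    return n;
}

bool contains(Node* n, const char* s) {
    while (n) {
        if (*s < n->c)
            n = n->lo;
        else if (*s > n->c)
            n = n->hi;
        else if (s[1] == '\0')
            return n->end;
        else {
            n = n->eq;
            ++s;
        }
    }
    return false;
}

int main() {
    Node* root = 0;
    root = insert(root, "annual");
    root = insert(root, "announce");
    std::cout << contains(root, "annual") << ' '    // prints 1
              << contains(root, "annul")  << '\n';  // prints 0
}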
If the text you're searching is huge, then it might be worth doing some preprocessing: assemble your 25,000 words into a TRIE.
Scan to the start of the first word in the text, and start walking the TRIE as you walk through the letters of the word. If there's no transition in your TRIE, skip to the start of the next word and go back to the root of the TRIE. If you reach the end of the word, and you're at a word-ending node in the TRIE, you've found a match. Repeat for each word in the text.
If your text is merely large (rather than huge), then simply looking up each word in a hash table is probably sufficient.
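A compact sketch of that scan (whitespace tokenization and memory management kept deliberately simple; the trie nodes are leaked):

#include <cctype>
#include <iostream>
#include <map>
#include <string>

struct TrieNode {
    bool terminal;                   // a search word ends here
    std::map<char, TrieNode*> next;
    TrieNode() : terminal(false) {}
};

void add_word(TrieNode& root, std::string const& word) {
    TrieNode* n = &root;
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (n->next.find(word[i]) == n->next.end())
            n->next[word[i]] = new TrieNode();
        n = n->next[word[i]];
    }
    n->terminal = true;
}

int main() {
    TrieNode root;
    add_word(root, "annual");
    add_word(root, "announce");

    std::string text = "CPM annual conference announce";

    // Walk the trie along each word's letters; abandon the word on a
    // missing transition; report a match when a terminal node is
    // reached exactly at the end of the word.
    std::size_t i = 0;
    while (i < text.size()) {
        while (i < text.size() && std::isspace((unsigned char)text[i])) ++i;
        std::size_t start = i;
        TrieNode* n = &root;
        while (i < text.size() && !std::isspace((unsigned char)text[i])) {
            if (n) {
                std::map<char, TrieNode*>::iterator it = n->next.find(text[i]);
                n = (it == n->next.end()) ? 0 : it->second;
            }
            ++i;
        }
        if (n && n->terminal)
            std::cout << "match at " << start << '\n';
    }
}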
Use the Aho-Corasick algorithm. It was made for this application. You'll only need to read each letter in your search text once. I've recently implemented and used it with great results.
The Aho-Corasick algorithm is built specifically for that purpose: searching for many words at once.
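A compact sketch of Aho-Corasick (trie plus failure links computed breadth-first; map-based transitions for brevity, though array-based transitions would be faster for a 25,000-word dictionary):

#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

class AhoCorasick {
public:
    explicit AhoCorasick(std::vector<std::string> const& patterns)
        : m_nodes(1), m_patterns(patterns) {
        // Trie of all patterns.
        for (std::size_t i = 0; i < patterns.size(); ++i) {
            int v = 0;
            for (std::size_t j = 0; j < patterns[i].size(); ++j) {
                char c = patterns[i][j];
                if (!m_nodes[v].next.count(c)) {
                    m_nodes[v].next[c] = static_cast<int>(m_nodes.size());
                    m_nodes.push_back(Node());
                }
                v = m_nodes[v].next[c];
            }
            m_nodes[v].out.push_back(i);
        }
        // Failure links, computed breadth-first.
        std::queue<int> q;
        typedef std::map<char, int>::iterator Iter;
        for (Iter it = m_nodes[0].next.begin(); it != m_nodes[0].next.end(); ++it)
            q.push(it->second);  // depth-1 nodes fail to the root (0)
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (Iter it = m_nodes[v].next.begin(); it != m_nodes[v].next.end(); ++it) {
                int u = it->second;
                // Walk failure links until a node with this transition.
                int f = m_nodes[v].fail;
                while (f != 0 && !m_nodes[f].next.count(it->first))
                    f = m_nodes[f].fail;
                Iter t = m_nodes[f].next.find(it->first);
                m_nodes[u].fail = (t != m_nodes[f].next.end() && t->second != u) ? t->second : 0;
                // A match ending here also ends every match of the fail node.
                std::vector<std::size_t> const& o = m_nodes[m_nodes[u].fail].out;
                m_nodes[u].out.insert(m_nodes[u].out.end(), o.begin(), o.end());
                q.push(u);
            }
        }
    }

    // Single pass over the text; prints every (pattern, start position).
    void search(std::string const& text) const {
        int v = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            while (v != 0 && !m_nodes[v].next.count(text[i]))
                v = m_nodes[v].fail;
            std::map<char, int>::const_iterator t = m_nodes[v].next.find(text[i]);
            v = (t == m_nodes[v].next.end()) ? 0 : t->second;
            for (std::size_t j = 0; j < m_nodes[v].out.size(); ++j) {
                std::size_t p = m_nodes[v].out[j];
                std::cout << "\"" << m_patterns[p] << "\" found at position "
                          << i + 1 - m_patterns[p].size() << '\n';
            }
        }
    }

private:
    struct Node {
        std::map<char, int> next;        // goto transitions
        int fail;                        // failure link
        std::vector<std::size_t> out;    // patterns ending at this node
        Node() : fail(0) {}
    };
    std::vector<Node> m_nodes;
    std::vector<std::string> m_patterns;
};

int main() {
    std::vector<std::string> patterns;
    patterns.push_back("announce");
    patterns.push_back("annual");
    patterns.push_back("annually");
    AhoCorasick ac(patterns);
    ac.search("CPM_annual_conference_announce");
}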