I have a database full of phrases (80-100 characters each) and some longish documents (50-100 KB), and I would like a ranked list of phrases for a given document, rather than the usual search-engine output of a list of documents for a given phrase.

I've used MySQL full-text indexing before and looked into Lucene, but never used it. They both seem geared to comparing the short (the search term) with the long (the document).

How would you get the inverse of this?

A: 

Would it be too slow to turn each phrase into a regex and run each one on the document, counting the number of occurrences?

If that doesn't work, maybe you can combine all the phrases into one huge regex (using |), and compile it. Then, run that huge regex starting from every character in the document. Count the number of matches as you go through the characters.
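
For example, here's a minimal sketch of the combined-regex idea (PhraseCounter and countPhrases are just illustrative names, not anything from an existing library). Quoting each phrase keeps any regex metacharacters literal, and restarting the search one character past each match start counts overlapping occurrences:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class PhraseCounter {
        static Map<String, Integer> countPhrases(List<String> phrases, String document) {
            // Combine all phrases into one alternation, quoting each so
            // regex metacharacters inside a phrase are treated literally.
            StringBuilder sb = new StringBuilder();
            for (String p : phrases) {
                if (sb.length() > 0) sb.append('|');
                sb.append(Pattern.quote(p));
            }
            Pattern all = Pattern.compile(sb.toString());

            Map<String, Integer> counts = new HashMap<String, Integer>();
            Matcher m = all.matcher(document);
            int from = 0;
            // find(from) resets the matcher and searches from that index,
            // so overlapping occurrences are counted as well.
            while (m.find(from)) {
                String hit = m.group();
                Integer c = counts.get(hit);
                counts.put(hit, c == null ? 1 : c + 1);
                from = m.start() + 1;
            }
            return counts;
        }
    }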

Claudiu
I can trade time to build an index, so that looking up a list of phrases (for a given document) is as fast as possible.
Tourch
A: 

How large is the database of phrases? I am assuming that it is very large.

I would do the following:

  1. Index the phrases by one of the words in them; you might choose the least common word in each phrase. You could make the search better by assuming that the word is at least, say, 5 characters long, and padding shorter words to 5 characters. The padding can be the space after the word followed by the subsequent word (to reduce false matches), or some default characters (e.g. "XX") if the word occurs at the end of the phrase. (A sketch of this keying scheme follows the list.)

  2. Go through your document, converting each word (common ones can be discarded) to a key, padding if necessary.

  3. Retrieve the relevant phrases by looking up these keys.

  4. Use an in-memory text search to find the number of occurrences of each of the retrieved phrases.

  5. I am assuming that phrases cannot cross a sentence boundary. In this case, you can read each sentence of the document into a substring in an array, and use the substring function to search through each sentence for each of the phrases and count occurrences, keeping a running sum for each phrase.
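
A rough sketch of the keying scheme from steps 1 and 2, assuming 5-character keys; keyFor, buildIndex, and leastCommonIndex are hypothetical names, not anything from an existing library:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class PhraseIndex {
        static final int KEY_LEN = 5;

        // Key for the word at position i of a tokenized phrase: the word
        // itself if it is long enough, otherwise padded with the space and
        // the following word, or with a default character at phrase end.
        static String keyFor(List<String> phrase, int i) {
            String word = phrase.get(i);
            if (word.length() >= KEY_LEN) return word;
            StringBuilder sb = new StringBuilder(word);
            if (i + 1 < phrase.size()) sb.append(' ').append(phrase.get(i + 1));
            while (sb.length() < KEY_LEN) sb.append('X'); // default padding
            return sb.substring(0, KEY_LEN);
        }

        // Step 1: index each tokenized phrase under the key of its least
        // common word.
        static Map<String, List<List<String>>> buildIndex(List<List<String>> phrases) {
            Map<String, List<List<String>>> index = new HashMap<String, List<List<String>>>();
            for (List<String> phrase : phrases) {
                String key = keyFor(phrase, leastCommonIndex(phrase));
                List<List<String>> bucket = index.get(key);
                if (bucket == null) {
                    bucket = new ArrayList<List<String>>();
                    index.put(key, bucket);
                }
                bucket.add(phrase);
            }
            return index;
        }

        // Stub: would pick the position of the word with the lowest
        // corpus frequency.
        static int leastCommonIndex(List<String> phrase) {
            return 0;
        }
    }

At query time, step 2 computes keyFor for each word position in the document and looks the key up in the index.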

Larry Watanabe
+2  A: 

I did something similar with a database of Wikipedia titles and managed to get down to a few hundred milliseconds for each ~50KB document. That was still not fast enough for my needs, but maybe it can work for you.

Basically the idea was to work with hashes as much as possible and only do string comparisons on possible matches, which are pretty rare.

First, you take your database and convert it into an array of hashes. If you have billions of phrases, this may not be for you. When you calculate the hash, be sure to pass the phrases through a tokenizer that will remove punctuation and whitespace. This part needs to be done only once.

Then, you go through the document with the same tokenizer, keeping a running list of the last 1, 2, ..., n tokens, hashed. At every iteration, you do a binary search of the hashes you have against the hash database.

When a hash matches, you do the actual string comparison to confirm that it's a real match and not a collision.

Here's some code to give you a taste of what I mean, though this example doesn't actually do the string comparison:

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Map;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // distinct phrase hashes seen in this document
    HashSet<Long> foundHashes = new HashSet<Long>();
    Map<Long, Integer> count = new HashMap<Long, Integer>();

    // sliding window of the last maxPhrase tokens, padded with empty strings
    LinkedList<String> words = new LinkedList<String>();
    for (int i = 0; i < params.maxPhrase; i++) words.addLast("");

    StandardTokenizer st = new StandardTokenizer(new StringReader(docText));
    Token t = new Token();
    while (st.next(t) != null) {
        String token = new String(t.termBuffer(), 0, t.termLength());
        words.addLast(token);
        words.removeFirst();

        // hash every candidate phrase length ending at the current token
        for (int len = params.minPhrase; len < params.maxPhrase; len++) {
            String term = Utils.join(
                    new ArrayList<String>(words.subList(params.maxPhrase - len, params.maxPhrase)), " ");

            long hash = Utils.longHash(term);

            // cheap hash lookup first; the actual string comparison against
            // the lexicon would go here, to rule out hash collisions
            if (params.lexicon.isTermHash(hash)) {
                foundHashes.add(hash);
            }
        }
    }

    // tally the phrases found (a set is used above, so each phrase is
    // counted at most once per document; move the counting into the loop
    // if you want per-occurrence counts)
    for (long hash : foundHashes) {
        Integer c = count.get(hash);
        count.put(hash, c == null ? 1 : c + 1);
    }
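
To turn those counts into the ranked list the question asks for, you could sort the map entries by value; a follow-up fragment in the same scope (assumes the count map above, plus java.util.Collections and java.util.Comparator imports):

    // sort phrase hashes by count, highest first; mapping the hashes
    // back to phrase text would go through the lexicon
    List<Map.Entry<Long, Integer>> ranked =
            new ArrayList<Map.Entry<Long, Integer>>(count.entrySet());
    Collections.sort(ranked, new Comparator<Map.Entry<Long, Integer>>() {
        public int compare(Map.Entry<Long, Integer> a, Map.Entry<Long, Integer> b) {
            return b.getValue().compareTo(a.getValue());
        }
    });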
itsadok
A few hundred milliseconds is acceptable. I'll give this approach a go.
Tourch
A: 

Maybe reading Peter Turney on keyphrase extraction will give you some ideas. Overall, his approach has some similarity to what itsadok has suggested.

Yuval F