views:

576

answers:

5

I want an algorithm which would create all possible phrases in a block of text. For example, in the text:

"My username is click upvote. I have 4k rep on stackoverflow"

It would create the following combinations:

"My username"
"My Username is"
"username is click"
"is click"
"is click upvote"
"click upvote"
"i have"
"i have 4k"
"have 4k"
..

You get the idea. Basically the point is to get all possible combinations of 'phrases' out of a sentence. Any thoughts for how to best implement this?

+5  A: 

Well, I don't know PHP or java, but basically you want a double loop over all the words in your text. Here's some pseudo-code:

words = split(text)
n = len(words)
for i in 1...n-1 {        // i = first word in phrase 
    for j in i+1...n {       // j = last word in phrase
        phrase = join(words[i:j])
        print phrase
    }
}

Note that the second loop starts from i, not 1. This gives you all phrases that start from word number i to word number j, which is greater than i (so all phrases have at least two words).

Ah, I just realized that you probably don't want phrases to cross sentence boundaries. So you'll want an outer loop which splits the text into sentences first, but then runs this on each sentence.

This seems pretty clear if you have any programming experience at all, but just in case: The for statements are loops [like for(i=1; i<=n; i++)], split is some function which takes a string and splits it into an array of words -- this is not completely trivial, but there's probably a library function to do this, len gives the length of the array, join puts them back together with spaces in between, and the syntax [i:j] means all of the elements from i to j inclusive (in python, this would actually be [i:j+1]). Oh, and I've implicitly assumed arrays start at index 1 rather than zero; I leave changing to 0-based C arrays as an exercise...

Finally, to answer the specific questions:

  • Note that the "second" loop is actually an inner loop; for each value of i (the first word of the phrase) we loop from i+1 to the end of the sentence to give the last word of the phrase.

  • Now that we have the number of the first and last words, the join function -- which you'll have to write -- concatenates the individual strings word[i], word[i+1], ... word[j] with spaces in between to form the phrase. In practice, this may mean the function could be declared like join(words, i, j) and returns the string, although some languages have ways to make this easier.

Andrew Jaffe
Can you translate the code to java?
Click Upvote
If you read his first sentence, you'd see he doesn't know PHP or Java. Additionally, the pseudo-code given should be simple enough to translate into Java yourself, given some basic Java knowledge and a little searching.
James Burgess
It would be if i could understand the pseudocode, it makes little sense to me. He has java as one of his tags..
Click Upvote
Yes, but that just means I answered some other question with java as a tag... Anyway, what don't you understand? See above for some hints or ask specific questions and perhaps I can help!
Andrew Jaffe
the 2nd for loop and the join() function is what isn't clear
Click Upvote
+4  A: 

Basically you need to first separate the block of text into sentences. That's tricky enough, even in English since you need to look out for periods, question marks, exclamation marks and any other sentence terminators.

Then you process one sentence at a time after removing all punctuation (commas, semi-colons, colons, and so on).

Then, when you're left with an array of words, it becomes simpler:

for i = 1 to num_words-1:
    for j = i+1 to num_words:
        phrase = words[i through j inclusive]
        store phrase

That's it, pretty simple (after initial massaging of the text block, which may not be as simple as you think).

This will give you all phrases of two or more words in every sentence.

The separation into sentences, separation into words, removal of punctuation and so on will be the hardest bit but I've already shown you some simple initial rules to follow. The rest should be added every time a block of text breaks the algorithm.

Update:

As requested, here's some Java code which gives the phrases:

public class testme {
    public final static String text =
        "My username is click upvote." +
        " I have 4k rep on stackoverflow.";

 

    public static void procSentence (String sent) {
        System.out.println ("==========");
        System.out.println ("sentence [" + sent + "]");

        // Split sentence at whitspace into array.

        String [] sa = sent.split("\\s+");

        // Process each starting word.

        for (int i = 0; i < sa.length - 1; i++) {

            // Process each phrase.

            for (int j = i+1; j < sa.length; j++) {

                // Build the phrase.

                String phrase = sa[i];
                for (int k = i+1; k <= j; k++) {
                    phrase = phrase + " " + sa[k];
                }

                // This is where you have your phrase. I just
                // print it out but you can do whatever you
                // wish with it.
                System.out.println ("   " + phrase);
            }
        }
    }

 

    public static void main(String[] args) {
        // This is the block of text to process.

        String block = text;
        System.out.println ("block    [" + block + "]");

        // Keep going until no more sentences.

        while (!block.equals("")) {
            // Remove leading spaces.

            if (block.startsWith(" ")) {
                block = block.substring(1);
                continue;
            }

            // Find end of sentence.

            int pos = block.indexOf('.');

            // Extract sentence and remove it from text block.

            String sentence = block.substring(0,pos);
            block = block.substring(pos+1);

            // Process the sentence (this is the "meat").

            procSentence (sentence);

            System.out.println ("block    [" + block + "]");
        }
        System.out.println ("==========");
    }
}

which outputs:

block    [My username is click upvote. I have 4k rep on stackoverflow.]
==========
sentence [My username is click upvote]
   My username
   My username is
   My username is click
   My username is click upvote
   username is
   username is click
   username is click upvote
   is click
   is click upvote
   click upvote
block    [ I have 4k rep on stackoverflow.]
==========
sentence [I have 4k rep on stackoverflow]
   I have
   I have 4k
   I have 4k rep
   I have 4k rep on
   I have 4k rep on stackoverflow
   have 4k
   have 4k rep
   have 4k rep on
   have 4k rep on stackoverflow
   4k rep
   4k rep on
   4k rep on stackoverflow
   rep on
   rep on stackoverflow
   on stackoverflow
block    []
==========

Now, keep in mind this is pretty basic Java (some might say it's C written in a Java dialect :-). It's just meant to illustrate how to output word groupings from a sentence as you asked for.

It does not do all the fancy sentence detection and punctuation removal I mentioned in the original answer.

paxdiablo
Can you give a php/c/java like example of your for loop? I have trouble understanding what it does because i'm not familiar with the syntax. If you could show the code in java that'd be awesome
Click Upvote
+1  A: 

Just tokenize the sentence and use the CombinationGenerator. The algorithm is described by Kenneth H. Rosen, Discrete Mathematics and Its Applications, 2nd edition (NY: McGraw-Hill, 1991), pp. 284-286.

Here's the code and example of use: http://www.merriampark.com/comb.htm

Cuga
Again (as in Jess' attempt) we don't want all possible combinations -- just successive entries. This is a much easier problem (solved above a couple of times)!
Andrew Jaffe
Ahhh... now I see.
Cuga
+1  A: 

Could play with str_word_count(); and build it as you like.

A: 

You might already know that the technical term for such phrases is Shingle. You can get shingles for input text with Lucene's ShingeMatrixFilter.

Shashikant Kore