How can I generate the n-grams of a string like this:

String Input="This is my car."

I want to generate the word n-grams of this input.

Input n-gram size = 3 (all sizes from 1 up to 3)

The output should be:

This
is
my
car

This is
is my
my car

This is my
is my car

Please give me some idea of how to implement this in Java, or tell me whether a library is available for it. I am trying to use NGramTokenizer, but it produces n-grams of character sequences and I want n-grams of word sequences. http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/ngram/NGramTokenizer.html

Thanks for your help

+2  A: 

I believe this would do what you want:

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

Output:

This
is
my
car.

This is
is my
my car.

This is my
is my car.
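
Note that the output above keeps the trailing period ("car."), while the question asks for "car". A minimal sketch of one way to handle that, assuming it is acceptable to split on every character that is not a letter or digit (the class name WordNgrams and the tokenize helper are made up for illustration, and String.join needs Java 8 or later):

```java
import java.util.ArrayList;
import java.util.List;

class WordNgrams {

    // Split on runs of characters that are neither letters nor digits,
    // so "car." becomes "car". Assumption: this is the tokenization you want;
    // it will also split contractions such as "don't" into "don" and "t".
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<String>();
        for (String w : text.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w); // split can yield a leading empty string
            }
        }
        return words;
    }

    // All word n-grams of size n, in order of appearance.
    static List<String> ngrams(int n, String text) {
        List<String> words = tokenize(text);
        List<String> result = new ArrayList<String>();
        if (n < 1) {
            return result; // no n-grams for a non-positive size
        }
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(ngrams(n, "This is my car."));
        }
    }
}
```

If punctuation inside words matters for your data, a different split pattern would be needed.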

An "on-demand" solution implemented as an Iterator:

class NgramIterator implements Iterator<String> {

    String[] words;
    int pos = 0, n;

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

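    public String next() {
        // Honor the Iterator contract when no n-grams are left.
        if (!hasNext())
            throw new java.util.NoSuchElementException();
        StringBuilder sb = new StringBuilder();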
        for (int i = pos; i < pos + n; i++)
            sb.append((i > pos ? " " : "") + words[i]);
        pos++;
        return sb.toString();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
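
A bare Iterator cannot be used directly in a for-each loop. As a rough sketch of the same on-demand idea, the logic could be wrapped in an Iterable instead; NgramIterable is a hypothetical name, and the guard for n < 1 is an added assumption about the desired behavior:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIterable implements Iterable<String> {

    private final int n;
    private final String[] words;

    NgramIterable(int n, String str) {
        this.n = n;
        this.words = str.split(" ");
    }

    // Each call returns a fresh iterator that builds n-grams lazily.
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int pos = 0;

            public boolean hasNext() {
                return n > 0 && pos + n <= words.length;
            }

            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                StringBuilder sb = new StringBuilder();
                for (int i = pos; i < pos + n; i++) {
                    if (i > pos) {
                        sb.append(' ');
                    }
                    sb.append(words[i]);
                }
                pos++;
                return sb.toString();
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

Usage would then look like: for (String ngram : new NgramIterable(2, "This is my car.")) System.out.println(ngram);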
aioobe
A: 

This code returns an array of all word n-grams of the given length:

public static String[] ngrams(String s, int len) {
    String[] parts = s.split(" ");
    String[] result = new String[parts.length - len + 1];
    for(int i = 0; i < parts.length - len + 1; i++) {
       StringBuilder sb = new StringBuilder();
       for(int k = 0; k < len; k++) {
           if(k > 0) sb.append(' ');
           sb.append(parts[i+k]);
       }
       result[i] = sb.toString();
    }
    return result;
}

E.g.

System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car] 
Landei
`ngrams("This is my car", -3)` (sorry, couldn't resist)
wds
`ngrams("This is my car", -3)` works fine. `ngrams("This is my car", 6)` however, results in a `NegativeArraySizeException`.
aioobe
What do you expect in these cases? I'd suggest adding a check at the beginning of the method that returns an empty array. In general, I see few SO answers with sophisticated error handling.
Landei
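
The check suggested in the comment above could look roughly like this. It is only a sketch: the class name SafeNgrams is made up, and returning an empty array for out-of-range lengths (rather than throwing) is one possible choice. String.join needs Java 8 or later.

```java
import java.util.Arrays;

class SafeNgrams {

    static String[] ngrams(String s, int len) {
        String[] parts = s.split(" ");
        // Guard: there are no n-grams for a non-positive length
        // or for a length greater than the number of words.
        if (len < 1 || len > parts.length) {
            return new String[0];
        }
        String[] result = new String[parts.length - len + 1];
        for (int i = 0; i < result.length; i++) {
            // Join the words parts[i] .. parts[i + len - 1] with single spaces.
            result[i] = String.join(" ", Arrays.copyOfRange(parts, i, i + len));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ngrams("This is my car", 2)));
        System.out.println(Arrays.toString(ngrams("This is my car", 6)));
        System.out.println(Arrays.toString(ngrams("This is my car", -3)));
    }
}
```

With the guard in place, both `ngrams("This is my car", 6)` and `ngrams("This is my car", -3)` return an empty array instead of throwing or producing empty strings.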
+2  A: 

You are looking for Lucene's ShingleFilter, which builds word-level n-grams ("shingles") from a token stream.

Shashikant Kore