N-gram generation from a sentence

Preetam Purbia · Sep 7, 2010

How can I generate the word n-grams of a string such as:

String input = "This is my car.";

I want to generate n-grams from this input.

Input: n-gram size = 3

The output should be:

This
is
my
car

This is
is my
my car

This is my
is my car

How can I implement this in Java, or is there a library available for it?

I tried NGramTokenizer, but it produces n-grams of character sequences, and I want n-grams of word sequences.

Answer

aioobe · Sep 7, 2010

I believe this would do what you want:

import java.util.*;

public class Test {

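    // Returns every run of n consecutive words in str, in order of appearance.
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        // Words are split on single spaces; punctuation stays attached to its word.
        String[] words = str.split(" ");
        // There are words.length - n + 1 windows of size n (none if n > words.length).
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    // Joins words[start] .. words[end - 1] with single spaces.
    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }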

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

Output:

This
is
my
car.

This is
is my
my car.

This is my
is my car.
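
Note that the output above keeps the trailing period ("car.") because the words are split on spaces only, whereas the question asks for "car". A minimal sketch of one way to handle that, assuming it is acceptable to drop every character that is not a letter, digit, or whitespace (this would also turn "don't" into "dont"). The class name WordNgrams and the helper tokenize are hypothetical, and String.join needs Java 8 or later:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordNgrams {

    // Hypothetical helper: strips punctuation, then splits on runs of whitespace.
    static String[] tokenize(String str) {
        // Assumption: anything that is not a letter, digit, or whitespace is noise.
        String cleaned = str.replaceAll("[^\\p{L}\\p{N}\\s]", "").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Returns every run of n consecutive words, joined by single spaces.
    static List<String> ngrams(int n, String str) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String[] words = tokenize(str);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}
```

With the same input, this prints "car" and "my car" instead of "car." and "my car.", which matches the output requested in the question.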

An "on-demand" solution implemented as an Iterator:

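import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIterator implements Iterator<String> {

    private final String[] words;
    private final int n;
    private int pos = 0;   // index of the first word of the next n-gram

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

    public String next() {
        // Honor the Iterator contract instead of failing with an array index error.
        if (!hasNext())
            throw new NoSuchElementException();
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++)
            sb.append((i > pos ? " " : "") + words[i]);
        pos++;
        return sb.toString();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}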
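
To show how the iterator might be consumed, here is a rough usage sketch. It restates the iterator compactly so that the example compiles on its own, and wraps it in an Iterable so that it works in a for-each loop. The names NgramIteratorDemo and the static helper ngrams are hypothetical, and the lambda and String.join assume Java 8 or later:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIteratorDemo {

    // Compact restatement of the iterator above, so this example is self-contained.
    static class NgramIterator implements Iterator<String> {
        private final String[] words;
        private final int n;
        private int pos = 0;

        NgramIterator(int n, String str) {
            this.n = n;
            this.words = str.split(" ");
        }

        @Override
        public boolean hasNext() {
            return pos < words.length - n + 1;
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            // Join the n words starting at pos, then advance the window by one word.
            String ngram = String.join(" ", Arrays.copyOfRange(words, pos, pos + n));
            pos++;
            return ngram;
        }
    }

    // Hypothetical helper: each call to iterator() starts a fresh pass over the n-grams.
    static Iterable<String> ngrams(int n, String str) {
        return () -> new NgramIterator(n, str);
    }

    public static void main(String[] args) {
        // N-grams are built one at a time, only when the loop asks for the next one.
        for (String ngram : ngrams(2, "This is my car."))
            System.out.println(ngram);
    }
}
```

Because each n-gram is built only when next() is called, this avoids holding the full list of n-grams in memory, which may matter for long inputs.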