I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
Since you didn't indicate whether you want word- or character-level n-grams, I'm just going to assume the former, without loss of generality.
I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.
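def ngrams(tokens, MIN_N, MAX_N):
    n_tokens = len(tokens)
    # xrange is Python 2; use range on Python 3
    for i in xrange(n_tokens):
        # j is the exclusive end index, so slice lengths run from MIN_N to MAX_N
        for j in xrange(i+MIN_N, min(n_tokens, i+MAX_N)+1):
            yield tokens[i:j]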
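As a quick check of what the generator produces, here is a minimal Python 3 sketch of the same loop (with range in place of xrange). The lowercase parameter names and the sample sentence are my own assumptions for illustration.

```python
def ngrams(tokens, min_n, max_n):
    """Yield every n-gram of tokens with min_n <= n <= max_n, as a list slice."""
    n_tokens = len(tokens)
    for i in range(n_tokens):
        # j is the exclusive end index of the slice starting at i
        for j in range(i + min_n, min(n_tokens, i + max_n) + 1):
            yield tokens[i:j]


# Made-up sample input: unigrams and bigrams of a four-word sentence.
tokens = "the quick brown fox".split()
for gram in ngrams(tokens, 1, 2):
    print(" ".join(gram))
```

For four tokens this yields seven n-grams: four unigrams and three bigrams, in order of starting position.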
Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
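For the dict case, one way this could look in plain Python 3 is sketched below. The name count_ngrams is hypothetical, and using tuples as keys is just one assumption about how you want to store the n-grams (list slices are not hashable, so they need converting to something that is).

```python
from collections import defaultdict


def count_ngrams(tokens, min_n, max_n):
    """Count every n-gram with min_n <= n <= max_n, without an intermediate generator."""
    counts = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + min_n, min(n_tokens, i + max_n) + 1):
            # tuple() makes the slice hashable so it can serve as a dict key
            counts[tuple(tokens[i:j])] += 1
    return counts


# Made-up sample input.
counts = count_ngrams("to be or not to be".split(), 1, 2)
print(counts[("to", "be")])  # the bigram "to be" occurs twice
```

Whether this is noticeably faster than the generator depends on your data, so it is worth timing both on a slice of the real corpus.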
Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:
from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    cdef Py_ssize_t i, j, n_tokens
    count = defaultdict(int)
    join_spaces = " ".join  # bind the method once, outside the loops
    n_tokens = len(tokens)
    for i in xrange(n_tokens):
        for j in xrange(i+MIN_N, min(n_tokens, i+MAX_N)+1):
            # key each n-gram by its space-joined string
            count[join_spaces(tokens[i:j])] += 1
    return count