How do I get a set of grammar rules from Penn Treebank using python & NLTK?

Question 1

How do I get a set of grammar rules from Penn Treebank using python & NLTK?

python parsing grammar nltk tagged-corpus

Matt Robinson · Aug 14, 2011 · Viewed 11.6k times · Source

Answer

Answer

If you want a grammar that precisely captures the Penn Treebank sample that comes with NLTK, you can do this, assuming you've downloaded the Treebank data for NLTK (see comment below):

import nltk
from nltk.corpus import treebank
from nltk.grammar import ContextFreeGrammar, Nonterminal

tbank_productions = set(production for sent in treebank.parsed_sents()
                        for production in sent.productions())
tbank_grammar = ContextFreeGrammar(Nonterminal('S'), list(tbank_productions))

This will probably not, however, give you something useful. Since NLTK only supports parsing with grammars with all the terminals specified, you will only be able to parse sentences containing words in the Treebank sample.

Also, because of the flat structure of many phrases in the Treebank, this grammar will generalize very poorly to sentences that weren't included in training. This is why NLP applications that have tried to parse the treebank have not used an approach of learning CFG rules from the Treebank. The closest technique to that would be the Ren Bods Data Oriented Parsing approach, but it is much more sophisticated.

Finally, this will be so unbelievably slow it's useless. So if you want to see this approach in action on the grammar from a single sentence just to prove that it works, try the following code (after the imports above):

mini_grammar = ContextFreeGrammar(Nonterminal('S'),
                                  treebank.parsed_sents()[0].productions())
parser = nltk.parse.EarleyChartParser(mini_grammar)
print parser.parse(treebank.sents()[0])

Question 2

I'm fairly new to NLTK and Python. I've been creating sentence parses using the toy grammars given in the examples but I would like to know if it's possible to use a grammar learned from a portion of the Penn Treebank, say, as opposed to just writing my own or using the toy grammars? (I'm using Python 2.7 on Mac) Many thanks

How do I get a set of grammar rules from Penn Treebank using python & NLTK?

Answer

Related questions