I'm working on a non-English parser with Unicode characters. For that, I decided to use NLTK.
But it requires a predefined context-free grammar as below:
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
In my app, I am supposed to minimize hard coding with the use of a rule-based grammar. For example, I can assume any word ending with -ed or -ing as a verb. So it should work for any given context.
How can I feed such grammar rules to NLTK? Or generate them dynamically using Finite State Machine?
If you are creating a parser, then you have to add a step of pos-tagging before the actual parsing -- there is no way to successfully determine the POS-tag of a word out of context. For example, 'closed' can be an adjective or a verb; a POS-tagger will find out the correct tag for you from the context of the word. Then you can use the output of the POS-tagger to create your CFG.
You can use one of the many existing POS-taggers. In NLTK, you can simply do something like:
import nltk
input_sentence = "Dogs chase cats"
text = nltk.word_tokenize(input_sentence)
list_of_tokens = nltk.pos_tag(text)
print list_of_tokens
The output will be:
[('Dogs', 'NN'), ('chase', 'VB'), ('cats', 'NN')]
which you can use to create a grammar string and feed it to nltk.parse_cfg()
.