Detect Proper Nouns with WordNet?

Nick Heiner · Dec 28, 2009

I'm using JAWS to access WordNet. Given a word, is there any way to detect if it is a proper noun? It looks like the synsets have pretty coarse lexical categories.

To clarify, there is no context for the words - they are just presented individually. If a word could conceivably be used as a common noun, it is acceptable. So "mark" is fine, because although it could be someone's name it could also refer to a point. However, "Africa" is not.

Answer

Rob Van Dam · Jan 2, 2010

Unfortunately, you're not going to be able to reliably determine proper noun information from WordNet synsets. What you are looking for is Named Entity Recognition. The Wikipedia page for it links to several implementations available in Java. I would personally recommend Stanford NER or LingPipe.

Updated:

Based on the added constraint of no context for words, you could use capitalization as the primary indicator and then double check WordNet to see if the word can be used as a noun. Perhaps something like this:

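import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.WordNetDatabase;

String word = "foo";
boolean isProperNoun = false;
// Capitalization is the primary signal; only then consult WordNet.
if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
    // JAWS locates the WordNet files via the wordnet.database.dir system property.
    WordNetDatabase database = WordNetDatabase.getFileInstance();
    // Flag the word only if WordNet lists at least one noun sense for it.
    Synset[] synsets = database.getSynsets(word, SynsetType.NOUN);
    isProperNoun = synsets.length > 0;
}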

That would eliminate false positives like these:

If you build it...
As you wish...
Oh Romeo, Romeo...

And still catch just the capitalized nouns in

In the Book of Mark it says...
Have you heard The Roots or The Who recently?

but still give you false positives on

Mark the first instance...
Book 'em, Danno.

because those words could be proper nouns, but without context you can't tell.

If you wanted to get really tricky, you could walk up the hypernym tree from any noun to see if you reach something obvious like 'company' or 'country'. However, the last time I was working with WordNet (4 years ago), the hypernym/hyponym relationships were not very reliable or consistent, which could cause a lot of false negatives (and it would not reduce the false positives I mentioned above, because those are completely context dependent).
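As a rough illustration of the hypernym idea, here is a minimal, self-contained sketch. It does not call WordNet at all: the HYPERNYMS map is hypothetical toy data standing in for whatever hypernym links you would read from WordNet (for example through the noun synsets JAWS returns), and the class name, the method name, and the MARKERS set (including 'continent') are assumptions made up for this example. It only shows the traversal: collect the ancestors of a word and report whether any of them is one of the "obvious" categories.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HypernymWalk {
    // Hypothetical toy hypernym links (child -> parents). NOT real WordNet data.
    static final Map<String, List<String>> HYPERNYMS = Map.of(
        "africa", List.of("continent"),
        "continent", List.of("landmass"),
        "landmass", List.of("object"),
        "mark", List.of("symbol", "point"),
        "symbol", List.of("signal"),
        "point", List.of("location"));

    // Hypothetical "obvious" categories that suggest a named entity.
    static final Set<String> MARKERS = Set.of("continent", "country", "company");

    // Returns true if any ancestor of 'start' (not 'start' itself) is a marker.
    static boolean reachesMarker(String start,
                                 Map<String, List<String>> hypernyms,
                                 Set<String> markers) {
        Deque<String> pending = new ArrayDeque<>(hypernyms.getOrDefault(start, List.of()));
        Set<String> seen = new HashSet<>();
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (!seen.add(current)) {
                continue; // already visited; also guards against cycles in the data
            }
            if (markers.contains(current)) {
                return true;
            }
            for (String parent : hypernyms.getOrDefault(current, List.of())) {
                pending.push(parent);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(reachesMarker("africa", HYPERNYMS, MARKERS)); // true
        System.out.println(reachesMarker("mark", HYPERNYMS, MARKERS));   // false
    }
}
```

With real WordNet data the same caveat applies: if the hypernym links are incomplete or inconsistent, a proper noun may never reach one of the marker categories, which is where the false negatives would come from.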