I'm using JAWS to access WordNet. Given a word, is there any way to detect if it is a proper noun? It looks like the synsets have pretty coarse lexical categories.

To clarify, there is no context for the words - they are just presented individually. If a word could conceivably be used as a common noun, it is acceptable. So "mark" is fine, because although it could be someone's name it could also refer to a point. However, "Africa" is not.

A: 

That information doesn't seem to be stored explicitly in WordNet. You can, however, look at the first word form of a noun synset to see whether it's capitalized. I'm not sure how official that is, but it seems to work: it reports that "fly" is not a proper noun and that "France" is.

Pace
This would consider all nouns at the start of a sentence to be proper, though. I really wish we changed our grammatical rules for things like this as they introduce ambiguity and don't have any function as far as I can tell beyond aesthetics.
Matt Boehm
Actually, I still don't see what being at the start of a sentence has to do with it. Instead of only checking the noun-ness of capitalized words, why not check every word for noun-ness first, then check whether it's capitalized in WordNet? It makes no difference whether it's capitalized in your original document: WordNet will return "book" regardless of whether you passed in "book" or "Book". As for the "mark" and "Mark" problem, just search all the forms of a noun synset to see if any of them are uncapitalized.
Pace
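
The check discussed in this answer and its comments can be sketched without touching WordNet itself. This is only an illustration: the class name `ProperNounCheck`, the method `isOnlyProperNoun`, and the nested-list input are hypothetical, and the caller is assumed to have already pulled the word forms of each noun synset out of whatever WordNet library is in use. Restricting the test to forms that match the queried word (rather than every synonym in the synset) is also an assumption of this sketch.

```java
import java.util.List;

class ProperNounCheck {

    /**
     * senses holds one inner list per noun synset, containing that synset's
     * word forms exactly as the dictionary spells them ("Africa", "mark", "Mark").
     * Returns true only when the word has at least one noun sense and no sense
     * lists a lowercase spelling of it.
     */
    static boolean isOnlyProperNoun(String word, List<List<String>> senses) {
        boolean sawMatchingForm = false;
        for (List<String> forms : senses) {
            for (String form : forms) {
                if (form.isEmpty() || !form.equalsIgnoreCase(word)) {
                    continue; // a synonym in the synset, not the queried word
                }
                sawMatchingForm = true;
                if (Character.isLowerCase(form.charAt(0))) {
                    return false; // the word can be used as a common noun
                }
            }
        }
        return sawMatchingForm; // false when the word has no noun sense at all
    }
}
```

With senses such as ("mark", "grade", "score") and ("Mark", "Saint Mark"), the word "mark" is accepted as a common noun, while a lone sense ("Africa") marks "Africa" as proper only.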
+3  A: 

Unfortunately, you're not going to be able to reliably determine proper noun information from WordNet synsets. What you are looking for is Named Entity Recognition. The Wikipedia page links to several implementations available in Java. I would personally recommend Stanford NER or LingPipe.

Updated:

Based on the added constraint of no context for words, you could use capitalization as the primary indicator and then double check WordNet to see if the word can be used as a noun. Perhaps something like this:

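import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.WordNetDatabase;

String word = "foo";
boolean isProperNoun = false;
// Only capitalized words are candidates.
if (Character.isUpperCase(word.charAt(0))) {
    WordNetDatabase database = WordNetDatabase.getFileInstance();
    // Treat the word as a proper noun only if WordNet knows it as a noun.
    Synset[] synsets = database.getSynsets(word, SynsetType.NOUN);
    isProperNoun = synsets.length > 0;
}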

That would eliminate false positives like this:

If you build it...
As you wish...
Oh Romeo, Romeo...

And still catch just the capitalized nouns in

In the Book of Mark it says...
Have you heard The Roots or The Who recently?

but still give you false positives on

Mark the first instance...
Book 'em, Danno.

because these words could be proper nouns, and without context you can't tell.

If you wanted to get really tricky, you could follow up the hypernym tree on any noun to see if you reached something obvious like 'company' or 'country'. However, the last time I was working with WordNet (4 years ago), the hypernym/hyponym relationships were not very reliable or consistent, which could cause a lot of false negatives (and without improving the false positives I mentioned above because those are completely context dependent).
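
As a rough illustration of that hypernym idea, here is a sketch that walks parent links upward until it hits a marker concept. Everything in it is an assumption made for the example: the class `HypernymWalk`, the method `reachesAny`, and the plain `Map` that stands in for the hypernym lookup of a real WordNet library. It also assumes a single parent per concept, which real synsets do not guarantee.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class HypernymWalk {

    /**
     * Follows parent links upward from start and reports whether any ancestor
     * is one of the target concepts (for example "country" or "company").
     * hypernymOf maps a concept to its single parent; this is a simplification.
     */
    static boolean reachesAny(String start, Map<String, String> hypernymOf, Set<String> targets) {
        Set<String> visited = new HashSet<>();
        String current = hypernymOf.get(start);
        // visited.add returns false on a repeat, which stops the walk on cyclic data
        while (current != null && visited.add(current)) {
            if (targets.contains(current)) {
                return true;
            }
            current = hypernymOf.get(current);
        }
        return false;
    }
}
```

With a toy map in which "France" points to "country" and "fly" points to "insect", a target set of "country" and "company" would flag "France" but not "fly". Gaps in the hierarchy show up as a `false` result, which is the false-negative risk mentioned above.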

Rob Van Dam
NER typically depends on having context present.
Ken Bloom
I modified my answer to reflect the lack of context.
Rob Van Dam
+1  A: 

Let me run this past you: as a read through a few more books on English will show, one cannot determine a word's part of speech out of context.

The best you could do is test for exclusion ... determining that WordNet knows of no usage in a given part of speech. In some cases you might find that only one part of speech is listed in WordNet. For example, I know of no usage of "car" other than as a noun.

Distinguishing proper nouns from common ones is even more difficult. Certainly you can use the heuristic ... a noun which is not the initial word of a sentence and is capitalized but not in ALLCAPS is probably a proper noun.
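
A minimal sketch of that capitalization heuristic, assuming the sentence has already been split into tokens and that the token in question is already known to be a noun (the sketch does not check that). The class `CapitalizationHeuristic` and the method `looksLikeProperNoun` are names made up for this example.

```java
import java.util.Locale;

class CapitalizationHeuristic {

    /**
     * True when the token at index is not the first word of the sentence,
     * starts with an uppercase letter, and is not written entirely in capitals.
     */
    static boolean looksLikeProperNoun(String[] tokens, int index) {
        if (index <= 0 || index >= tokens.length) {
            return false; // sentence-initial (or out of range): capitalization tells us nothing
        }
        String token = tokens[index];
        if (token.isEmpty() || !Character.isUpperCase(token.charAt(0))) {
            return false;
        }
        // Single capital letters are not counted as ALLCAPS here.
        boolean allCaps = token.length() > 1 && token.equals(token.toUpperCase(Locale.ROOT));
        return !allCaps;
    }
}
```

For the tokens of "In the Book of Mark it says", this flags "Book" and "Mark" but not the sentence-initial "In". It only says "probably", exactly as the heuristic does.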

Ultimately, the distinction is one of semantics rather than lexical analysis. I doubt you'll find a reasonably robust solution based on looking up words in WordNet. I think you'll need to do natural language grammatical parsing before you'll be able to reliably extract nouns, much less detect proper nouns in prose.

Jim Dennis