views:

79

answers:

4

Is there a web service or tool that can detect whether a given text is the name of a person, a place, or an object (device)?

eg:

Input: Bill Clinton Output: Person

Input: Blackberry Output: Device

Input: New york Output: Place

Accuracy can be low. I have looked at OpenCyc, but I couldn't get it to work. Is there a way I can use Wikipedia for this?

For a start, just separating a person from a thing would be great.

+1  A: 

I think Wikipedia would be a very good source. Given the input, you could try to find an entry in Wikipedia and scrape the resulting page (if it exists).

People and places should have fairly distinct sets of data in their articles (birth dates, locations, etc.) that you could use to tell them apart; anything else is an object.

It's worth a shot anyway.
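A minimal sketch of that idea, assuming you already have the plain text of the article in hand (fetching and scraping the page is left out). The cue lists and the function names `count_cues` and `classify_article` are hypothetical, picked only to illustrate the approach, and have not been tuned on real articles:

```python
import re

# Hypothetical cue patterns: illustrative guesses, not a tested feature set.
PERSON_CUES = [r"\bborn\b", r"\bbirth", r"\bspouse\b", r"\bhe\b", r"\bshe\b"]
PLACE_CUES = [r"\bcoordinates\b", r"\bpopulation\b", r"\blocated in\b",
              r"\bcapital\b", r"\bcity\b"]

def count_cues(text, patterns):
    """Count how many times any of the cue patterns occurs in the text."""
    lowered = text.lower()
    return sum(len(re.findall(p, lowered)) for p in patterns)

def classify_article(text):
    """Return 'Person', 'Place', or 'Object' based on simple cue counts."""
    person = count_cues(text, PERSON_CUES)
    place = count_cues(text, PLACE_CUES)
    if person == 0 and place == 0:
        # No person or place evidence: fall back to "anything else is an object".
        return "Object"
    return "Person" if person >= place else "Place"
```

Ties go to "Person" here, which is an arbitrary choice. On real articles you would probably want to look only at the first paragraph or the infobox, since long articles about places mention plenty of people and vice versa.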

Brian Ramsay
+1  A: 

How about using a search engine? Google would be good, and I think Yahoo! has tools for building your own search.

I googled:

Results 1 - 10 of about 27,100,000 for "bill clinton" person
Results 1 - 10 of about 6,050,000 for "bill clinton" place
Results 1 - 10 of about 601,000 for "bill clinton" device

He's a person!

Results 1 - 10 of about 391,000,000 for "new york" place.
Results 1 - 10 of about 280,000,000 for "new york" person.
Results 1 - 10 of about 84,100,000 for "new york" device.

It's a place!

Results 1 - 10 of about 11,000,000 for "blackberry" person
Results 1 - 10 of about 36,600,000 for "blackberry" place
Results 1 - 10 of about 28,000,000 for "blackberry" device

Unfortunately, blackberry is a place as well. :-/

Note that only in the case of 'blackberry' did "device" even get close. Maybe you need to weight the page hit values. What is your application? Do you have any idea which "devices" you'd have to classify? What is the possible range of inputs?

Maybe you want to combine the results you get from different sources.
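Here is one rough way the weighting could look, using only the hit counts quoted above. The assumption (mine, not something a search engine provides) is that each category word has its own baseline popularity, which I estimate by summing that category's hits over the three queries. Dividing by that total keeps a common word like "place" from winning just because it appears on more pages. The names `HITS`, `category_totals`, and `classify` are made up for this sketch:

```python
# Hit counts copied from the searches above.
HITS = {
    "bill clinton": {"person": 27_100_000, "place": 6_050_000, "device": 601_000},
    "new york": {"person": 280_000_000, "place": 391_000_000, "device": 84_100_000},
    "blackberry": {"person": 11_000_000, "place": 36_600_000, "device": 28_000_000},
}

def category_totals(hits):
    """Sum each category's hits across all queries (a crude popularity baseline)."""
    totals = {}
    for counts in hits.values():
        for category, n in counts.items():
            totals[category] = totals.get(category, 0) + n
    return totals

def classify(term, hits):
    """Pick the category whose hit count is largest relative to its baseline."""
    totals = category_totals(hits)
    weighted = {c: n / totals[c] for c, n in hits[term].items()}
    return max(weighted, key=weighted.get)

for term in HITS:
    print(term, "->", classify(term, HITS))
```

With these numbers, "blackberry" comes out as a device instead of a place, while the other two keep their answers. A baseline built from three queries is far too small to trust, so treat this as an illustration of the weighting idea only.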

Nosredna
Although Google is obviously very useful, I've found it makes a horrible classifier, since its query format doesn't always reflect the semantic relationship you're looking for. It might work better to extract the text from the top N Google results and classify that using an SVM trained to predict person/place/device based on a bag-of-words.
Chris S
+1  A: 

Looking at the output of Wolfram Alpha, it seems you can identify a person by searching "Bill Clinton birthday" or just "Bill Clinton", and a location by searching "New York GPS coordinates" or just "New York"; the more specific queries give even better results. "Blackberry" seems like a tough word for Alpha, because it keeps wanting to interpret it as a fruit. You might have luck searching Froogle to identify a device.

It seems like WA will give you a fairly decent accuracy, at least if you're using famous people/places.

Mark Rushakoff
+1 Looks like Wolfram Alpha is pretty accurate - "blackberry" as a device is more specific than "blackberry" as a fruit - and the OP doesn't need perfect accuracy, and certainly can't expect any algorithm to divine the intent of the input without further context.
Jeffrey Kemp
A: 

I think the basic task you're trying to accomplish is more formally known as named entity recognition. This task is nontrivial, and by only inputting the name stripped of any context, you're making it even harder.

For example, we'd like to think examples such as "Bill Clinton" and "New York" are obviously unambiguous, but their disambiguation pages in Wikipedia show that there are several potential entities they may refer to. "New York" is a state, a city, and a movie title. "Bill Clinton" is a bit less ambiguous if you're only looking at Wikipedia, but I'm sure you'll find dozens of Bill Clintons in any phonebook. It might also be the name of someone's sailboat or pet dog. What if someone inputs "Washington"? That could be a U.S. President, a state, a district, a city, a lake, a street, an island, a movie, one of several U.S. Navy ships, or a bridge, among other things. Determining which is the "correct" usage you'd want the web service to return could become very complicated.

As much as Cyc knows, I think you'll find it's still not as comprehensive as Wikipedia. However, the main downside to Wikipedia is that it's essentially unstructured. Personally, I find Cyc's API so convoluted and poorly documented that parsing Wikipedia's natural language almost seems easier.

If I had to implement such a web service from scratch, I'd start by downloading a snapshot of Wikipedia, then write a parser that reads through all the articles and generates a named entity index based on article titles. You could manually "classify" a few dozen examples as person/place/object, and train a classifier (Bayesian, Maxent, SVM) to automatically classify other examples based on the word frequencies of their articles.
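To make the classifier step concrete, here's a small sketch of the Bayesian option: a multinomial naive Bayes over word counts with add-one smoothing, written in plain Python. The training strings are invented stand-ins for article text, and the `NaiveBayes` class and `build_demo_classifier` helper are hypothetical names for this example. Downloading and parsing the Wikipedia snapshot is not shown.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a real parser would do much more."""
    return text.lower().split()

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus log likelihood with add-one (Laplace) smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokenize(text):
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def build_demo_classifier():
    """Train on a few invented, hand-labeled snippets (not real articles)."""
    nb = NaiveBayes()
    nb.train("born politician president married elected", "person")
    nb.train("born actor singer career married", "person")
    nb.train("city population located river coordinates", "place")
    nb.train("state capital population located county", "place")
    nb.train("smartphone keyboard wireless battery manufacturer", "object")
    nb.train("device screen battery model released", "object")
    return nb
```

Calling `build_demo_classifier().predict("population of the capital city")` should pick "place" on this toy data. With real articles you'd want far more than a handful of examples per class, plus some stop-word filtering.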

Chris S