Note: Edited for clarification.

Clarification: I'm writing a "bridge" between the user and a search engine, not a search engine. Part of my value add will be inferring the intent of a query. The intent of a tracking number, stock symbol, or address is fairly obvious. If I can categorise a query, then I can decide if the user even needs to see search results. Of course, if I cannot, then they will see search results. I am currently designing this "inference engine."

Original question: I'm writing a parser and I want to take any given token and give it a category. Here are some theoretical examples. (I'm limiting this to English for now.)

"denver" is a USCITY and a PLACENAME
"aapl" is a NASDAQSYMBOL and a STOCKTICKERSYMBOL
"555 555 5555" is a USPHONENUMBER
etc...

I know that each of these cases will most likely require specific handling, however I'm not sure where to start.

Ideally I'd end up with something simple like:

queryCategory = magicCategoryFinder( query )

    >print queryCategory
    >"SOMECATEGORY or a list"
+3  A: 

Natural language parsing is a complicated topic. One of the problems here is that determining what a word is depends on context and implied knowledge. Also, you're not so much interested in words as you are in groups of words. Consider "New York City": it's a place, but it's three words, two of which ("new" and "city") have other meanings.
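As a rough illustration of treating groups of words as a unit, here is a minimal sketch of greedy longest-match grouping against a phrase table. The table contents, the category names beyond those in the question, and the function name `group_tokens` are all made up for this example; a real table would come from a gazetteer or similar source.

```python
# Hypothetical phrase table: known phrases (as token tuples) -> categories.
PHRASES = {
    ("new", "york", "city"): ["USCITY", "PLACENAME"],
    ("new", "york"): ["USSTATE", "PLACENAME"],
    ("denver",): ["USCITY", "PLACENAME"],
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASES)


def group_tokens(query):
    """Split a query into known phrases, preferring the longest match."""
    tokens = query.lower().split()
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, shrinking until something matches.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in PHRASES:
                result.append((" ".join(candidate), PHRASES[candidate]))
                i += n
                break
        else:
            # No known phrase starts here; keep the token uncategorized.
            result.append((tokens[i], []))
            i += 1
    return result


print(group_tokens("hotels in New York City"))
```

Longest-match keeps "new york city" together instead of matching "new york" and leaving "city" behind, but it does nothing about words that are ambiguous on their own.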

Also, you have to consider ambiguity, which is once again where context and implied knowledge come in. For example, JAVA is (or was) a stock symbol for Sun Microsystems. It's also a programming language, a place, and has meaning associated with coffee. How do you classify it? You'd need to know the context in which it was used.

And if you can solve that problem reliably you can make yourself very wealthy.

What's all this in aid of anyway?

cletus
(+1) For describing in more detail than I did, why it is a hard problem.
harschware
The question does not ask for disambiguation. As you see in the examples, it allows multiple categories to be output, so Java would simply be a language, a type of coffee, an island and a stock symbol all at the same time.
Max Shawabkeh
I'm working to categorise search queries. My research indicates that a high percentage (60%+) of queries are somewhat unambiguous and if I can properly categorise them then I can present a search engine which skips the search results page in certain cases. The best example of this is a UPS tracking #. The likelihood of someone "searching" for such a number is extremely low. A ticker symbol (if unambiguous) is similar, and driving directions, address, etc... In the case of ambiguity I can simply present regular search results.
Art
Max S - exactly, a higher level part of my system will then determine a disambiguation based on the categorisations, if possible.
Art
Just thought I'd mention that search engines are quite literally in the business of bringing you to the results page ;) (Unless, of course, you're talking about something internal or non-commercial.)
Cogwheel - Matthew Orlando
+1  A: 

You're bumping up against one of the hardest problems in computer science today: determining semantics from English context. This is the classic text-mining problem, and it gets into some very advanced topics. I would suggest thinking more about your problem to see whether you can a) go without categorization, or b) use structural information such as document position to give you a hint (the token is either a city, a place name, or undetermined), perhaps with some lookup tables to help. For example, it is pretty easy to build a fairly complete lookup table of stock symbols. You might consider downloading the CIA World Factbook as a lookup for cities, etc.
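To make the lookup-table idea concrete, here is a minimal sketch in Python that combines set lookups with one regular expression and returns every category that matches. The table contents, the phone-number pattern, and the name `categorize` are assumptions made for illustration; real tables would be loaded from a ticker list or a gazetteer, and the pattern only covers a few common US phone formats.

```python
import re

# Hypothetical, tiny lookup tables standing in for real data sources.
LOOKUPS = {
    "USCITY": {"denver"},
    "PLACENAME": {"denver", "java"},
    "NASDAQSYMBOL": {"aapl"},
    "STOCKTICKERSYMBOL": {"aapl", "java"},
}

# Illustrative pattern only: ten digits with optional separators.
PATTERNS = {
    "USPHONENUMBER": re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"),
}


def categorize(query):
    """Return a sorted list of every category the whole query matches."""
    normalized = " ".join(query.lower().split())
    found = {name for name, entries in LOOKUPS.items() if normalized in entries}
    found |= {
        name
        for name, pattern in PATTERNS.items()
        if pattern.fullmatch(normalized)
    }
    return sorted(found)


for query in ("Denver", "aapl", "555 555 5555", "java", "what is love"):
    print(query, "->", categorize(query))
```

An empty list means nothing matched, which is the case where the caller would fall back to ordinary search results. An ambiguous token such as "java" simply comes back with several categories, leaving disambiguation to a later stage.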

harschware
+3  A: 

To learn about "tagging" (the term of art for what you're trying to do), I suggest playing around with NLTK's tag module. More generally, NLTK, the Natural Language ToolKit, is an excellent toolkit (based on the Python programming language) for experimentation and learning in the field of Natural Language Processing (whether it's suitable for a given production application may be a different issue, esp. if said application requires very high speed processing on large volumes of data -- but, you have to walk before you can run!-).

Alex Martelli
thanks for the heads up on the term "tagging"
Art
+1  A: 

As others have already pointed out, this is an exceptionally difficult task. The classic test is a pair of sentences:

  1. Time flies like an arrow.
  2. Fruit flies like a banana.
In the first sentence, "flies" is a verb. In the second, it's part of a noun. In the first, "like" is a preposition, but in the second it's a verb. The context doesn't make this particularly easy to sort out either -- there's no obvious difference between "Time" and "Fruit" (both normally nouns). Likewise, "arrow" and "banana" are both normally nouns.

It can be done -- but it really is decidedly non-trivial.

Jerry Coffin
+1  A: 

Although it might not help you much with disambiguation, you could use Cyc. It's a huge database of what things are that's intended to be used in AI applications (though I haven't heard any success stories).

Max Shawabkeh