views:

132

answers:

4

I would like to build an internal search engine (I have a very large collection of thousands of XML files) that is able to map queries to concepts. For example, if I search for "big cats", I would want highly ranked results to return documents with "large cats" as well. But I may also be interested in having it return "huge animals", albeit at a much lower relevancy score.

I'm currently reading through the Natural Language Processing in Python book, and it seems WordNet has some word mappings that might prove useful, though I'm not sure how to integrate that into a search engine. Could I use Lucene to do this? How?

From further research, it seems "latent semantic analysis" is relevant to what I'm looking for but I'm not sure how to implement it.

Any advice on how to get this done?

+1  A: 

This is an incredibly hard problem and it can't be solved in a way that would always produce adequate results. I'd suggest to stick to some very simple principles instead so that the results are at least predictable. I think you need 2 things: some basic morphology engine plus a dictionary of synonyms.

Whenever a search query arrives, for each word you

  1. Look for a literal match
  2. "Normalize/canonicalze" the word using the morphology engine, i.e. make it singular, first form, etc and look for matches
  3. Look for synonyms of the word

Then repeat for all combinations of the input words, i.e. "big cats", "big cat", "huge cats" huge cat" etc.

In fact, you need to store your index data in canonical form, too (singluar, first form etc) along with the literal form.

As for concepts, such as cats are also animals - this is where it gets tricky. It never really worked, because otherwise Google would have been returning conceptual matches already, but it's not doing that.

mojuba
Great idea on using the normalization first. NLTK Library/WordNet can be used to do this for sure. I wouldn't dismiss the conceptual tagging as impractical because Google hasn't done it yet though. Google deals with very open ended queries on billions of pages. Doing conceptual searches for them opens up a can of worms, and besides users are generally only interested in the top 10 answers. That is, general searchers want high accuracy. For my app though, breadth is an important characteristic of the quality of the result. I don't want to miss out anything potentially relevant.
DevX
+7  A: 

I'm not sure how to integrate that into a search engine. Could I use Lucene to do this? How?

Step 1. Stop.

Step 2. Get something to work.

Step 3. By then, you'll understand more about Python and Lucene and other tools and ways you might integrate them.

Don't start by trying to solve integration problems. Software can always be integrated. That's what an Operating System does. It integrates software. Sometimes you want "tighter" integration, but that's never the first problem to solve.

The first problem to solve is to get your search or concept thing or whatever it is to work as a dumb-old command-line application. Or pair of applications knit together by passing files around or knit together with OS pipes or something.

Later, you can try and figure out how to make the user experience seamless.

But don't start with integration and don't stall because of integration questions. Set integration aside and get something to work.

S.Lott
Good point on starting simple. In this case though, potential customers for the app I'm building already have "normal" search engines. I have reason to believe that a more intelligent engine could add tangible value, which is why I'd like to know if it is a feasible problem to attack before I jump in to make a "me-too" product.
DevX
@DevX: Please slow down. A "more intelligent engine" is one thing. Build that first. Integration is the least of your worries. Save that for last after you get the "more intelligent engine" working. I'll repeat this, because you don't seem to be reading it: integration can be left for last, after you get some experience with the tools and solve the basic problem.
S.Lott
A: 

First , write a piece of python code that will return you pineapple , orange , papaya when you input apple. By focusing on "is" relation of semantic network. Then continue with has a relationship and so on.

I think at the end , you might get a fairly sufficient piece of code for a school project.

wizztjh
+1  A: 

First, I agree with most of the advice here about starting slow, and first building bits and pieces of this grand plan, devising a minimal first product and continuing from there. Second, if you want to use some Wordnet functionality with Lucene, there is a contrib package for interfacing Lucene queries with Wordnet. I have no idea whether it was ported over to pylucene. Good luck and be careful out there.

Yuval F