Hi there,

I am working on a web-based job search application using Lucene. A user on my site can search for jobs within a 100-mile radius of, say, "Boston, MA" or any other location. I also need to show the search results sorted by relevance (i.e. the score returned by Lucene) in descending order.

I'm using a third-party API to fetch all the cities within a given radius of a city. This API returns around 864 cities within a 100-mile radius of "Boston, MA".

I'm building the city/state Lucene query using the following logic, which is part of my "BuildNearestCitiesQuery" method. Here nearestCities is a Hashtable returned by the above API. It contains 864 cities, with CityName as the key and StateCode as the value. finalQuery is a Lucene BooleanQuery object that contains the other search criteria entered by the user, such as skills, keywords, etc.

    foreach (string city in nearestCities.Keys)
    {
        BooleanQuery tempFinalQuery = finalQuery;
        cityStateQuery = new BooleanQuery();
        queryCity = queryParserCity.Parse(city);
        queryState = queryParserState.Parse(((string[])nearestCities[city])[1]);
        cityStateQuery.Add(queryCity, BooleanClause.Occur.MUST); //must is like an AND
        cityStateQuery.Add(queryState, BooleanClause.Occur.MUST);
    }

    nearestCityQuery.Add(cityStateQuery, BooleanClause.Occur.SHOULD); //should is like an OR

    finalQuery.Add(nearestCityQuery, BooleanClause.Occur.MUST);

I then pass the finalQuery object to Lucene's Search method to get all the jobs within the 100-mile radius:

searcher.Search(finalQuery, collector);

I found that this BuildNearestCitiesQuery method takes a whopping 29 seconds on average to execute, which is obviously unacceptable for any website. I also found that the statements involving "Parse" take considerably longer to execute than the other statements.

The jobs for a given location are dynamic, in the sense that a city could have 2 jobs (meeting particular search criteria) today but zero jobs for the same search criteria 3 days later. So I cannot use any caching here.

Is there any way I can optimize this logic, or for that matter my whole approach/algorithm for finding all jobs within 100 miles using Lucene?

FYI, here is what my indexing in Lucene looks like:

    doc.Add(new Field("jobId", job.JobID.ToString().Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field("title", job.JobTitle.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
    doc.Add(new Field("description", job.JobDescription.Trim(), Field.Store.NO, Field.Index.TOKENIZED));
    doc.Add(new Field("city", job.City.Trim(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
    doc.Add(new Field("state", job.StateCode.Trim(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
    doc.Add(new Field("citystate", job.City.Trim() + ", " + job.StateCode.Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
    doc.Add(new Field("datePosted", jobPostedDateTime, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field("company", job.HiringCoName.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
    doc.Add(new Field("jobType", job.JobTypeID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
    doc.Add(new Field("sector", job.SectorID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
    doc.Add(new Field("showAllJobs", "yy", Field.Store.NO, Field.Index.UN_TOKENIZED));

Thanks a ton for reading! I would really appreciate your help on this.

Janis

A: 

Apart from tempFinalQuery being unused and an unnecessary map lookup to get the state, there doesn't seem to be anything too egregious in the code you posted. Apart from the formatting...

If all the time is taken in the Parse methods, posting their code here would make sense.

Alabaster Codify
A: 

Hi,

Thanks for your response. Here is the code that calls the Parse method (in the foreach loop) and takes the most time.

    public void BuildNearestCitiesQuery(Hashtable nearestCities, string[] fields, BooleanQuery finalQuery)
    {
        QueryParser queryParserCity = new QueryParser("city", _analyzer);
        QueryParser queryParserState = new QueryParser("state", _analyzer);


        //Base City Query
        BooleanQuery baseCityStateQuery = new BooleanQuery();
        Query queryCity = queryParserCity.Parse(this._baseCity);
        Query queryState = queryParserState.Parse(this._baseStateCode);
        baseCityStateQuery.Add(queryCity, BooleanClause.Occur.MUST); //must is like an "AND"       
        baseCityStateQuery.Add(queryState, BooleanClause.Occur.MUST);
        BooleanQuery nearestCityQuery = new BooleanQuery();
        nearestCityQuery.Add(baseCityStateQuery, BooleanClause.Occur.SHOULD); //should is like an OR
        BooleanQuery cityStateQuery = null;
        queryCity = null;
        queryState = null;

        //Nearest Cities Query
        foreach (string city in nearestCities.Keys)
        {

            BooleanQuery tempFinalQuery = finalQuery;
            cityStateQuery = new BooleanQuery();
            queryCity = queryParserCity.Parse(city);
            queryState = queryParserState.Parse(((string[])nearestCities[city])[1]);
            cityStateQuery.Add(queryCity, BooleanClause.Occur.MUST); //must is like an AND
            cityStateQuery.Add(queryState, BooleanClause.Occur.MUST);
            nearestCityQuery.Add(cityStateQuery, BooleanClause.Occur.SHOULD); //should is like an OR
        }

        finalQuery.Add(nearestCityQuery, BooleanClause.Occur.MUST);
    }

Thanks.

A: 

I might have missed the point of your question, but do you have the possibility of storing latitude and longitude for zip codes? If that is an option, you could then compute the distance between two coordinates, which gives a much more straightforward scoring metric.
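
As a rough illustration of computing the distance between two coordinates, here is a minimal sketch using the haversine formula. It is written in Python for brevity (the arithmetic ports directly to C#), and the function name, the Earth-radius constant, and the sample coordinates are assumptions chosen for the example, not anything from the original post.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Approximate coordinates, used purely for illustration.
boston = (42.36, -71.06)
providence = (41.82, -71.41)

# True when the second point lies within a 100-mile radius of the first.
within_radius = haversine_miles(*boston, *providence) <= 100
print(within_radius)
```

With coordinates stored per job (or per zip code), a radius search becomes a numeric comparison instead of hundreds of parsed city-name clauses.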

Sugerman
Could you please have a look at this and comment? Thanks. http://stackoverflow.com/questions/1052086/spatialquery-for-location-based-search-using-lucene
+2  A: 

Not quite sure if I completely understand your code, but when it comes to geospatial search a filter approach might be more appropriate. Maybe this link can give you some ideas - http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html

Maybe you can use Filters for other parts of your query as well. To be honest your query looks quite complex.

--Hardy

Hardy
Could you please have a look at this and comment? Thanks. http://stackoverflow.com/questions/1052086/spatialquery-for-location-based-search-using-lucene
A: 

I believe the best approach is to move the nearest-city determination into a search filter. I would also reconsider how you have the fields set up; consider creating one term that has city+state, as that would simplify the query.
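
To make the single city+state term idea concrete, here is a small sketch in Python (used only for brevity). It assumes the nearby cities arrive as (city, state) pairs and that the indexed field holds exact strings of the form "City, ST", as the untokenized citystate field in the question does. The helper names and sample data are hypothetical. In Lucene itself, each key would become one exact term lookup, for example a TermQuery on the citystate field, rather than two parsed sub-queries per city.

```python
def citystate_key(city, state_code):
    """Build a key in the same "City, ST" shape as the indexed citystate field."""
    return city.strip() + ", " + state_code.strip()


def build_allowed_keys(nearest_cities):
    """Turn (city, state) pairs into a set of exact keys for fast membership tests."""
    return {citystate_key(city, state) for city, state in nearest_cities}


# Hypothetical sample of what the radius API might return.
nearest_cities = [("Boston", "MA"), ("Cambridge ", "MA"), ("Providence", "RI")]
allowed = build_allowed_keys(nearest_cities)

# Hypothetical job records, standing in for indexed documents.
jobs = [
    {"jobId": 1, "citystate": "Boston, MA"},
    {"jobId": 2, "citystate": "Austin, TX"},
    {"jobId": 3, "citystate": "Providence, RI"},
]

# A filter keeps only the documents whose exact term is in the allowed set.
matching_ids = [job["jobId"] for job in jobs if job["citystate"] in allowed]
print(matching_ids)
```

Using pairs rather than a table keyed by city name also avoids losing cities that share a name across states.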

Aaron Saunders
A: 

I'd suggest:

  • storing the latitude and longitude of locations as they come in
  • when a user enters a city and distance, turn that into a lat/lon value and degrees
  • do a single, simple lookup based on numerical distance lat/lon comparisons

You can see an example of how this works in the Geo::Distance Perl module. Take a look at the closest method in the source, which implements this lookup via simple SQL.
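
The steps above can be sketched as a bounding-box computation. This is not the code from the Geo::Distance module; it is a simplified Python illustration under stated assumptions: roughly 69 miles per degree of latitude, no handling of the poles or the 180° meridian, and hypothetical function names. The box is a cheap numeric prefilter; an exact distance check can follow for the few candidates that pass.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough average; an approximation


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the radius."""
    dlat = radius_miles / MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = radius_miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)


def in_box(box, lat, lon):
    """Plain numeric range comparison, the kind an index or SQL WHERE can do."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Approximate coordinates, used purely for illustration.
box = bounding_box(42.36, -71.06, 100)   # 100 miles around Boston
print(in_box(box, 41.82, -71.41))        # Providence
print(in_box(box, 40.71, -74.01))        # New York
```

The same four bounds could be used as range conditions on stored latitude and longitude values, so the lookup is a single comparison per document rather than one clause per nearby city.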

Anirvan
A: 

Agree with the others here that this smells too much. Also doing a textual search on city names is not always that reliable. There is often a bit of subjectivity between place names (particularly areas within a city which might in themselves be large).

A geospatial query is the way to go. Without knowing the rest of your setup, it is hard to advise. You do have spatial support built into Fluent to NHibernate, and SQL Server 2008 for example. You could then do a search very quickly and efficiently. However, your challenge is to get this working within Lucene.

You could possibly do a "first pass" query using spatial support in SQL Server, and then run those results through Lucene?

The other major benefit of doing spatial queries is that you can then easily sort your results by distance which is a win for your customers.

Perhentian