I basically want Solr to search each record of a multivalued field for my search parameter. Read on for my example:

I am using Solr to index my data. I have application data in parallel arrays (in the form of multi-valued fields) that match a given product. See the following example, where make, model, and year are multivalued fields:

<-solr record start->
sku: 1234
make: acura, acura, acura
model: integra, rsx, rsx
year: 1997, 2004, 2000
engine: 3.4, 4.5, 4.5
<-solr record end->

I am using filter queries (&fq=) to narrow my selections. The problem is that if someone looks up a 2000 Acura Integra, it matches the above record, even though there is actually no 2000 Acura Integra for this product once you read the make, model, and year data in parallel. Solr matches the make in the make field, the model in the model field, and the year in the year field (as it should) and returns this result without respecting my parallelism. My query looks like this so far:


fq=make:"acura"&fq=model:"integra"&fq=year:2000 (I would normally escape URL characters when I POST to Solr; this is just an example)
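To make the false match concrete, here is a minimal Python sketch that imitates the behaviour described above. It does not talk to Solr; `matches_per_field` and `matches_same_row` are hypothetical helpers that only model "each filter checks its own field" versus "all filters must hit the same position".

```python
# Toy model of the record above: three parallel multivalued fields.
record = {
    "sku": "1234",
    "make": ["acura", "acura", "acura"],
    "model": ["integra", "rsx", "rsx"],
    "year": ["1997", "2004", "2000"],
}

def matches_per_field(doc, filters):
    """Check each filter against its own field independently,
    the way separate fq clauses see a multivalued field."""
    return all(value in doc[field] for field, value in filters.items())

def matches_same_row(doc, filters):
    """Require every filter to match at the same index (the intended parallelism)."""
    fields = list(filters)
    rows = zip(*(doc[f] for f in fields))
    wanted = tuple(filters[f] for f in fields)
    return any(row == wanted for row in rows)

query = {"make": "acura", "model": "integra", "year": "2000"}
print(matches_per_field(record, query))  # True: the unwanted hit
print(matches_same_row(record, query))   # False: no such application exists
```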

So my solution was to create another multivalued field, called the summary field, in which I put all the make, model, year, and other data (like engine) together, separated by spaces. The quotation marks around each word are necessary so that terms with multiple words don't inadvertently match search parameters. The above example now looks like this:

<-solr record start->
sku: 1234
make: acura, acura, acura
model: integra, rsx, rsx
year: 1997, 2004, 2000
engine: 3.4, 4.5, 4.5
summary: "acura" "integra" "1997" "3.4", "acura" "rsx" "2004" "4.5", "acura" "rsx" "2000" "4.5"
<-solr record end->

I then add to my query the following:

summary:("acura" AND "integra" AND "2000")

I would expect that, with this added to my query, the record would no longer come up, since there is no acura integra 2000 in the summary field. However, this doesn't work. The record still comes up. I am stumped. Does anyone have a solution to this problem? It's been killing me for days.

I basically want Solr to search each record of the multivalued field for my search parameter. Is this possible? Is there a better way to do what I am trying to do?

Thanks

A: 

It seems that your schema isn't quite right. You need to completely denormalize your data and create one document per vehicle. What a "vehicle" means depends on what kind of searches you will run. For example, a possible schema would be:

sku: 1234
make: acura
model: integra
years: 1997
engines: 3.4, 4.5

sku: 1235
make: acura
model: rsx
years: 2000, 2004
engines: 4.5

The summary field would be a copyField of make+model+years+engines
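As a rough illustration of the denormalization, the Python sketch below expands one record with parallel multivalued fields into one flat document per application. It is the simplest variant (one document per row, rather than grouping years and engines as in the example above); the `explode_applications` helper and the generated `id` values are assumptions made for the example, not part of any Solr API.

```python
# Record with parallel multivalued fields, as in the question.
record = {
    "sku": "1234",
    "make": ["acura", "acura", "acura"],
    "model": ["integra", "rsx", "rsx"],
    "year": ["1997", "2004", "2000"],
    "engine": ["3.4", "4.5", "4.5"],
}

def explode_applications(doc, parallel_fields):
    """Return one flat document per row of the parallel fields.
    The 'id' format (sku-index) is a hypothetical choice for this sketch."""
    rows = zip(*(doc[f] for f in parallel_fields))
    flat_docs = []
    for i, row in enumerate(rows):
        flat = {"id": f"{doc['sku']}-{i}", "sku": doc["sku"]}
        flat.update(dict(zip(parallel_fields, row)))
        flat_docs.append(flat)
    return flat_docs

docs = explode_applications(record, ["make", "model", "year", "engine"])

# With single-valued fields, ANDed filters can only match within one application.
hits = [d for d in docs
        if d["make"] == "acura" and d["model"] == "integra" and d["year"] == "2000"]
print(len(docs), len(hits))  # 3 0
```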

Mauricio Scheffer
Hey. This was exactly my original solution, and it works perfectly! The only problem is that when you go to the product screen of my site, the count shows the number of applications instead of the number of products. If the same make, model, and year belong to a SKU, and only the engine or submodel is different, I simply state that in a "fits" section below the product. For example, I might roll 4 applications together into 1 product "box" on my site. The search results should say "viewing 1 of 1". Instead they say "viewing 4 of 4" (even though there is one box on the screen). Hence my new schema...
Dan
@Dan: take a look at field collapsing: http://wiki.apache.org/solr/FieldCollapsing
Mauricio Scheffer
The issue is that a SKU (12345) can fit multiple vehicles. Having each SKU + vehicle pair as its own record is nice, but viewing them as a customer is horrible. Rolling up each vehicle (and its applications) and attaching it to the SKU that it fits makes it more viewable. However, you might be viewing 10 applications per page but have only 3 "boxes" in which a user can buy something. So it says "Viewing Items 1 through 10", but only 3 "boxes" with pictures and a "click to buy" button are listed.
Dan
Sorry Dan, but those questions are really specific to your particular project; I can't answer them.
Mauricio Scheffer
It's OK. Thanks for the help anyway. I will take a look at field collapsing.
Dan
A: 

Can you not just do a query as follows?

make:acura AND model:integra AND year:2000

I.e., without the quotes around the make and model.

CraftyFella
A: 

I am still not sure how to maintain parallelism without a summary field, but I figured out how to do it with one. AND statements, I believe, search every record in the multivalued field for a match, so each AND'ed term can match a different row of the multivalued field, not necessarily the same row. Instead, put the exact terms you're looking for in the same order in which you built the original summary record, and use the ~ operator.

Take a look at the following example:

The following are the contents of the summary field in one of the rows in the multivalued field, which I wish to match: "Honda" "Accord" "2004" "3.5L"

Here is the query I will run: summary_field:("\"Honda\" \"2004\"")

The above query alone will not work. I can have a function that puts user input from the application into the same order in which the summary field was built, but because users can enter any subset of the data (a make, a model, a year), there may be other words in between the terms I am trying to match. In the above example, I want to match Honda 2004 to that record. However, Accord sits between them.

To get around this problem, simply use the ~n operator, where n is the maximum number of other terms in between the terms you are searching for. So if I instead use:

summary_field:("\"Honda\" \"2004\""~1)

I am saying that between Honda and 2004 there may be 1 other word. Therefore, the above query will match. Even if you add multiple terms to your summary field, as long as you query against it with the values in the same order, and your proximity value (the number after ~) is the maximum possible distance between 2 values, your query will match the correct summary row. So, if you have 20 fields that you add to your summary field to maintain parallelism, you simply need to use ~18, as that is the maximum possible distance in the worst case (the user picks only the first and the last field, leaving 18 words between them).
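The approach above can be sketched in Python as follows. `build_proximity_query` only assembles the query string in the form shown above. `row_matches` is a deliberately simplified stand-in for the proximity check (terms must appear in order within one row, each at most once), not Lucene's actual sloppy-phrase algorithm, and it assumes the analyzer keeps each quoted value as a single token with its case preserved.

```python
def build_proximity_query(field, terms, slop):
    """Build field:("\\"t1\\" \\"t2\\""~slop) with the terms in summary order."""
    phrase = " ".join('\\"' + t + '\\"' for t in terms)
    return field + ':("' + phrase + '"~' + str(slop) + ")"

def row_matches(row_tokens, terms, slop):
    """Simplified model: all terms occur in order in this one row, and the
    number of extra positions they are spread over is at most `slop`."""
    try:
        positions = [row_tokens.index(t) for t in terms]
    except ValueError:
        return False  # some term is missing from this row
    if positions != sorted(positions):
        return False  # simplification: out-of-order terms are rejected
    shifts = [p - i for i, p in enumerate(positions)]
    return max(shifts) - min(shifts) <= slop

def doc_matches(summary_rows, terms, slop):
    """A document matches if any single row of the multivalued field matches."""
    return any(row_matches(row, terms, slop) for row in summary_rows)

summary = [["Honda", "Accord", "2004", "3.5L"],
           ["Honda", "Civic", "2001", "1.7L"]]

print(build_proximity_query("summary_field", ["Honda", "2004"], 1))
print(doc_matches(summary, ["Honda", "2004"], 0))  # False: Accord is in between
print(doc_matches(summary, ["Honda", "2004"], 1))  # True
print(doc_matches(summary, ["Civic", "2004"], 1))  # False: terms are in different rows
```

This model treats each value separately. In Solr, whether a sloppy phrase can span two adjacent values of a multivalued field depends on the field type's positionIncrementGap, so the number after ~ should stay below that gap.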

Dan