I need to run a JOIN query on a Solr index. I have indexed two XML files, person.xml and subject.xml.

Person:

<doc>
<field name="id">P39126</field>
<field name="family">Smith</field>
<field name="given">John</field>
<field name="subject">S1276</field>
<field name="subject">S1312</field>
</doc>

Subject:

<doc>
<field name="id">S1276</field>
<field name="topic">Abnormalities, Human</field>
</doc>

I only need to display information from the person docs, but each query should match fields in both person and subject. When the query matches only a subject doc, I need to display all person docs whose subject field references that subject's id. Is this possible without running two separate queries? Something like a JOIN query would do the job.

Any help?

A: 

I do not think it is possible to do what you are asking with a single query using your schema.

One thing to keep in mind is to always think of a Solr index as a single denormalized table. This is sometimes a challenge, and there may be times when you are forced to use a separate index for each kind of data.

For your problem, a schema like this one might help:

<doc>
 <field name="id">P39126</field>
 <field name="family">Smith</field>
 <field name="given">John</field>
 <field name="topic">Abnormalities, Human</field> <!-- subject S1276 -->
 <field name="topic">some, other, topics</field> <!-- subject S1312 -->
</doc>

Querying for some topics against this schema would return all persons having those topics.
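To make the denormalization step concrete, here is a minimal sketch of flattening a person record at index time, assuming both files have already been parsed into plain Java maps. The names Denormalizer, flatten and topicBySubjectId are hypothetical and are not part of Solr; the sketch only shows the shape of the transformation, not how to send the result to Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Solr class: copies a person's fields and
// replaces each subject id with the topic text it refers to.
class Denormalizer {
    static Map<String, List<String>> flatten(
            Map<String, List<String>> person,
            Map<String, String> topicBySubjectId) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        List<String> topics = new ArrayList<>();
        for (Map.Entry<String, List<String>> field : person.entrySet()) {
            if (field.getKey().equals("subject")) {
                // Resolve every referenced subject id to its topic, skipping unknown ids.
                for (String subjectId : field.getValue()) {
                    String topic = topicBySubjectId.get(subjectId);
                    if (topic != null) {
                        topics.add(topic);
                    }
                }
            } else {
                doc.put(field.getKey(), new ArrayList<>(field.getValue()));
            }
        }
        doc.put("topic", topics);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, List<String>> person = new LinkedHashMap<>();
        person.put("id", Arrays.asList("P39126"));
        person.put("family", Arrays.asList("Smith"));
        person.put("given", Arrays.asList("John"));
        person.put("subject", Arrays.asList("S1276", "S1312"));

        // Only S1276 is shown in the question, so only that topic is known here.
        Map<String, String> topicBySubjectId = new HashMap<>();
        topicBySubjectId.put("S1276", "Abnormalities, Human");

        System.out.println(flatten(person, topicBySubjectId));
    }
}
```

Each flattened map corresponds to one person doc like the one above: the subject ids are gone and the topic text is stored directly on the person, so a single query on topic returns persons.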

Pascal Dimassimo
Thanks very much, Pascal. I'm not sure about changing the schema, though. We've got some quite big XML files to index (about 4), each one with its own schema and IDs that connect them to one another. Merging all of these into one big XML file means a lot of repetition and an enormous file. I don't know which is better performance-wise: making some extra queries, or having one huge XML file with a LOT of repetition.
Sfairas
I would tend to say that, in most cases, you should not worry about the repetition and should strive to have your main entity completely denormalized. The Lucene index handles that quite well. At query time, you can always use the fl parameter to limit which fields are returned to the users.
Pascal Dimassimo
A: 

If you can't denormalize as suggested by Pascal, you could write your own query handler to do the join: first issue a query on the requested topics that requests only the id field of the matching subject documents, then issue a BooleanQuery containing one clause for each id (a TermQuery on subject = id). This will perform pretty poorly if there are a large number of ids, but should be fine if only a few ids match.

If you anticipate that your "join" queries will generally match a lot (say hundreds) of subjects, then you're probably better off denormalizing as suggested.
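As a rough illustration of the second step, here is a sketch that turns the ids returned by the first query into the second query. It builds the query-string equivalent, subject:("id1" OR "id2"), rather than constructing Lucene BooleanQuery and TermQuery objects, so it runs without Solr on the classpath. The class and method names (SubjectJoinQuery, forSubjectIds) are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not a Solr class: builds the second query of the
// two-step "join" from the subject ids matched by the first query.
class SubjectJoinQuery {
    static String forSubjectIds(List<String> subjectIds) {
        if (subjectIds.isEmpty()) {
            // Nothing matched in the first step, so the caller should skip the second query.
            throw new IllegalArgumentException("no subject ids to join on");
        }
        return subjectIds.stream()
                // Quote each id and escape backslashes and quotes inside it.
                .map(id -> "\"" + id.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                // One OR clause per id, all against the person's subject field.
                .collect(Collectors.joining(" OR ", "subject:(", ")"));
    }

    public static void main(String[] args) {
        // prints: subject:("S1276" OR "S1312")
        System.out.println(forSubjectIds(Arrays.asList("S1276", "S1312")));
    }
}
```

The clause count grows with the number of matching subjects, which is where the performance concern comes from. Solr also caps the number of boolean clauses per query (maxBooleanClauses in solrconfig.xml, 1024 by default), so a very large id list would fail outright unless that limit is raised.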

I don't know the most elegant way to issue a query from a handler, but FWIW here's how I do it.

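// Runs inside a request handler; req is the incoming SolrQueryRequest.
// Needs java.util.HashMap and java.util.Map, plus Solr's SolrCore, SolrIndexSearcher,
// LocalSolrQueryRequest, SolrQueryResponse and CommonParams.
// Assumption: core is not already in scope, so take it from the request.
final SolrCore core = req.getCore();
final SolrIndexSearcher searcher = req.getSearcher();
Map<String, String[]> args = new HashMap<String, String[]>();
// add your query parameters to the map, like fields to return
args.put("fl", new String[]{"id"});
String query = "your query";
// arguments: core, query, qtype, start, limit, extra params
// (start = 0 and limit = 0 are kept from the original; adjust them to the rows you need)
LocalSolrQueryRequest newReq = new LocalSolrQueryRequest(core, query, "", 0, 0, args) {
  // reuse the outer request's searcher for the inner query
  @Override public SolrIndexSearcher getSearcher() { return searcher; }
  // the outer request owns the searcher, so there is nothing to release here
  @Override public void close() { }
};
SolrQueryResponse newRsp = new SolrQueryResponse();
core.execute(core.getRequestHandler(newReq.getParams().get(CommonParams.QT)), newReq, newRsp);
// query results will be in newRsp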
Dallan Quass
Thanks very much! In my case I think going this way is not really an option. My dataset is huge and will have quite a few matching IDs, which would hurt performance. I'm now investigating Solr's datahandler, which might do the job. I will have to build a database, though.
Sfairas