Solr search and automated web publishing - can they work together?

I'm dealing with an existing web platform that uses SOLR to generate query-based datasets. We have an issue with near-real-time (< 1 minute) publishing of new content. There is a caching mechanism in place to reduce resource load on the SOLR servers, but this caching delays the appearance of new content in SOLR-query-based datasets.

I'd like to be able to invalidate the cache based on the SOLR query that generated a cached item, but I've run into a stumbling block: with 1000+ SOLR queries, it's difficult to know which (if any) of them apply to a given document. The approaches we've identified so far include:

1. Instantiate a SOLR instance, push a single document in at a time, and run the queries to see which hit.
2. Build an in-memory Lucene index, and do the same thing.
3. Use some other technique (hand-rolled parsing of the SOLR query) to get a rough estimate of which queries are affected.

None of these is really ideal, but without some way to "turn around" the process and run the document through the queries CEP style, I'm not sure there's a better way.
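To make the "turned around" idea concrete, here is a minimal sketch of running one document through a set of stored queries. It is closest to the hand-rolled option above and makes some loud assumptions: it only understands `field:term` clauses joined by `AND`, it tokenizes on whitespace instead of using the real Solr analyzers, and the names `StoredQuery`, `parse_query`, `affected_cache_keys` and the cache keys are all hypothetical. Treat it as a rough estimate, not a replacement for real query evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredQuery:
    cache_key: str   # hypothetical key of the cached dataset this query feeds
    clauses: tuple   # (field, term) pairs that must ALL match


def parse_query(cache_key, query):
    """Parse a tiny subset of query syntax: 'field:term AND field:term'."""
    clauses = []
    for part in query.split(" AND "):
        field, _, term = part.strip().partition(":")
        clauses.append((field.strip(), term.strip().lower()))
    return StoredQuery(cache_key, tuple(clauses))


def tokens(value):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return set(str(value).lower().split())


def affected_cache_keys(document, stored_queries):
    """Return the cache keys of every stored query the document would hit."""
    doc_terms = {field: tokens(value) for field, value in document.items()}
    hits = []
    for query in stored_queries:
        if all(term in doc_terms.get(field, set())
               for field, term in query.clauses):
            hits.append(query.cache_key)
    return hits


if __name__ == "__main__":
    queries = [
        parse_query("news-solr", "category:news AND title:solr"),
        parse_query("sports", "category:sports"),
    ]
    new_doc = {"category": "news", "title": "Solr caching tips"}
    # Only the cached datasets returned here would need invalidating.
    print(affected_cache_keys(new_doc, queries))
```

Anything this sketch cannot parse (ranges, wildcards, OR/NOT, boosts) would have to be treated as "possibly affected" to stay on the safe side, which is why it only gives an estimate.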

Has anyone dealt with a similar situation?

That looks great, but I think my bigger issue is around the addition of data to the Solr index. I have steady additions, sometimes with potentially large streams of content. What's most painful about this is the addition of these items to the Solr index and the reindexing needed to maintain optimal performance. If I were to do a HEAD request to test caching for every piece of content, I'm afraid I might bring the server to its knees rather rapidly.

Harper Shelby 2010-08-17 18:03:08

@Harper: you don't need to send HEAD requests to test caching; that's not how HTTP caching works. See http://www.grabner-online.de/div_into/html/ch11s03s04.html

Mauricio Scheffer 2010-08-17 18:46:37

@Harper: this way you delegate caching expiration to Solr, where it belongs.

Mauricio Scheffer 2010-08-17 18:50:00
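For readers unfamiliar with the mechanism these comments appear to refer to, here is a rough sketch of validation-based HTTP caching from the client side. It assumes the server (Solr, in this thread) is configured to send an `ETag` validator and to answer `304 Not Modified` when the client's `If-None-Match` still matches. `ConditionalCache` and `http_fetch` are hypothetical names for illustration, not part of any Solr client library.

```python
import urllib.error
import urllib.request


def http_fetch(url, headers):
    """Default fetcher: returns (status, etag, body) for a GET request."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.headers.get("ETag"), response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return 304, err.headers.get("ETag"), b""
        raise


class ConditionalCache:
    """Keeps the last body per URL and revalidates it with If-None-Match."""

    def __init__(self, fetch=http_fetch):
        self._fetch = fetch     # injectable so the logic can run without a server
        self._entries = {}      # url -> (etag, body)

    def get(self, url):
        headers = {}
        cached = self._entries.get(url)
        if cached and cached[0]:
            headers["If-None-Match"] = cached[0]
        status, etag, body = self._fetch(url, headers)
        if status == 304 and cached:
            # Server says nothing changed: reuse the stored body.
            return cached[1]
        self._entries[url] = (etag, body)
        return body
```

The point of the design is that the client never decides when an entry expires. It asks every time, and an unchanged result costs the server a cheap 304 instead of a full response; whether that is cheap enough for a given Solr setup is something to measure rather than assume.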

What's being cached isn't strictly the Solr result though - the Solr result(s) are being pulled together at the next level, which is where the caching occurs.

Harper Shelby 2010-08-17 21:10:32

@Harper: there's the problem. If you don't let Solr handle caching expiration, you will complicate things a lot.

Mauricio Scheffer 2010-08-17 21:17:51
