I'm working with an existing web platform that uses Solr to generate query-based datasets. We have a problem with near-real-time (< 1 minute) publishing of new content. A caching mechanism reduces the load on the Solr servers, but it also delays the appearance of new content in the Solr-query-based datasets.

I'd like to invalidate cached items based on the Solr query that generated them, but I've hit a stumbling block: with 1000+ Solr queries, it's difficult to know which (if any) of them match a given document. The approaches we've identified so far are:

  1. Stand up a separate Solr instance, push in a single document at a time, and run the queries to see which ones hit.
  2. Build an in-memory Lucene index and do the same thing.
  3. Use some other technique (hand-rolled parsing of the Solr queries) to get a rough estimate of which queries are affected.

None of these is ideal, but without some way to "turn around" the process and run the document through the queries, CEP-style (complex event processing), I'm not sure there's a better way.

Has anyone dealt with a similar situation?

A: 

Solr emits ETags for all query responses and honors standard HTTP cache request headers such as If-None-Match and If-Match. See "Solr And HTTP Caches".

So it's a matter of coordinating your cache system around this.
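As a rough illustration of what that coordination could look like on the client side, here is a minimal sketch of a cache that revalidates with `If-None-Match` instead of expiring on a timer. The `ETagCache` class, the injectable `opener` parameter, and any URL you pass in are assumptions made for this example, not part of Solr; only the ETag / 304 Not Modified exchange comes from standard HTTP caching.

```python
import urllib.error
import urllib.request


class ETagCache:
    """Caches response bodies per URL and revalidates them with If-None-Match."""

    def __init__(self, opener=urllib.request.urlopen):
        # `opener` is injectable (hypothetical design choice) so the cache
        # can be exercised without a running Solr server.
        self._opener = opener
        self._entries = {}  # url -> (etag, body)

    def get(self, url):
        headers = {}
        cached = self._entries.get(url)
        if cached is not None:
            # Ask the server to answer 304 if its ETag still matches ours.
            headers["If-None-Match"] = cached[0]
        request = urllib.request.Request(url, headers=headers)
        try:
            with self._opener(request) as response:
                body = response.read()
                etag = response.headers.get("ETag")
        except urllib.error.HTTPError as err:
            # urllib reports 304 Not Modified as an HTTPError.
            if err.code == 304 and cached is not None:
                return cached[1]
            raise
        if etag:
            self._entries[url] = (etag, body)
        return body
```

With this shape, every lookup still costs one round trip, but an unchanged index answers with an empty 304 rather than a full result set, so the cached body is reused until the server's ETag changes.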

Mauricio Scheffer
That looks great, but I think my bigger issue is around the addition of data to the Solr index. I have steady additions, sometimes with potentially large streams of content. What's most painful about this is the addition of these items to the Solr index and the reindexing needed to maintain optimal performance. If I were to do a HEAD request to test caching for every piece of content, I'm afraid I might bring the server to its knees rather rapidly.
Harper Shelby
@Harper: you don't need to send HEAD requests to test caching; that's not how HTTP caching works. See http://www.grabner-online.de/div_into/html/ch11s03s04.html
Mauricio Scheffer
@Harper: this way you delegate caching expiration to Solr, where it belongs.
Mauricio Scheffer
What's being cached isn't strictly the Solr result though - the Solr result(s) are being pulled together at the next level, which is where the caching occurs.
Harper Shelby
@Harper: there's the problem. If you don't let Solr handle caching expiration, you will complicate things a lot.
Mauricio Scheffer
A: 

I think the standard way is to make an "index" out of the single changed document (using a memory index). You then run your thousands of queries on this index, and if the query matches, you invalidate the cache for that query. Since the index is so small and is entirely in memory, it's very fast.
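The shape of that loop can be sketched in a few lines. This is a toy stand-in written in plain Python, not Lucene's in-memory index and not the Solr query parser: it assumes each registered query is just a list of `(field, term)` pairs that must all be present, and the function names and cache keys are made up for illustration. A real implementation would parse the actual Solr queries and run them against a real single-document index.

```python
import re
from collections import defaultdict


def index_document(doc):
    """Build a tiny in-memory index for one document: field -> set of lowercase tokens."""
    index = defaultdict(set)
    for field, text in doc.items():
        index[field].update(re.findall(r"\w+", str(text).lower()))
    return index


def matches(index, query):
    """Assumed query model: every (field, term) pair must be present (AND)."""
    return all(term.lower() in index.get(field, ()) for field, term in query)


def queries_to_invalidate(doc, registered_queries):
    """Return the cache keys of every registered query the new document hits."""
    index = index_document(doc)  # indexed once, then reused for all queries
    return [key for key, query in registered_queries.items()
            if matches(index, query)]


# Hypothetical cache keys mapped to simplified queries.
registered = {
    "news-solr": [("category", "news"), ("body", "solr")],
    "sports": [("category", "sports")],
}
new_doc = {"category": "News", "body": "Solr 1.4 released"}
stale_keys = queries_to_invalidate(new_doc, registered)
```

Here `stale_keys` contains only `"news-solr"`, so only that cached dataset would be dropped when the document is published. The cost per document is one tiny index build plus one cheap check per registered query.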

Xodarap