I'm working on a structured document viewer, where each Solr document is a "section" or "paragraph" in a large set of legal documents, along with assorted metadata. I have a corpus which will probably represent 10^12 or more of these sections. I want to provide paging for the user so that they can view N of these sections at a time in sort_path order.


Now the problem: Even if sort_path is indexed, there are docs being added and removed all the time. A simple sort and paging solution will end up with users possibly skipping sections or jumping around in the ordering unexpectedly, even when they are nowhere near the documents being added/removed in the ordering; this behavior would be unacceptable.

Example: I make the "next" page link point at something like ...sort=sort_path+desc&rows=N&start=12345. Then, while the user is viewing the page, a document earlier in the sort_path order is deleted. Every later document shifts up by one position, so when they fetch the next N rows, they will have skipped 1 document without knowing.
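To make the failure concrete, here is a minimal sketch in plain Python that imitates offset paging over a changing index. It does not talk to Solr; the `fetch_page` helper and the `/A/1` style paths are made up for illustration, and it sorts ascending for readability.

```python
def fetch_page(docs, start, rows):
    """Imitate offset paging: sort everything, then slice by start and rows."""
    return sorted(docs)[start:start + rows]


# Hypothetical index of six sections, identified by their sort_path.
index = ["/A/1", "/A/2", "/A/3", "/A/4", "/A/5", "/A/6"]
page_size = 2

first_page = fetch_page(index, 0, page_size)

# A section early in the ordering is deleted while the user reads page 1.
index.remove("/A/1")

# The "next" link still says start=2, but every document has shifted up by one.
second_page = fetch_page(index, page_size, page_size)

seen = first_page + second_page
skipped = [path for path in ["/A/3", "/A/4"] if path not in seen]

print(first_page)   # ['/A/1', '/A/2']
print(second_page)  # ['/A/4', '/A/5']
print(skipped)      # ['/A/3']
```

The user never sees /A/3, even though the deletion happened on a page they had already read.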

So, given I have a sort_path field which orders the sections, the front end needs to be able to ask for N sections "before" or "after" sort_path:/X/Y/Z, instead of asking for rows:N with start:12345. I have no idea how to represent this in a Solr query.


I may be pushing the edges of Solr a little far, and it may end up making more sense to store representations of these "section" documents both in Solr (for content searches, which Solr is awesome at) and an RDBMS (for ordering and indexing). I was hoping to avoid that, and this sort of query is still going to be ugly in a database, so maybe you've got some ideas. (Thanks!)


Update:

It turns out that Solr range queries combined with sorting may give me exactly what I need. On the indexed field, I can do something like

sort_path:["/A/B/C" TO *]

to get the "next" N sections, and do

sort_path:[* TO "/A/B/C"]

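sorting by sort_path desc and then reversing the returned chunk to get the previous N sections. Both ranges are inclusive, so the boundary section /A/B/C itself comes back in the results and has to be dropped if it was already shown. I am going to test the performance of this solution, but it seems viable.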
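A rough sketch of how the front end might build these requests, in Python. Several things here are assumptions rather than anything tested: that sort_path is an untokenized string field with a unique value per section, that the range goes in an `fq` filter with `q=*:*` (it could equally go in `q`), and the helper names `quote_term`, `page_params`, and `trim_page`, which are made up for this example. Because the ranges are inclusive, it asks for one extra row and drops the boundary section on the client.

```python
from urllib.parse import urlencode


def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def page_params(cursor, rows, direction="next"):
    """Build Solr request parameters for the page before or after `cursor`."""
    if direction == "next":
        fq = "sort_path:[%s TO *]" % quote_term(cursor)
        sort = "sort_path asc"
    else:
        fq = "sort_path:[* TO %s]" % quote_term(cursor)
        sort = "sort_path desc"
    # rows + 1 because the inclusive range also returns the cursor document.
    return {"q": "*:*", "fq": fq, "sort": sort, "rows": rows + 1}


def trim_page(paths, cursor, rows, direction="next"):
    """Drop the cursor document and restore ascending order for display."""
    page = [p for p in paths if p != cursor][:rows]
    if direction == "prev":
        page.reverse()
    return page


params = page_params("/A/B/C", 10)
query_string = urlencode(params)  # append to the select handler URL
```

The "next" link then carries the sort_path of the last section on the current page instead of a numeric offset, so deletions elsewhere in the ordering no longer shift the window.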

+1  A: 

This is not really a Solr-specific problem, but a general problem with paginating any external data source, because the data source's state is independent of the (web) application. For example, it also happens with relational databases. Here's good coverage of pagination in relational databases, along with the possible solutions. Most web applications / websites take the first solution, "Repeat the query for each new request", since the other solutions are much more complex and not scalable, but this suffers from the problem you describe. Browse the questions on stackoverflow.com for a while and you'll notice it, since questions are being created constantly.

In your case I'd consider modeling each Solr document as a whole legal document instead of an individual section. You'll get far fewer documents (and therefore a slower rate of inserts/deletes), and you can use the highlighting parameters to get snippets of the sections that matched the user query.
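As a rough illustration of the highlighting parameters, the request could be built like this in Python. The `hl`, `hl.fl`, `hl.snippets`, and `hl.fragsize` parameters are standard Solr highlighting options; the field name `content`, the query `negligence`, and the `highlight_params` helper are hypothetical.

```python
from urllib.parse import urlencode


def highlight_params(user_query, field="content", snippets=3, fragsize=200):
    """Build Solr parameters that ask for highlighted snippets of one field."""
    return {
        "q": user_query,
        "hl": "true",            # turn highlighting on
        "hl.fl": field,          # field to take snippets from (assumed name)
        "hl.snippets": snippets, # maximum snippets per document
        "hl.fragsize": fragsize, # approximate snippet length in characters
    }


query_string = urlencode(highlight_params("negligence"))
```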

Another option would be decreasing your commit rate, but this could end up in less-than-ideal document freshness.

Mauricio Scheffer
+1 indexing a whole document is probably the easiest way to go if it is feasible in your case
Pascal Dimassimo
Thanks Mauricio, some good thoughts. The problem with indexing the entire document is that I want to be able to present smaller subsets of the documents to the user in the UI, because some of these documents are thousands of pages long. I wanted to store divs per paragraph, and be able to present them to the user "in piecemeal", but like you say, it's a general problem when paginating.
Dan Fitch
@Dan Fitch: what about highlighting?
Mauricio Scheffer
When I say in piecemeal, I mean that I want to use Solr to store actual markup for the sections, not just the plaintext content, and then there is more than just searching this corpus: there needs to be a way to browse the entire "document" by assembling a "subset" of the sections into a viewable chunk. Sorry this isn't particularly clear, it's not super clear in my head yet either. :)
Dan Fitch