Hi guys,

I want to index some articles and show the paragraph number in the search results. So I guess the Solr schema should look like this:

article_id, paragraph_number, paragraph_content

So I need to parse each article first, extract its paragraphs, and index them one by one.

I'm worried about performance, since one article can contain 100 paragraphs.

Any suggestions?

+1  A: 

It is better to do the heavy lifting at index time rather than search time. So parsing the paragraphs out of the document when you index is probably the right way to go.

How many articles do you have? It really shouldn't be a problem to strip paragraphs (we do much more complex pre-processing than that).
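For illustration, here is a minimal sketch of what that index-time splitting could look like. It assumes paragraphs are separated by blank lines, and it adds an `id` field built from the article id and paragraph number so every paragraph document has a unique key; both are assumptions, not something from the question. The function name `article_to_docs` is made up for the example. The field names are the ones from the question.

```python
import re

# Assumption: paragraphs are separated by one or more blank lines.
_PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def article_to_docs(article_id, text):
    """Turn one article into a list of per-paragraph documents."""
    docs = []
    number = 0
    for chunk in _PARAGRAPH_BREAK.split(text):
        content = chunk.strip()
        if not content:
            continue  # skip empty chunks so numbering stays contiguous
        number += 1
        docs.append({
            # Hypothetical unique key: article id plus paragraph number.
            "id": f"{article_id}_{number}",
            "article_id": article_id,
            "paragraph_number": number,
            "paragraph_content": content,
        })
    return docs
```

The resulting list of dicts can be sent to Solr in batches with whatever client you use, which is usually cheaper than one request per paragraph.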

leonm
Thank you, leonm. At least 10,000 articles, so there might be 10,000 * 100 paragraphs. I guess there should be no problem :)
Ke
That comes down to 1 million entries, which should be pretty straightforward. Keep an eye on the code that splits the paragraphs out; you want to make it as efficient as possible.
leonm
A: 

If you only need to match individual paragraphs against the fulltext query (as opposed to filters etc.), you could also do this using highlighting -- split up the paragraphs, prefix each one with its paragraph number, and then index the paragraphs as multiple values in a single field in a single document. At search time, you'd do a highlight on the field with a full match (e.g. fragment size of -1) and no decoration of the highlight; so what you'd get back is the paragraph that matched the fulltext query, prefixed by its paragraph number (which you'd probably want to then pull back out).

Not sure if this fits your use case exactly but might be an interesting approach to try -- I do something similar to identify photos whose caption matches the fulltext query to display next to article search results.
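A rough sketch of the idea, in case it helps. The field name `paragraphs`, the `[N] ` prefix format, and the function names are all made up for the example. The request parameters (`hl`, `hl.fl`, `hl.fragsize`, `hl.snippets`, `hl.simple.pre`, `hl.simple.post`, `df`) are standard Solr parameters, but the values are assumptions: Solr's documentation describes `hl.fragsize=0` as "use the whole field value", so that is what the sketch uses for the full match. Check what your version expects.

```python
import re
from urllib.parse import urlencode

# Matches the hypothetical "[N] " prefix put in front of each paragraph.
_PREFIX = re.compile(r"^\[(\d+)\]\s*(.*)$", re.DOTALL)


def build_article_doc(article_id, paragraphs):
    """One document per article; paragraphs go into one multi-valued field."""
    prefixed = [f"[{n}] {p}" for n, p in enumerate(paragraphs, start=1)]
    return {"id": article_id, "paragraphs": prefixed}


def highlight_params(user_query):
    """Query parameters asking Solr to return whole matching paragraphs."""
    return {
        "q": user_query,
        "df": "paragraphs",     # search the paragraph field by default
        "hl": "true",
        "hl.fl": "paragraphs",  # highlight only the paragraph field
        "hl.fragsize": "0",     # assumption: 0 means no fragmenting
        "hl.snippets": "10",    # assumption: up to 10 paragraphs per article
        "hl.simple.pre": "",    # no decoration around matched terms
        "hl.simple.post": "",
    }


def split_snippet(snippet):
    """Pull the paragraph number back out of a returned snippet."""
    match = _PREFIX.match(snippet)
    if match is None:
        return None, snippet
    return int(match.group(1)), match.group(2)
```

The query string would be `urlencode(highlight_params("your terms"))` appended to your core's select URL, and each snippet in the `highlighting` section of the response can be passed through `split_snippet`. One caveat: the number in the prefix is indexed along with the text, so a query for a bare number could match it.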

outoftime
Thank you, outoftime. Matching individual paragraphs against the fulltext query is exactly what I want. But I am new to Solr; could you please explain it more?
Ke