Hello,

I have been researching how Solr handles documents such as DOC or PDF files that are submitted to it. If I submit PDFs to Solr, does it store the PDF file itself along with the index generated from parsing it?

Thanks,

-Keshav

+2  A: 

Solr (Lucene) does not store the PDF file itself. It can, however, store the text extracted from the PDF by a text extractor such as Tika, provided the field is marked as stored in the schema.

If you want to store the entire PDF, you will need to convert it to (for example) a Base64 string and persist that string in a stored field. When you retrieve the document, you decode the Base64 string back into the PDF.
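A minimal sketch of the Base64 round trip, in Python. The field name `pdf_base64` and both helper functions are hypothetical, and the sketch only builds the document as a dict; sending it to Solr is left out. It assumes the schema defines that field as stored.

```python
import base64


def pdf_to_solr_doc(doc_id: str, pdf_bytes: bytes) -> dict:
    """Build a document dict carrying the whole PDF as a Base64 string.

    "pdf_base64" is a hypothetical field name; the schema is assumed to
    mark it as stored so the value comes back with search results.
    """
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {"id": doc_id, "pdf_base64": encoded}


def pdf_from_solr_doc(doc: dict) -> bytes:
    """Recover the original PDF bytes from a retrieved document."""
    return base64.b64decode(doc["pdf_base64"])
```

Keep in mind that Base64 makes the payload roughly a third larger than the original file, so large PDFs will grow the stored data accordingly.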

Mikos
Or save the PDF to the filesystem and store its location in a stored field.
R Ubben
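The filesystem approach could look roughly like this in Python. The function name, the `pdf_path` field, and the choice to name files by their SHA-256 hash are all assumptions made for illustration; only the file's location ends up in the document.

```python
import hashlib
from pathlib import Path


def save_pdf_and_build_doc(doc_id: str, pdf_bytes: bytes, storage_dir: str) -> dict:
    """Write the PDF to disk and return a document dict holding its path.

    "pdf_path" is a hypothetical stored field. Naming the file after its
    content hash is one possible way to avoid name collisions.
    """
    directory = Path(storage_dir)
    directory.mkdir(parents=True, exist_ok=True)
    file_path = directory / (hashlib.sha256(pdf_bytes).hexdigest() + ".pdf")
    file_path.write_bytes(pdf_bytes)
    return {"id": doc_id, "pdf_path": str(file_path)}
```

This keeps the stored data small, but the files then have to be backed up and kept in sync with the index separately.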
Mikos, thanks for your response! You mentioned that the text contents of the PDF can be stored. But is storing the text necessary for the index search to work?
Keshav
Not necessary for searching. But if you need highlighting (snippets), then you will need to store the text.
Mikos