Apache Solr - are the documents themselves stored internally apart from the index?

Q:

Apache Solr - are the documents themselves stored internally apart from the index?

Hello,

I have been researching how Solr handles documents such as DOC or PDF files that are submitted to it. If I submit PDFs to Solr, does it store the PDF file itself alongside the index generated from parsing it?

Thanks,

-Keshav

+2 A:

Solr (Lucene) does not store the PDF file itself. However, it can store the text content extracted from the PDF by a text extractor such as Tika, provided the field is marked as stored in the schema.

If you wish to store the PDF file in its entirety, you will need to convert it into, for example, a Base64 representation and persist the Base64 string as a stored field. When you retrieve the document, you convert the Base64 string back into the PDF.
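As a rough illustration of the Base64 round trip described above, here is a minimal Python sketch. The field name `content_b64`, the document dict, and the helper names are hypothetical; nothing here talks to Solr, it only shows the encode and decode steps you would do before indexing and after retrieval.

```python
import base64


def pdf_to_stored_value(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as an ASCII Base64 string for a stored field."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def stored_value_to_pdf(stored_value: str) -> bytes:
    """Decode the Base64 string read back from the stored field into PDF bytes."""
    return base64.b64decode(stored_value)


# Stand-in bytes; in practice you would read them from the real PDF file.
sample_pdf = b"%PDF-1.4\n% stand-in for real PDF bytes\n"

# Hypothetical document as it might be sent for indexing.
doc = {
    "id": "doc-1",
    "content_b64": pdf_to_stored_value(sample_pdf),
}

# After retrieving the document, convert the field back into the PDF.
restored = stored_value_to_pdf(doc["content_b64"])
```

Keep in mind that Base64 makes the payload about a third larger than the original file, so the stored field will take more space than the PDF itself.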

Mikos 2010-08-06 18:27:02

Or save the PDF to the filesystem and store its location in a stored field.
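A minimal Python sketch of that idea, assuming a hypothetical stored field called `file_path` and a storage directory of your choosing. Naming the file after a hash of its content is just one possible convention, and the helper names are made up for illustration.

```python
import hashlib
from pathlib import Path


def save_pdf_and_build_doc(pdf_bytes: bytes, doc_id: str, storage_dir: Path) -> dict:
    """Write the PDF to disk and return a document that records its location."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    # Name the file after its content hash so identical PDFs share one file.
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    target = storage_dir / f"{digest}.pdf"
    target.write_bytes(pdf_bytes)
    # "file_path" is a hypothetical stored field holding the location.
    return {"id": doc_id, "file_path": str(target)}


def load_pdf(doc: dict) -> bytes:
    """Read the original PDF back using the path kept in the stored field."""
    return Path(doc["file_path"]).read_bytes()
```

The index then stays small, since it only holds a path, but the files have to be backed up and kept in sync with the index separately.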

R Ubben 2010-08-06 18:31:09

Mikos, thanks for your response! You mentioned that the text content of the PDF can be stored. But is storing the text necessary for index search to work?

Keshav 2010-08-06 18:36:31

It is not necessary for searching. But if you need highlighting (snippets), you will need to store the text.
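To make the distinction concrete, here is a toy Python model. It is not how Solr or Lucene is implemented, just a simplified sketch: the index maps terms to document ids, so a search can find a document without keeping its text, while producing a highlighted snippet needs the original text to be available. All names are hypothetical.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict, term: str) -> list:
    """Return matching document ids; only the index is consulted."""
    return sorted(index.get(term.lower(), set()))


def highlight(stored_text: str, term: str) -> str:
    """Wrap matching words in <em> tags; this needs the stored text."""
    return " ".join(
        f"<em>{word}</em>" if word.lower() == term.lower() else word
        for word in stored_text.split()
    )


docs = {
    "doc-1": "Solr indexes extracted PDF text",
    "doc-2": "Lucene stores fields on request",
}
index = build_index(docs)

# Searching works from the index alone.
hits = search(index, "pdf")

# A snippet can only be built because the text of doc-1 was kept.
snippet = highlight(docs["doc-1"], "pdf")
```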

Mikos 2010-08-06 18:45:22
