Hi all, I am relatively new to the wonderful world of Solr and have the following question: what is the best way to process documents, in the sense of extracting their text and structure and passing it on to Solr for indexing?

I would like to be able to extract the text from Word documents, PDFs, spreadsheets, HTML pages, etc. In fact, virtually any document that contains text.

I have taken a look at Windows IFilters, and at first glance they seem to provide the functionality I require.

Is this how you would do it?

sime

A: 

You probably want to look at the Solr Cell project. I'm assuming you're using the C# client, but you will probably need to do all the content extraction/mapping on the server side with Java tools.

The Solr Cell page has instructions on how to use Apache Tika, which can wrap libraries that extract text (and some metadata) from a wide variety of formats, like Word or PDF.
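To make the server-side flow concrete: Solr Cell exposes an extracting request handler at `/update/extract` that accepts the raw binary file and runs it through Tika for you. A minimal sketch of the request URL in Python (the host, port, and document id here are assumptions for illustration, not taken from the thread):

```python
from urllib.parse import urlencode

# Hypothetical local Solr instance; adjust for your installation.
solr = "http://localhost:8983/solr"

# Solr Cell's extracting handler listens at /update/extract.
# literal.* parameters attach plain field values to the extracted
# document, and commit=true makes it searchable immediately.
params = urlencode({"literal.id": "doc1", "commit": "true"})
extract_url = "%s/update/extract?%s" % (solr, params)
print(extract_url)
```

From the command line, the equivalent request can be sent by POSTing the file to that URL, e.g. with `curl "$EXTRACT_URL" -F "myfile=@somefile.pdf"`, as shown on the Solr Cell wiki page.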

Philip Rieck
A: 

As Philip said, SolrCell is the standard way to index these binary document types. However, it's still not supported by SolrNet, so your options are:

  1. Implement it and contribute it to the project, or
  2. Work around it: build your own HTTP requests to send to Solr, bypassing SolrNet for that particular piece of functionality.
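To illustrate option 2, here is a rough sketch (in Python rather than C#, for brevity) of building the multipart POST to Solr Cell's `/update/extract` handler yourself. The base URL, document id, and field parameters are assumptions for illustration; the same request shape can be reproduced with `HttpWebRequest` in C#:

```python
import uuid
from urllib import request
from urllib.parse import urlencode


def build_extract_request(solr_base, doc_id, filename, data):
    """Build (but do not send) a multipart/form-data POST to Solr's
    /update/extract handler, for when the client library has no
    SolrCell support. Field names here are illustrative."""
    boundary = uuid.uuid4().hex
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    body = (
        ("--%s\r\n" % boundary).encode()
        + ('Content-Disposition: form-data; name="file"; '
           'filename="%s"\r\n' % filename).encode()
        + b"Content-Type: application/octet-stream\r\n\r\n"
        + data
        + ("\r\n--%s--\r\n" % boundary).encode()
    )
    url = "%s/update/extract?%s" % (solr_base, params)
    req = request.Request(url, data=body)
    req.add_header("Content-Type",
                   "multipart/form-data; boundary=%s" % boundary)
    return req


# Build a request for a sample PDF (fake bytes, hypothetical server):
req = build_extract_request("http://localhost:8983/solr",
                            "doc1", "report.pdf", b"%PDF-1.4 ...")
print(req.get_full_url())
```

Passing the request to `urllib.request.urlopen(req)` would actually send it to the (assumed) running Solr instance.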

Also, some users have preferred iTextSharp / Aspose over SolrCell due to performance issues.

Mauricio Scheffer