Can you use ExtractingRequestHandler and Tika with any of the compressed or archive file formats (zip, tar, gz, etc.) to extract their contents for indexing?

I am sending Solr the archived.tar file using curl:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"

When I query the document, the names of the files inside the archive are indexed as "body_texts", but the contents of those files are not extracted or included. This is not the behavior I expected. Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example

When I send one of the actual documents inside the archive using the same curl command, the extracted content is stored in the "body_texts" field. Am I missing a step for compressed files?
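Since posting a single document works, one fallback I am considering is to unpack the archive on the client and post each member to the same handler. Below is a minimal Python sketch of that idea, not a fix for the handler itself. The function names (iter_extract_requests, post_member) and the choice of using each member's path inside the archive as literal.id are my own assumptions; the URL and the fmap.content parameter are taken from the curl command above.

```python
import tarfile
import urllib.request
from urllib.parse import urlencode

# Assumption: same extract handler URL as in the curl command above.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def iter_extract_requests(archive_path, field="body_texts"):
    """Yield (url, payload_bytes) for each regular file inside a tar archive."""
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other non-file entries
            payload = tar.extractfile(member).read()
            params = urlencode({
                # Hypothetical id scheme: the member's path inside the archive.
                "literal.id": member.name,
                "fmap.content": field,
            })
            yield f"{EXTRACT_URL}?{params}", payload


def post_member(url, payload):
    """POST one file's raw bytes to the extract handler; returns the HTTP status."""
    request = urllib.request.Request(
        url,
        data=payload,  # supplying data makes this a POST
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would be a loop such as `for url, payload in iter_extract_requests("/home/archived.tar"): post_member(url, payload)`. The sketch does not commit; commit=true would still need to be added to a request, as in the curl command. It also reads each member fully into memory, which is only reasonable for small files.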

I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and can successfully extract data from MS Word, PDF, and HTML documents.

I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, and Tika Core 0.4.

Given everything I have read, this version of Tika should support extracting data from all of the files within a compressed file. Any help or suggestions would be appreciated.