tika

Indexing PDF files with Symfony using Lucene

I am a Symfony developer and my web server runs Linux. I already use the sfLucene plugin. What is the simplest way of indexing PDF files for search on a Linux PHP server? The options I see: XPDF, installed like this; Apache Tika via the SOLR sfLucene plugin branch; or a third option? Thanks! ...

Solr's TikaEntityProcessor not working

I'm trying to get Solr to index a database in which one column is a filename of a PDF document I'd like to index. My configuration looks like this: <dataConfig> <dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/document_db" user="user" password="password" readOnly="true"/> <dataSource name="ds-file" t...

Ways to send binary/structured documents to SOLR?

I am using SOLR's ExtractingRequestHandler to ingest the text of documents. The examples in the documentation all use curl to stream documents, like so: curl 'http://.../extract?literal.id=doc1&commit=true' -F "[email protected]" That works just fine, but there is this note: using "curl" or other command line tools to p...
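For readers who want to avoid curl, here is a minimal sketch of the same request built with Python's standard library. It only constructs the POST and does not send it. The base URL `http://localhost:8983/solr`, the helper name `build_extract_request`, and the sample payload are assumptions for illustration; the `literal.id` and `commit` parameters are taken from the curl example in the question.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, doc_id, payload,
                          content_type="application/octet-stream"):
    """Build (but do not send) a POST to Solr's /update/extract handler.

    base_url is an assumed Solr location; adjust it to your deployment.
    """
    # Same query parameters as the curl example: literal.id and commit.
    query = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        url=f"{base_url}/update/extract?{query}",
        data=payload,  # raw document bytes sent as the request body
        headers={"Content-Type": content_type},
        method="POST",
    )


# Hypothetical document bytes; in practice, read them from the file to index.
req = build_extract_request(
    "http://localhost:8983/solr", "doc1", b"%PDF-1.4 ...", "application/pdf"
)
```

Passing `req` to `urllib.request.urlopen` would actually submit the document; that step is left out so the sketch runs without a Solr server.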

Extract xml data from gzip file using apache tika?

I am working on a project in which I need to extract XML (sitemap) data from a gz file using Apache Tika (I am new to Tika). The file name is something like sitemap01.xml.gz. I could extract data from a normal text file or HTML, but I don't know how to extract XML from gz and extract the metadata and data from the XML... I have searched Google for the past two days. D...
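As a point of comparison, and not a Tika solution, the decompress-then-parse idea can be sketched with Python's standard library alone. The function name `sitemap_urls` and the in-memory sample are hypothetical; the namespace is the one defined by the sitemaps.org protocol, which a real sitemap file is assumed to use.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed for the input file).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(gz_bytes):
    """Decompress gzipped sitemap XML and return the text of each <loc>."""
    with gzip.open(io.BytesIO(gz_bytes), "rb") as fh:
        root = ET.parse(fh).getroot()
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Hypothetical stand-in for the contents of a file such as sitemap01.xml.gz.
sample = gzip.compress(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>http://example.com/a</loc></url>"
    b"<url><loc>http://example.com/b</loc></url>"
    b"</urlset>"
)
urls = sitemap_urls(sample)
```

For a file on disk, the bytes can be read with `open(path, "rb").read()` and passed to the same function.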

How to configure Apache Tika with Apache Solr 1.4.1

Hi all, I want to index a large number of PDF documents. I have found a reference saying it can be done with Apache Tika, but unfortunately I did not find any reference on how to configure Apache Tika with Solr 1.4.1. My other question is how to send documents to Solr directly without using curl; I am using SolrNet for indexing. Reg...

Question related to Solr's ExtractingRequestHandler?

Hi all, I have configured the extracting request handler with Solr, and now when I submit a PDF document to Solr using curl it generates the following error: Document [NULL] missing required field DocID. My schema looks like this: <fields> <field name="DocID" type="string" indexed="true" stored="true"/> <field name="Contents" type="text" indexed="true"...

Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats.

Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc.) to extract the content for indexing? I am sending Solr the archived.tar file using curl: curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/...