I am a Symfony developer and my web server is Linux. I already use the sfLucene plugin.
What is the simplest way of indexing PDF files for search on a Linux PHP server?
- XPDF, installed like this
- Apache Tika via the SOLR sfLucene plugin branch
- A 3rd option?
Thanks!
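For the XPDF route, the extraction step is usually just its pdftotext utility, which a PHP indexer can shell out to. A minimal sketch, assuming pdftotext is installed and `manual.pdf` is a hypothetical file:

```shell
# Assumes xpdf/poppler's pdftotext is on the PATH; manual.pdf is a placeholder.
# "-" sends the extracted plain text to stdout, ready to feed to sfLucene.
pdftotext manual.pdf -
```

From PHP this would typically be wrapped in shell_exec() or proc_open(), with the output handed to the sfLucene document builder.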
...
I'm trying to get Solr to index a database in which one column is a filename of a PDF document I'd like to index. My configuration looks like this:
<dataConfig>
<dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/document_db" user="user" password="password" readOnly="true"/>
<dataSource name="ds-file" t...
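For wiring a filename column to PDF content, Solr's DataImportHandler can nest a TikaEntityProcessor entity inside the JDBC entity: the outer query yields the row, and the inner entity reads the file it names. A rough sketch along the lines of the config above; the table name, column names, target fields, and the /var/data path are placeholders:

```xml
<dataConfig>
  <dataSource name="ds-db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/document_db"
              user="user" password="password" readOnly="true"/>
  <!-- BinFileDataSource reads the raw PDF bytes from disk for Tika -->
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document>
    <entity name="doc" dataSource="ds-db"
            query="SELECT id, filename FROM documents">
      <field column="id" name="id"/>
      <!-- TikaEntityProcessor extracts text from the file named by the row -->
      <entity name="pdf" processor="TikaEntityProcessor"
              dataSource="ds-file" format="text"
              url="/var/data/pdfs/${doc.filename}">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```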
I am using SOLR's ExtractingRequestHandler to ingest the text of documents.
The examples in the documentation all use curl to stream documents, like so:
curl 'http://.../extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"
That works just fine, but there is this note:
using "curl" or other command line
tools to p...
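If the concern is avoiding multipart uploads from command-line tools, one alternative is Solr's remote streaming: with enableRemoteStreaming="true" set on the requestDispatcher in solrconfig.xml, the handler reads the file from disk itself. A sketch, with the id and path as placeholders:

```shell
# Solr opens /data/docs/tutorial.pdf directly; curl uploads nothing
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&stream.file=/data/docs/tutorial.pdf"
```

This only works when the file path is readable by the Solr server process, and remote streaming should be enabled with care since it lets clients name arbitrary paths.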
I am working on a project in which I need to extract XML (sitemap) data from a .gz file using Apache Tika [I am new to Tika].
The file name is something like sitemap01.xml.gz.
I can extract data from a normal text file or HTML, but I don't know how to extract the XML from the gz, or the metadata and content from the XML...
I have searched Google for the past two days.
D...
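Tika's AutoDetectParser will normally unwrap the .gz layer on its own when the tika-parsers jar is on the classpath, so the XML parser ends up seeing sitemap01.xml. As a quick sanity check outside Tika, the decompress-then-extract step can be sketched with standard Unix tools; the sample sitemap below is invented for the demo:

```shell
# Build a tiny stand-in for sitemap01.xml.gz (contents are made up for illustration)
printf '<?xml version="1.0"?><urlset><url><loc>http://example.com/a</loc></url><url><loc>http://example.com/b</loc></url></urlset>' > sitemap01.xml
gzip -f sitemap01.xml

# Decompress on the fly and pull each <loc> URL out of the sitemap
zcat sitemap01.xml.gz | grep -o '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'
# prints http://example.com/a and http://example.com/b, one per line
```

The same two-step shape (decompress, then parse the inner document) is what Tika does internally for compressed formats.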
Hi All
I want to index a large number of PDF documents. I have found a reference saying this can be done with Apache Tika, but unfortunately I did not find any reference for how to configure Apache Tika with Solr 1.4.1.
One other question: how can I send documents to Solr directly, without using curl? I am using SolrNet for indexing.
Reg...
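In Solr 1.4.x, Tika is not configured separately: it ships as the "Solr Cell" contrib (the apache-solr-cell jar plus the Tika jars under contrib/extraction/lib). Copying those jars into the core's lib directory and registering the handler is usually all that is needed. A sketch of the solrconfig.xml entry, with the target field name as a placeholder:

```xml
<!-- solrconfig.xml: enable the Tika-backed extraction handler (Solr Cell) -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- map Tika's extracted body into your full-text field (placeholder name) -->
    <str name="fmap.content">Contents</str>
  </lst>
</requestHandler>
```

As for avoiding curl: SolrNet has its own client-side support for posting files to the extract handler, so check the SolrNet documentation rather than shelling out to a command-line tool.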
Hi All
I have configured the ExtractingRequestHandler with Solr, and now when I submit a PDF document to Solr using curl it generates the following error:
Document [NULL] missing required field DocID
My schema looks like this:
<fields>
<field name="DocID" type="string" indexed="true" stored="true"/>
<field name="Contents" type="text" indexed="true"...
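That error usually means no value was supplied for the required DocID field: Tika only produces the extracted content and metadata, so any required schema field has to be passed on the request as a literal. A sketch, with the id value and filename as placeholders:

```shell
# literal.DocID supplies the required field that the extracted document lacks
curl "http://localhost:8983/solr/update/extract?literal.DocID=doc1&commit=true" -F "myfile=@some.pdf"
```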
Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc.) to extract the content for indexing?
I am sending Solr the archived.tar file using curl:
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/...
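The truncated command appears to be setting a Content-Type header. A plausible complete form, assuming the stock example URL above and letting Tika auto-detect the archive format from the bytes:

```shell
# Send the tar with a generic binary Content-Type; Tika auto-detects the format
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" \
     -H 'Content-Type: application/octet-stream' --data-binary @archived.tar
```

Note that Tika does recognize archive containers, but how the entries inside are indexed (concatenated into one document versus names only) has varied across Tika/Solr versions, so it is worth testing against your exact release.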