Hi,

I have a Solr box which is fed by a PHP cronjob right now.

I want to speed things up and save some memory by switching to a C++ process.

I don't want to reinvent the wheel by creating a new library.

The only thing is I can't find a library for Solr in C++.

Otherwise I will have to create one myself, probably using cURL.
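
If it comes to that, I imagine it would look something like this minimal libcurl sketch (assuming a stock Solr at http://localhost:8983/solr/update; the document fields are just placeholders):

#include <curl/curl.h>
#include <iostream>
#include <string>

// Sketch: POST one document to Solr's XML update handler via libcurl.
int main() {
    curl_global_init(CURL_GLOBAL_ALL);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    // One <add> with one <doc>; real code would XML-escape field values.
    const std::string body =
        "<add><doc>"
        "<field name=\"id\">42</field>"
        "<field name=\"title\">hello solr</field>"
        "</doc></add>";

    struct curl_slist* headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: text/xml");

    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8983/solr/update");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());

    CURLcode rc = curl_easy_perform(curl);  // send the <add>
    if (rc != CURLE_OK)
        std::cerr << "curl: " << curl_easy_strerror(rc) << "\n";

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}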

Do any of you know of a library for Solr written in C++?

Thanks.

A: 

With "fed" do you mean documents are passed for indexing? You'll probably find that the process that is doing the "feeding" is not the bottleneck but rather how quickly Solr can ingest documents.

I'd also recommend some profiling before you do a lot of work, because the process is usually not CPU bound, so the speed increase you'll get by moving to C++ will be disappointing.

leonm
After some early results, Solr does seem to be the bottleneck! Any solution to that? Should I switch to CLucene (which doesn't seem to be maintained anymore)?
stunti
Quite often switching to C or C++ does not give you the speed increase that you might expect. Check the Solr docs on speeding things up. The first thing that comes to mind is submitting multiple <doc>s in a single <add>.
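
For concreteness, that batching might look like this on the C++ side (a sketch; id and title are placeholder fields, and real code would XML-escape the values):

#include <string>
#include <utility>
#include <vector>

// Build one <add> containing many <doc>s, so Solr ingests a whole batch
// in a single HTTP request instead of one round trip per document.
std::string batch_add(const std::vector<std::pair<std::string, std::string>>& docs) {
    std::string xml = "<add>";
    for (const auto& d : docs) {
        xml += "<doc>"
               "<field name=\"id\">" + d.first + "</field>"
               "<field name=\"title\">" + d.second + "</field>"
               "</doc>";
    }
    return xml + "</add>";  // POST this to /solr/update, then send <commit/>
}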
leonm
A: 

Have you optimised your schema as much as possible? The two obvious first steps (illustrated in the schema.xml snippet below) are:

1. Don't store data that is not needed for display (field IDs, metadata, etc.).

2. ...and the opposite of that: don't index data that is ONLY used for display but never searched on (supplementary data).
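
In schema.xml terms, that boils down to flags like these (hypothetical field names, just to illustrate):

<!-- searched AND displayed -->
<field name="title" type="text" indexed="true" stored="true"/>
<!-- 1. searched but never displayed: don't store it -->
<field name="internal_id" type="string" indexed="true" stored="false"/>
<!-- 2. displayed but never searched: don't index it -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>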

And a quirky thing to try, which sometimes works and sometimes doesn't, is changing the overwrite attribute of <add> to false:

<add overwrite="false">

This disables the unique id check (I think). So if you are doing a full wipe/replace of the index, and you are certain you are only adding unique documents, this can speed up the import. It really depends on the size of the index, though: with over 2,000,000 documents, every time the indexer adds a new one you save a little work by not forcing it to check whether that document already exists. Not the most eloquent explanation, but I hope it makes sense.
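
As a sketch of that full wipe/replace flow (a stub stands in for the HTTP POST, which would work like the libcurl code in the question):

#include <iostream>
#include <string>

// Stub for an HTTP POST to Solr's update handler; here it just prints
// the payload it would send.
static void post_to_solr(const std::string& body) {
    std::cout << "POST /solr/update\n" << body << "\n\n";
}

int main() {
    post_to_solr("<delete><query>*:*</query></delete>");     // 1. wipe the index
    post_to_solr("<add overwrite=\"false\">"                 // 2. re-add everything,
                 "<doc><field name=\"id\">1</field></doc>"   //    skipping the
                 "<doc><field name=\"id\">2</field></doc>"   //    unique id check
                 "</add>");
    post_to_solr("<commit/>");                               // 3. make it searchable
    return 0;
}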

Personally, I use the DataImportHandler, which cuts out the need for an intermediary script. It just hooks up to the db and sucks out the info it needs with a single query.
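
A minimal data-config.xml along those lines (the connection details and query are placeholders for my setup):

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="solr" password="secret"/>
  <document>
    <entity name="item" query="SELECT id, title FROM items">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>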

jspash