Hi,

I'm looking for an open source search indexing library. It will be used in an embedded web application, so it should have a small code size. Preferably it is written in C, C++ or PHP and does not require a database to be installed for storing indexes; indexes should be stored in a file instead (e.g., xml, txt). I looked at some well-known search libraries such as Xapian and CLucene. They're good, but their code size is relatively large for an embedded system.

This will be run on a Linux platform and will be used to index HTML files.

Any thoughts on what would be a good search library/API to use?

Thanks.

A: 

First: you have to store indexes somewhere, so a data file will be needed unless you want memory-only indexes.

To index generic items, I can recommend SQLite: http://www.sqlite.org/. I even use it in memory-only mode when I have a bunch of data that I need to handle with multiple indexes.
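
Here is a minimal sketch of the memory-only mode with more than one index. It is in Python only for brevity; SQLite's C API accepts the same `:memory:` database name and the same SQL. The `pages` table, its columns, and the helper function names are made up for illustration.

```python
import sqlite3


def build_index(pages):
    """Load (path, title, size) rows into an in-memory SQLite database."""
    # ":memory:" keeps the whole database in RAM; pass a file path instead
    # to persist it in a single file on disk.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (path TEXT PRIMARY KEY, title TEXT, size INTEGER)"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", pages)
    # Two secondary indexes over the same data (hypothetical names).
    conn.execute("CREATE INDEX idx_pages_title ON pages (title)")
    conn.execute("CREATE INDEX idx_pages_size ON pages (size)")
    conn.commit()
    return conn


def find_by_title(conn, title):
    """Return the paths of all pages with exactly this title."""
    rows = conn.execute(
        "SELECT path FROM pages WHERE title = ? ORDER BY path", (title,)
    )
    return [row[0] for row in rows]


def find_larger_than(conn, min_size):
    """Return the paths of all pages larger than min_size bytes."""
    rows = conn.execute(
        "SELECT path FROM pages WHERE size > ? ORDER BY path", (min_size,)
    )
    return [row[0] for row in rows]
```

Note that this covers exact-match and range lookups on columns, not full-text search of the HTML content itself.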

graffic
+2  A: 

Hyper Estraier.

eed3si9n
+2  A: 

Oh, man. There's a few. In order of descending obscurity...

I'm sure there's a ton more out there, but these are the ones I have off the top of my head. Good luck :)

Aeon
+1 for Ferret. It's like Lucene, but written in C (with a Ruby frontend) and much faster.
zenazn
A: 

It depends on your requirements. A full distribution of Lucene (Java) is a JAR file of up to 3MB, but in practice it can be stripped down to well under 1MB. CLucene is probably considerably smaller in practice. How low do you need to go?...

Tomer Gabel
We ruled out Lucene because we can't have a JRE on our system. When I look at CLucene, it's around 20MB. It can still be stripped down, but that's extremely large for our system. I think we can go up to 2MB at most.
teriz
In that case I'm afraid I don't really have a recommendation for you, but I suggest you add this information to your question for future reference.
Tomer Gabel
A: 

Swish-E is written in C and might do what you want. It does not require a database and uses its own binary index file format.

I've also used ht://Dig but it looks like it's been a long time since that software was maintained.

Both will compile on Linux and index HTML just fine.

A third option is SINO, used by AustLII. Contact the team there to make sure you get the latest version. It should compile on Linux without too much trouble. It's not really designed for embedded systems (SINO stands for Size Is No Object), but it had a decent API last I looked and is relatively small, so it might work just as well. It's targeted at HTML and indexes pretty fast. Worth a look, I think. (Disclosure: I worked there a long time ago.)

Finally, we use Solr which is based on Lucene. Solr uses a simple API based on POSTing XML documents to a server. Pretty simple to interface with no matter what your language.
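
To give an idea of what POSTing XML looks like, here is a rough Python sketch that builds an `<add>` message in Solr's XML update format and wraps it in a POST request. The update URL, the field names, and the helper function names are assumptions; the real ones depend on your Solr install and schema.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc_fields):
    """Build an <add><doc>...</doc></add> message for one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        # Each field becomes <field name="...">value</field>.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def make_update_request(update_url, xml_body):
    """Wrap the XML in a POST request object; nothing is sent here."""
    return urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical field names and URL, for illustration only.
xml_body = build_add_xml({"id": "page1", "title": "Home"})
request = make_update_request("http://localhost:8983/solr/update", xml_body)
```

Passing `request` to `urllib.request.urlopen` would actually send it. Documents generally become searchable only after a commit is issued as well.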

Hissohathair