Hi,

I'm looking for an open-source web search library that does not use a search index file. Do you know any?

Thanks, Kenneth

+1  A: 

The original poster clarified in a comment on this reply that what he is looking for is essentially a "grep-like search, but through HTTP". He also mentioned that he needs something that uses little disk space, as he is working with an embedded system.

I am not aware of any related projects, but you might want to look at HTML parsers and XQuery implementations in your language of choice. The former should take care of the "real-life" messiness of HTML, and the latter lets you write a search that is almost as detailed as you might desire.

I assume that you will be working with a set of URLs that will either be provided or already stored locally, since the idea of actually crawling the whole web, discovering links, etc., on an embedded device is thoroughly unrealistic.

That said, with a good HTML parser and XQuery implementation, you do have the tools to extract all the links.
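As a rough sketch of the "grep-like search through HTTP" idea for a provided URL list: the shell functions below fetch each page, search it in a pipe, and print the URLs that match, so nothing is written to disk. The names `fetch_page` and `search_urls` are made up for illustration, `curl` is assumed to be available on the device, and matching is done against the raw HTML rather than parsed text.

```shell
#!/bin/sh
# Hypothetical helper: fetch one page to stdout (assumes curl is installed).
fetch_page() {
    curl -fsS --max-time 10 "$1"
}

# Hypothetical helper: read URLs on stdin, one per line, and print each URL
# whose body matches the extended regex given in $1. Each page is searched
# in a pipe, so no index and no copy of the page is kept on disk.
search_urls() {
    pattern=$1
    while IFS= read -r url; do
        # Skip blank lines in the URL list.
        [ -n "$url" ] || continue
        # </dev/null keeps the fetcher from consuming the URL list on stdin.
        if fetch_page "$url" </dev/null 2>/dev/null | grep -q -E -e "$pattern"; then
            printf '%s\n' "$url"
        fi
    done
}
```

Assuming a file `urls.txt` with one URL per line, it could be called as `search_urls 'embedded' < urls.txt`. Every query re-fetches every page, which is the price of keeping no index.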

My original answer, which was really a request for clarification:

Not sure what you mean. How do you picture a search working without an index? Crawling the web for every query? Piping through to Google? Or are you referring to a specific kind of search index file that you are trying to avoid?

SquareCog
>> How do you picture a search working without an index?
I picture it as a grep-like search, but through HTTP.
>> Crawling the web for every query?
Yes.
>> Piping through to google?
No.
I'm avoiding creating an index file because disk space is scarce, as in an embedded environment.
ksuralta
A: 

You mean:

search.cgi

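#!/bin/sh
# CGI passes the query string (e.g. "s=term&x=1") in QUERY_STRING.
# Keep only the value of the leading s= parameter; it is not URL-decoded.
arg=`echo "$QUERY_STRING" | sed -e 's/^s=//' -e 's/&.*$//'`
cd /var/www/httpd || exit 1
# -exec ... {} + copes with spaces in file names; -e protects patterns
# that start with "-" from being read as grep options.
find . -type f -exec grep -l -E -e "$arg" {} + | awk 'BEGIN { 
        print "Content-type: text/html"; 
        print "";
        print "<HTML><HEAD><TITLE>Search Result</TITLE></HEAD>";
        print "<BODY><P>Here are your search results, sorry it took so long.</P>";
        print "<UL>";
    }
    # Strip the leading "./" and use the whole line, not just the first field.
    { sub(/^\.\//, ""); print  "<LI><A HREF=\"http://yourhost.com/" $0 "\">" $0 "</A></LI>"; }
    END {
        print "</UL></BODY></HTML>";
    }'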

Untested...

Will Hartung
Hmm, something like that, but a more refined version :) As it would be slow, as expected, I'm thinking of showing partial results while the user is waiting.
ksuralta
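One way to show partial results is to emit the page header first and then print each hit as soon as it is found, instead of waiting for the whole scan to finish. This is only a sketch: `stream_results` is a hypothetical name, the document root is passed as an argument, and whether the browser actually sees the rows early depends on the web server not buffering CGI output. File names containing newlines or HTML special characters are not handled.

```shell
#!/bin/sh
# Hypothetical helper: stream_results PATTERN DOCROOT
# DOCROOT is given without a trailing slash. The CGI header and the list
# opening are printed first, then one <LI> per matching file as soon as
# that file has been checked.
stream_results() {
    pattern=$1
    docroot=$2
    printf 'Content-type: text/html\n\n'
    printf '<HTML><BODY><UL>\n'
    find "$docroot" -type f | while IFS= read -r file; do
        if grep -q -E -e "$pattern" "$file"; then
            # Turn the file path into a path relative to the document root.
            rel=${file#"$docroot"/}
            printf '<LI><A HREF="/%s">%s</A></LI>\n' "$rel" "$rel"
        fi
    done
    printf '</UL></BODY></HTML>\n'
}
```

In a CGI script this could be called as `stream_results "$arg" /var/www/httpd`, with `$arg` extracted from the query string as in the script above.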
A: 

I guess there is none (at least none popular enough for users here to be aware of).

We went ahead and coded our own search system.

ksuralta