I've got Nutch and Lucene set up to crawl and index some sites, and I'd like to use a .NET website instead of the JSP site that comes with Nutch.

Can anyone recommend some solutions?

I've seen solutions where an app running on the index server was accessed by the .NET site through remoting.

Speed is obviously a consideration, so can this still perform well?

Edit: could NHibernate.Search work for this?

Edit: We ended up going with Solr index servers, which our ASP.NET site queries through the SolrNet library.

+3  A: 

Instead of using Lucene, you could use Solr to index with Nutch (see here). You can then connect to Solr very easily using one of the two libraries available: SolrSharp and SolrNet.
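To give a rough idea of what talking to Solr from .NET involves, here is a minimal sketch that skips both client libraries and calls Solr's HTTP `/select` handler directly. The class and method names (`SolrQueryExample`, `BuildSelectUrl`, `SearchAsync`) and the base URL are assumptions made up for illustration; they are not part of SolrNet or SolrSharp, which wrap this kind of request in a typed API.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not part of SolrNet or SolrSharp.
public static class SolrQueryExample
{
    // Builds a URL for Solr's standard /select handler.
    // q, rows and wt are standard Solr query parameters.
    public static string BuildSelectUrl(string solrBaseUrl, string query, int rows)
    {
        return solrBaseUrl.TrimEnd('/') + "/select"
            + "?q=" + Uri.EscapeDataString(query)   // URL-encode the user's query
            + "&rows=" + rows                        // maximum number of hits to return
            + "&wt=json";                            // ask for a JSON response
    }

    // Sends the query and returns the raw response body as a string.
    public static async Task<string> SearchAsync(
        HttpClient client, string solrBaseUrl, string query, int rows = 10)
    {
        string url = BuildSelectUrl(solrBaseUrl, query, rows);
        return await client.GetStringAsync(url);
    }
}
```

Assuming a Solr instance at `http://localhost:8983/solr` (an assumed address), `await SolrQueryExample.SearchAsync(new HttpClient(), "http://localhost:8983/solr", "nutch")` would return the raw JSON for the first ten hits. A client library saves you from parsing that response by hand.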

Mauricio Scheffer
Looks really good. Will it be able to take my Lucene indexes?
Scott Cowan
Haven't tried, but it should... trying it is the only way to be sure :)
Mauricio Scheffer
I'm looking at Hadoop compatibility too.
Scott Cowan
Hadoop is Java-only AFAIK, and I don't know how well it interoperates with other platforms...
Mauricio Scheffer
I'm running everything on Debian anyway, even ASP.NET.
Scott Cowan
A: 

Instead of using Solr, I wrote a Java-based indexer that runs as a cron job, and a Java-based web service for querying. I didn't index pages so much as the different types of data that the .NET site uses to build the pages. So there are actually 4 different indexes, each with a different document structure, that can all be queried in about the same way (say: users, posts, messages, photos).

By defining an XSD for the web service responses, I was able to generate classes in both .NET and Java to store a representation of the documents. The web service basically runs the query on the right index and fills out the response XML from the hits. The .NET client parses that back into objects. There's also a JSON interface for any client-side JavaScript.
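As a sketch of the .NET side of that round trip: the real XSD isn't shown here, so the response shape below (`searchResponse`, `hit`, `title`, `score`) and the class names are hypothetical stand-ins. In practice, classes like these would be generated from the XSD rather than written by hand. The parsing itself uses the standard `XmlSerializer`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical response shape; the actual XSD is not shown in the answer.
[XmlRoot("searchResponse")]
public class SearchResponse
{
    // Name of the index the query ran against, e.g. "users" or "posts".
    [XmlAttribute("index")]
    public string Index { get; set; }

    // One <hit> element per matching document.
    [XmlElement("hit")]
    public List<Hit> Hits { get; set; } = new List<Hit>();
}

public class Hit
{
    [XmlAttribute("id")]
    public string Id { get; set; }

    [XmlElement("title")]
    public string Title { get; set; }

    [XmlElement("score")]
    public float Score { get; set; }
}

public static class SearchResponseParser
{
    // Turns the web service's XML response back into typed objects.
    public static SearchResponse Parse(string xml)
    {
        var serializer = new XmlSerializer(typeof(SearchResponse));
        using (var reader = new StringReader(xml))
        {
            return (SearchResponse)serializer.Deserialize(reader);
        }
    }
}
```

Calling `SearchResponseParser.Parse(responseXml)` on the body returned by the web service gives a `SearchResponse` whose `Hits` list can be bound straight to a page or control.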

dlamblin
A: 

SearchBlackBox Luca.Net is a commercial Apache Lucene compatible full-text search API for .NET. It allows you to provide Lucene-powered solutions for .NET.

gimel
Good solution, but it's out of our budget. We're just a poor startup that can't afford 3500 for an interop library.
Scott Cowan
The link is currently broken.
Mauricio Scheffer
+3  A: 

In case it wasn't totally clear from the other answers, Lucene.NET and Lucene (Java) use the same index format, so you should be able to continue using your existing (Java-based) mechanisms for indexing, and then use Lucene.NET inside your .NET web application to query the index.

From the Lucene.NET incubator site:

In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes

Winston Fassett
How about using it with Hadoop?
Scott Cowan
How do you want to combine Lucene with Hadoop? Index data that's already in Hadoop? Store a distributed Lucene index in Hadoop? The latter would probably require a special version of Lucene in order to distribute and query the index. Maybe someone has tried it, but probably in Java.
Winston Fassett
+1  A: 

I'm also working on this.

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

It seems you can submit your query to Nutch and get the results back as RSS.

edit:

Got this working today in a Windows Forms app as a proof of concept. It has two textboxes (searchurl and query), one for the server URL and one for the query, plus one DataGridView.

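    // Requires: using System; using System.Data; using System.Windows.Forms; using System.Xml;
    private void Form1_Load(object sender, EventArgs e)
    {
        // Default to the local Nutch OpenSearch endpoint.
        searchurl.Text = "http://localhost:8080/opensearch?query=";
    }

    private void search_Click(object sender, EventArgs e)
    {
        // URL-encode the query so spaces and special characters survive.
        string uri = searchurl.Text + Uri.EscapeDataString(query.Text);
        Console.WriteLine(uri);

        // Fetch the RSS results from Nutch.
        XmlDocument myXMLDocument = new XmlDocument();
        myXMLDocument.Load(uri);

        // Let the DataSet infer tables from the XML; each RSS <item> becomes a row in the "item" table.
        DataSet ds = new DataSet();
        ds.ReadXml(new XmlNodeReader(myXMLDocument));

        SearchResultsGridView1.DataSource = ds;
        SearchResultsGridView1.DataMember = "item";
    }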
Sam
Well done. We're starting to use Solr for this.
Scott Cowan
And it seems our division is probably going with Windows Search Server Express.
Sam
A: 

Why not switch from Java Lucene to the .NET version? Sure, it's an investment, but it's mostly a class substitution exercise. The last thing you need is more layers that add no value other than being glue. Less glue and more stuff is what you should aim for...

mP
Lucene.NET has no Hadoop provider, which is why we're on Solr now.
Scott Cowan