views: 266
answers: 4

Is there any open-source library that can be used to search the Deep Web?

+1  A: 

If Google is not able to index any of these pages, what makes you think an open-source library will be able to do it? :)

That said, there are some links in your article about crawling the deep Web that may be a good place to start investigating. Here are some others:

Justin Ethier
Google's focus is not the deep Web - I'm not questioning its potential ability, but rather its fitness for purpose. The deep Web is a vast resource of illicit information, regarding munitions and various other topics that it would not be appropriate for Google to index, no matter what "safe search" level they were filed under. By "open-source" I mean rather hack-ish repository initiatives, queryable through some sort of API.
luvieere
Munitions, illicit information... what exactly are you trying to do here?
Justin Ethier
+1  A: 

Very interesting question (+1), but I'm afraid you'll just have to write it yourself (I hope you can prove me wrong, though).

Phil
+2  A: 

luvieere, there is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which exposes repository metadata as XML over HTTP. You can find a registry of participating repositories at http://www.openarchives.org/Register/BrowseSites
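
Harvesting one of those repositories is just a matter of issuing HTTP GET requests and parsing the XML responses. Here is a minimal sketch in Python using only the standard library; the endpoint URL at the bottom is a placeholder, so substitute the base URL of any repository from the registry above:

    # Minimal OAI-PMH harvesting sketch (Python 3, standard library only).
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def harvest(base_url, metadata_prefix="oai_dc"):
        """Yield (identifier, title) pairs from an OAI-PMH repository,
        following resumptionToken pagination until the list is exhausted."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(OAI + "record"):
                header = record.find(OAI + "header")
                identifier = header.findtext(OAI + "identifier")
                title = record.findtext(".//" + DC + "title")
                yield identifier, title
            token = root.findtext(".//" + OAI + "resumptionToken")
            if not token:
                break
            # Resumption requests carry only the verb and the token.
            params = {"verb": "ListRecords", "resumptionToken": token}

    if __name__ == "__main__":
        # Placeholder endpoint -- pick a real one from the OAI registry.
        for identifier, title in harvest("https://example.org/oai"):
            print(identifier, "-", title)
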

Also, the deep Web (also called the Deepnet, the invisible Web, the dark Web, or the hidden Web) refers to World Wide Web content that is not part of the surface Web, which is what standard search engines index.

Commercial search engines have begun exploring alternative methods to crawl the deep Web. The Sitemap Protocol (first developed by Google) and mod_oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms let Web servers advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web. Google's deep Web surfacing system pre-computes submissions for each HTML form and adds the resulting HTML pages into the Google search engine index. The surfaced results account for a thousand queries per second to deep Web content. In this system, the pre-computation of submissions is done using three algorithms (a toy sketch follows the list):

(1) selecting input values for text search inputs that accept keywords,

(2) identifying inputs which accept only values of a specific type (e.g., date), and

(3) selecting a small number of input combinations that generate URLs suitable for inclusion into the Web search index.
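
As a rough illustration of the idea (a toy sketch, not Google's actual system; the form description, keyword list, and typed values are all invented for the example), step (3) amounts to enumerating a bounded number of input combinations and encoding each one as a crawlable GET URL:

    # Toy sketch of pre-computing form submissions, following the three
    # steps above. Everything here is an invented example.
    import itertools
    import urllib.parse

    # Step (1): candidate keywords for free-text inputs, e.g. mined from
    # pages on the same site.
    KEYWORDS = ["history", "physics", "maps"]

    # Step (2): inputs that accept only values of a specific type.
    TYPED_VALUES = {"year": ["2008", "2009", "2010"]}

    def surface_urls(action_url, text_input, typed_inputs, limit=5):
        """Step (3): generate a small number of input combinations as
        crawlable GET URLs, capped at `limit`."""
        urls = []
        typed_names = list(typed_inputs)
        value_lists = [typed_inputs[name] for name in typed_names]
        combos = itertools.product(KEYWORDS, itertools.product(*value_lists))
        for keyword, typed in combos:
            query = {text_input: keyword, **dict(zip(typed_names, typed))}
            urls.append(action_url + "?" + urllib.parse.urlencode(query))
            if len(urls) >= limit:
                break
        return urls

    for url in surface_urls("https://example.org/search", "q", TYPED_VALUES):
        print(url)

Capping the number of generated URLs is the important design point: per step (3), the goal is a small, representative sample of the content behind each form, not an exhaustive enumeration of every possible submission.
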

Nasser Hadjloo