views: 266
answers: 4

Is there any open-source library that can be used to search the Deep Web?

+1  A: 

If Google is not able to index any of these pages, what makes you think an open-source library will be able to do it? :)

That said, there are some links in your article about crawling the deep Web that may be a good place to start investigating. Here are some others:

Justin Ethier
Google's focus is not the deep Web - I'm not questioning its potential ability, but rather its fitness for purpose. The deep Web is a vast resource of illicit information, regarding munitions and various other topics that it would not be appropriate for Google to index, no matter what "safe search" level they were filed under. By "open-source" I mean rather hack-ish repository initiatives, queryable through some sort of API.
luvieere
Munitions, illicit information... what exactly are you trying to do here?
Justin Ethier
+1  A: 

Very interesting question (+1), but I'm afraid you'll just have to write it yourself (I hope you can prove me wrong, though).

Phil
+2  A: 

luvieere, there is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which exposes repository metadata as XML over HTTP. You can find a registry of participating repositories at http://www.openarchives.org/Register/BrowseSites
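
Harvesting one of those repositories is just a matter of issuing HTTP GET requests and parsing the XML responses. Here is a minimal sketch in Python using only the standard library; the endpoint URL at the bottom is a placeholder, so substitute the base URL of any repository from the registry above:

    # Minimal OAI-PMH harvesting sketch (Python 3, standard library only).
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def harvest(base_url, metadata_prefix="oai_dc"):
        """Yield (identifier, title) pairs from an OAI-PMH repository,
        following resumptionToken pagination until the list is exhausted."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(OAI + "record"):
                header = record.find(OAI + "header")
                identifier = header.findtext(OAI + "identifier")
                title = record.findtext(".//" + DC + "title")
                yield identifier, title
            token = root.findtext(".//" + OAI + "resumptionToken")
            if not token:
                break
            # Resumption requests carry only the verb and the token.
            params = {"verb": "ListRecords", "resumptionToken": token}

    if __name__ == "__main__":
        # Placeholder endpoint -- pick a real one from the OAI registry.
        for identifier, title in harvest("https://example.org/oai"):
            print(identifier, "-", title)
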

Also, the deep Web (also called the Deepnet, the invisible Web, the dark Web, or the hidden Web) refers to World Wide Web content that is not part of the surface Web, which is what standard search engines index.

Commercial search engines have begun exploring alternative methods to crawl the deep Web. The Sitemap Protocol (first developed by Google) and mod_oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms let Web servers advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web. Google's deep Web surfacing system pre-computes submissions for each HTML form and adds the resulting HTML pages into the Google search engine index. The surfaced results account for a thousand queries per second to deep Web content. In this system, the pre-computation of submissions is done using three algorithms (a toy sketch follows the list):

(1) selecting input values for text search inputs that accept keywords,

(2) identifying inputs which accept only values of a specific type (e.g., date), and

(3) selecting a small number of input combinations that generate URLs suitable for inclusion into the Web search index.
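
As a rough illustration of the idea (a toy sketch, not Google's actual system; the form description, keyword list, and typed values are all invented for the example), step (3) amounts to enumerating a bounded number of input combinations and encoding each one as a crawlable GET URL:

    # Toy sketch of pre-computing form submissions, following the three
    # steps above. Everything here is an invented example.
    import itertools
    import urllib.parse

    # Step (1): candidate keywords for free-text inputs, e.g. mined from
    # pages on the same site.
    KEYWORDS = ["history", "physics", "maps"]

    # Step (2): inputs that accept only values of a specific type.
    TYPED_VALUES = {"year": ["2008", "2009", "2010"]}

    def surface_urls(action_url, text_input, typed_inputs, limit=5):
        """Step (3): generate a small number of input combinations as
        crawlable GET URLs, capped at `limit`."""
        urls = []
        typed_names = list(typed_inputs)
        value_lists = [typed_inputs[name] for name in typed_names]
        combos = itertools.product(KEYWORDS, itertools.product(*value_lists))
        for keyword, typed in combos:
            query = {text_input: keyword, **dict(zip(typed_names, typed))}
            urls.append(action_url + "?" + urllib.parse.urlencode(query))
            if len(urls) >= limit:
                break
        return urls

    for url in surface_urls("https://example.org/search", "q", TYPED_VALUES):
        print(url)

Capping the number of generated URLs is the important design point: per step (3), the goal is a small, representative sample of the content behind each form, not an exhaustive enumeration of every possible submission.
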

Nasser Hadjloo