Is there any open-source library that can be used to search the Deep Web?
views: 266
answers: 4
If Google is not able to index any of these pages, what makes you think an open-source library will be able to do it? :)
That said, there are some links in your article with regard to crawling the deep web that may be a good place to start investigating. Here are some others:
- Deep Web Research has a LOT of helpful references.
- deepwebtech.com claims to have a deep web search engine, although it is down at the moment.
Very interesting question (+1), but I'm afraid you'll just have to write it yourself (I hope you can prove me wrong, though).
Dear luvieere, there is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which uses XML over HTTP. A list of registered sites is at: http://www.openarchives.org/Register/BrowseSites
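To give an idea of what harvesting over OAI-PMH looks like, here is a minimal Python sketch using only the standard library. It builds a ListRecords request URL and parses a response for record identifiers and Dublin Core titles. The base URL, the sample XML, and the helper names (build_list_records_url, parse_list_records) are made up for illustration; a real harvester would also fetch the URL over HTTP and keep following the resumption token until none is returned.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces the other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, [titles])], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        titles = [t.text for t in record.iter(DC_NS + "title")]
        records.append((identifier, titles))
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return records, token or None


# Hand-written sample response (not taken from a real repository).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("http://example.org/oai"))
    print(parse_list_records(SAMPLE))
```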
Also, the deep Web (also called Deepnet, the invisible Web, dark Web or the hidden Web) refers to World Wide Web content that is not part of the surface Web, i.e. the part that is indexed by standard search engines.
Commercial search engines have begun exploring alternative methods to crawl the deep Web. The Sitemap Protocol (first developed by Google) and mod_oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms let Web servers advertise the URLs that are accessible on them, so that resources not directly linked from the surface Web can be discovered automatically.
Google's deep Web surfacing system pre-computes submissions for each HTML form and adds the resulting HTML pages to the Google search engine index. The surfaced results account for a thousand queries per second to deep Web content. In this system, the pre-computation of submissions is done using three algorithms:
(1) selecting input values for text search inputs that accept keywords,
(2) identifying inputs which accept only values of a specific type (e.g., date), and
(3) selecting a small number of input combinations that generate URLs suitable for inclusion into the Web search index.
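The three steps above can be illustrated with a rough Python sketch. This is not Google's actual method, which is not described here: the form description, the placeholder date value, and the helper names (candidate_values, surface_urls) are all assumptions, and step (3) is reduced to simply capping the number of combinations rather than judging which URLs are worth indexing.

```python
from itertools import islice, product
from urllib.parse import urlencode


def candidate_values(form_input, keywords):
    """Steps (1) and (2), simplified: choose values from the input's declared type."""
    kind = form_input["type"]
    if kind == "text":
        return list(keywords)              # keyword guesses for free-text inputs
    if kind == "date":
        return ["2010-01-01"]              # placeholder typed value (assumption)
    if kind == "select":
        return list(form_input["options"])
    return []                              # unknown input types are skipped


def surface_urls(action, inputs, keywords, max_urls=5):
    """Step (3), simplified: emit at most max_urls GET URLs from input combinations."""
    usable = [(i["name"], candidate_values(i, keywords)) for i in inputs]
    usable = [(name, values) for name, values in usable if values]
    names = [name for name, _ in usable]
    combos = islice(product(*(values for _, values in usable)), max_urls)
    return [action + "?" + urlencode(dict(zip(names, combo))) for combo in combos]


if __name__ == "__main__":
    # Hypothetical search form with one text box and one drop-down.
    form_inputs = [
        {"name": "q", "type": "text"},
        {"name": "year", "type": "select", "options": ["2009", "2010"]},
    ]
    for url in surface_urls("http://example.org/search", form_inputs,
                            ["library", "archive"], max_urls=3):
        print(url)
```

A crawler built this way would still have to fetch each generated URL and decide whether the result page is worth keeping, which is the hard part.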