I see two problems to solve.
The first one: there is no central directory of all URLs in the world, and not even every site you already know will expose a sitemap.
One idea would be to check whether a search engine (Google or another) lets you search at the URL level instead of the content level. You could then generate search queries that return lists of pages likely to match your regex, and filter the results against the regex itself.
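As a rough illustration (not a specific search API), Google's `inurl:` operator restricts matches to the URL itself. A naive, hypothetical helper could pull the literal fragments out of your regex and turn them into such queries; the hits would still need to be re-checked against the full regex afterwards:

    import re

    def literal_fragments(pattern: str, min_len: int = 4):
        """Naively extract the plain-text fragments of a regex by splitting
        on common metacharacters; short fragments are dropped as too vague."""
        fragments = re.split(r"[\\^$.|?*+()\[\]{}]+", pattern)
        return [f for f in fragments if len(f) >= min_len]

    def build_queries(pattern: str):
        """Turn each literal fragment into an inurl: query string."""
        return [f"inurl:{fragment}" for fragment in literal_fragments(pattern)]

    if __name__ == "__main__":
        # hypothetical regex: URLs like /invoice/<year>/<number>.pdf
        print(build_queries(r"/invoice/\d{4}/\d+\.pdf"))  # ['inurl:/invoice/']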
The second one: for certain web services that expose functions as resources, the list of URLs matching a regex can be effectively infinite.
You can apply several checks to avoid this, for example limiting crawl depth, capping the number of URLs per host, and normalizing URLs so that query-string variants collapse into one entry.
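A minimal sketch of such checks; the depth and per-host limits are arbitrary values chosen only for illustration:

    from urllib.parse import urlparse, urlunparse

    MAX_DEPTH = 5              # assumption: reasonable cut-off for path depth
    MAX_URLS_PER_HOST = 1000   # assumption: cap per host against endless resources

    seen = set()
    per_host_count = {}

    def normalize(url: str) -> str:
        """Drop query string and fragment so /item?id=1, /item?id=2, ...
        collapse into a single URL instead of an endless family."""
        parts = urlparse(url)
        return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))

    def should_visit(url: str) -> bool:
        """Apply a few cheap checks before queueing a URL for crawling."""
        url = normalize(url)
        parts = urlparse(url)
        if url in seen:
            return False                              # already queued or visited
        if parts.path.count("/") > MAX_DEPTH:
            return False                              # suspiciously deep path
        if per_host_count.get(parts.netloc, 0) >= MAX_URLS_PER_HOST:
            return False                              # host looks unbounded
        seen.add(url)
        per_host_count[parts.netloc] = per_host_count.get(parts.netloc, 0) + 1
        return True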
By the way, you are facing the same problem as every search engine: making an inventory of the whole web. No one has ever fully solved it.
EDIT:
webcrawler
basic algorithm
take a list of seed sites
for each seed:
    fetch and parse the page it returns
    add each link found in the page to the seed list
    apply some algorithm to index the page against keywords in a database
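A minimal Python sketch of that algorithm, assuming `requests` and `BeautifulSoup` are available; instead of indexing keywords into a database, it simply collects the URLs that match your regex:

    import re
    from collections import deque
    from urllib.parse import urljoin
    import requests                      # assumption: requests is installed
    from bs4 import BeautifulSoup        # assumption: beautifulsoup4 is installed

    def crawl(seeds, url_regex, max_pages=100):
        """Breadth-first crawl following the steps above; returns every
        discovered URL that matches url_regex."""
        pattern = re.compile(url_regex)
        queue = deque(seeds)
        visited = set()
        matches = []

        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)

            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue                               # unreachable page, skip it

            # parse the returned page and extract its links
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])    # resolve relative links
                if pattern.search(link):
                    matches.append(link)               # stand-in for the indexing step
                if link not in visited:
                    queue.append(link)                 # add each link to the seed list

        return matches

    if __name__ == "__main__":
        # hypothetical seed and regex, purely for illustration
        print(crawl(["https://example.com/"], r"/invoice/\d{4}/"))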