Hi! I have a URL. How can I find all the existing sub-URLs of this page? For example,

  1. http://tut.by/car/12324 - exists
  2. ................/car/66666 - doesn't exist

Preferably in Java. I have already experimented with almost all of the crawlers listed at java-source.net/open-source/crawlers; none of them can do that, they can only follow hrefs. Thanks in advance!

+2  A: 

What you seek is not possible. The server defines the actual meaning of the path in a URL, and it's not possible to 'guess' which paths exist unless you know a great deal about the server and how it processes URLs.

Tassos Bassoukos
Understood. So I'll emulate user activity with HttpClient.
dementiev
+2  A: 

That's going to be nearly impossible if there's no index page. While many web servers will create an HTML index page for you if one isn't provided by the site creator, it's very common practice to disable directory listing for security reasons.

Curtis
A: 

I agree, the information you are seeking would be in an index page. For example, sometimes you can go to a website, delete the "page.html" part of the URL, and voilà, you see all the pages and folders in that directory.

But as mentioned, this is often disabled for security reasons, so users cannot wander around.

Therefore, your other choices are to either

A) Guess: keep trying different combinations to brute-force the page URLs (00001, 00002, 00003, etc.).

B) Crawl the website, starting at its root and following links from each page to other pages on the site until all links have been exhausted. Obviously, pages on the site with no links to them will never be found.

C) Ask the owner of the website for the information you require.
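
For option A, a rough sketch in Java (11 or later, using the standard `java.net.http.HttpClient`) might look like the following. The class name `UrlProbe` and its methods are made up for illustration, and the base URL is just the one from the question. It assumes the server answers HEAD requests and returns a non-2xx status for missing pages, which is not guaranteed: some sites reject HEAD, redirect, or return 200 with an error page.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of any library.
class UrlProbe {

    // Builds the candidate URLs base + id for every id in [from, to].
    static List<String> candidates(String base, int from, int to) {
        List<String> urls = new ArrayList<>();
        for (int id = from; id <= to; id++) {
            urls.add(base + id);
        }
        return urls;
    }

    // Keeps only the candidates that the given check reports as existing.
    static List<String> existing(List<String> urls, Predicate<String> exists) {
        List<String> found = new ArrayList<>();
        for (String url : urls) {
            if (exists.test(url)) {
                found.add(url);
            }
        }
        return found;
    }

    // Sends a HEAD request and treats any 2xx status as "exists".
    // Assumption: missing pages produce a non-2xx status on this server.
    static boolean respondsOk(HttpClient client, String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .method("HEAD", HttpRequest.BodyPublishers.noBody())
                    .timeout(Duration.ofSeconds(10))
                    .build();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            int status = response.statusCode();
            return status >= 200 && status < 300;
        } catch (IOException e) {
            return false; // unreachable or timed out: count as not found
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

You would call it along the lines of `UrlProbe.existing(UrlProbe.candidates("http://tut.by/car/", 1, 100), url -> UrlProbe.respondsOk(client, url))` with `client` created by `HttpClient.newHttpClient()`. Keeping the check as a `Predicate` means you can swap in a GET request or a content check if HEAD turns out to be unreliable for the site. Keep the ID range small and add a delay between requests, since hammering someone else's server is likely to get you blocked.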

JonWillis