I am programming a small script using PHP and regular expressions.

The aim of this script is to extract all links in a page and convert them to absolute links when they are relative. I figured out how relative links work, but there are still some questions.

Let's say we have this page: http://www.example.com/xxx1/xxx2/xxx3.html. If this page has the following links:

index.html --- the absolute link will be http://www.example.com/xxx1/xxx2/index.html

./index.html --- the absolute link will be http://www.example.com/xxx1/xxx2/index.html

../index.html --- the absolute link will be http://www.example.com/xxx1/index.html

/index.html --- the absolute link will be http://www.example.com/index.html

So:

index.html = will open in the current directory

./index.html = will also open in the current directory

../index.html = will open in the parent directory

/index.html = will open in the root directory

The problem is: what if the URL is search-engine friendly?

Say we have this URL:

((case1)): http://www.example.com/xxx1/xxx2/xxx3/index/

or

((case2)): http://www.example.com/xxx1/xxx2/xxx3/index

Is "index" in case 1 a directory or a page? Is it a directory or a page in case 2? And what will the following links look like as absolute links in both case 1 and case 2?

index.html --- ?

./index.html --- ?

../index.html --- ?

/index.html --- ?

This may be an easy question for some of you, but for me it is confusing.

Thanks :)

+3  A: 

Direct answer to your example

In case 1, index is a "directory component" of the URL, while in case 2 index is a "file component" of the URL. This is independent of whether it actually is a regular file or directory on the web server -- see the explanation below. I'd call both a "page" if an HTML page is served by the server on those URLs.

Case 1: (Links from http://www.example.com/xxx1/xxx2/xxx3/index/)

  1. index.html -> http://www.example.com/xxx1/xxx2/xxx3/index/index.html
  2. ./index.html -> http://www.example.com/xxx1/xxx2/xxx3/index/index.html
  3. ../index.html -> http://www.example.com/xxx1/xxx2/xxx3/index.html
  4. /index.html -> http://www.example.com/index.html

Case 2: (Links from http://www.example.com/xxx1/xxx2/xxx3/index)

  1. index.html -> http://www.example.com/xxx1/xxx2/xxx3/index.html
  2. ./index.html -> http://www.example.com/xxx1/xxx2/xxx3/index.html
  3. ../index.html -> http://www.example.com/xxx1/xxx2/index.html
  4. /index.html -> http://www.example.com/index.html

So the only link that resolves to the same address in both cases is the absolute-path link, number 4.
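The two lists above can be checked against a standard resolver. The sketch below uses Python rather than PHP, only because Python's standard library ships `urllib.parse.urljoin`; the `LINKS` list and the `resolve_all` helper are names made up for this illustration. The output could serve as expected values when testing a hand-written PHP resolver.

```python
from urllib.parse import urljoin

# The four relative links from the question.
LINKS = ["index.html", "./index.html", "../index.html", "/index.html"]


def resolve_all(base_url):
    """Resolve each sample link against base_url and return a dict."""
    return {link: urljoin(base_url, link) for link in LINKS}


case1 = resolve_all("http://www.example.com/xxx1/xxx2/xxx3/index/")
case2 = resolve_all("http://www.example.com/xxx1/xxx2/xxx3/index")

for link in LINKS:
    print(f"{link:15} case 1: {case1[link]}")
    print(f"{'':15} case 2: {case2[link]}")
```

Only the trailing slash of the base URL differs between the two calls, and that alone changes the result for the first three links.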

Explanation

Links are relative to the URL the browser is at, which may not be the URL you originally entered (for example on an HTTP redirect). Most web browsers will update the URL bar with the current address once you follow a link or are redirected, so unless you just edited that, the address you see there is the one that counts.

URLs ending with a slash are considered to refer to directories (this is implied by the URI syntax in RFC 2396, since obsoleted by RFC 3986, though it does not actually call them that); otherwise they are considered to refer to files within directories.

--Side note: This will not necessarily correspond to the type of filesystem path (if there is one) used by the web server to serve the file. Most web servers, when asked for a URL that maps to a directory on their filesystem, will either serve a file within that directory with some set name (often index.html, but the selection can usually be configured), or an HTML directory listing generated by the server (or an access error if that was disabled). The same will usually be served when a "file URL" for the same path without a trailing slash is requested, in which case the "file URL" actually maps to a directory filesystem path.--

This can lead to inconsistencies such as the above example, where the "file URL" http://www.example.com/xxx1/xxx2/xxx3/index is probably equivalent to the "directory URL" http://www.example.com/xxx1/xxx2/xxx3/index/, but relative links may refer to different paths from those two URLs, and one may work and the other may be broken.

For that reason, when linking to a directory, it is recommended to always use the "directory URL" (with the terminating slash) and not the equivalent "file URL" - e.g. link to http://www.ietf.org/meetings/ and not http://www.ietf.org/meetings even if both would serve the same page. Many web servers are in fact configured to redirect clients requesting the latter to the former using an HTTP 301 redirect response. You can see this if you enter the latter in your browser's URL bar - the URL bar will change to the former once it gets that response.
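For a script that has to do the conversion itself, the rule described above comes down to two steps: take the base URL's path up to and including its last slash, append the relative link, then collapse the "." and ".." segments. Here is a minimal sketch of that idea, written in Python for brevity (the same logic ports to PHP string functions). The function names are invented for this example, and it is deliberately simplified: it assumes plain path links and does not handle query strings, fragments, empty links, or links starting with "//".

```python
from urllib.parse import urlsplit, urlunsplit


def base_directory(path):
    """Return the path up to and including its last slash."""
    return path[: path.rfind("/") + 1]


def remove_dot_segments(path):
    """Collapse "." and ".." segments in an absolute path."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            # Never climb above the root (the leading empty segment).
            if len(output) > 1:
                output.pop()
            continue
        output.append(segment)
    # A trailing "." or ".." names a directory, so keep the final slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


def resolve(base_url, link):
    """Resolve a simple relative link against base_url (simplified)."""
    if urlsplit(link).scheme:
        return link  # already an absolute URL
    base = urlsplit(base_url)
    if link.startswith("/"):
        path = link  # absolute path: ignore the base path entirely
    else:
        path = base_directory(base.path or "/") + link
    return urlunsplit(
        (base.scheme, base.netloc, remove_dot_segments(path), "", "")
    )


print(resolve("http://www.example.com/xxx1/xxx2/xxx3/index/", "../index.html"))
print(resolve("http://www.example.com/xxx1/xxx2/xxx3/index", "../index.html"))
```

The trailing slash matters only inside `base_directory`: with it, "index" stays part of the base directory; without it, "index" is dropped as the file component before the link is appended.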

Thanks, brilliant, thanks very much Tom, you have clarified everything for me. I wish you the best, and now I can complete my project :)
ahmed