I'm writing a basic crawler that simply caches pages with PHP.
All it does is use file_get_contents to get the contents of a webpage and a regex to pull out all the links of the form <a href="URL">DESCRIPTION</a> - at the moment it returns:

Array {
[url] => URL
[desc] => DESCRIPTION
}

The problem I'm having is figuring out the logic for determining whether a link is local, or working out whether it points to a completely different local directory.
It could be any number of combinations, e.g. href="../folder/folder2/blah/page.html" or href="google.com" or href="page.html" - the possibilities are endless.

What would be the correct algorithm to approach this? I don't want to lose any data that could be important.

+1  A: 

First of all, regex and HTML don't mix. Use:

$doc = new DOMDocument();
@$doc->loadHTML($source); // @ silences warnings from real-world malformed HTML

foreach ($doc->getElementsByTagName('a') as $a)
{
  $href = $a->getAttribute('href');
}

Links that may go outside your site start with a protocol or //, e.g.

http://example.com
//example.com/

href="google.com" is link to a local file.

But if you want to create a static copy of a site, why not just use wget?

porneL
A: 

You would have to look for http:// in the href. Otherwise, you could check whether it starts with ./ or any combination of "./". If you don't find a "/" then you would have to assume it's a file. Would you like a script for this?

James Hartig
sure that would be a great help! :)
E3
A: 

Let's first consider the properties of local links.

These will either be:

  • relative with no scheme and no host, or
  • absolute with a scheme of 'http' or 'https' and a host that matches the machine from which the script is running

That's all the logic you'd need to identify whether a link is local.

Use the parse_url function to separate out the different components of a URL to identify the scheme and host.
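
A minimal sketch of that logic, assuming you know the host you are crawling - the function name isLocalLink() and the $myHost parameter are illustrative, not a standard API:

// Sketch of the above using parse_url(); $myHost is the host being crawled.
function isLocalLink($href, $myHost)
{
  $parts = parse_url($href);
  if ($parts === false) {
    return false; // parse_url() couldn't handle it - treat as not local
  }

  $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : '';
  $host   = isset($parts['host'])   ? strtolower($parts['host'])   : '';

  // Relative link: no scheme and no host means it stays on this site.
  if ($scheme === '' && $host === '') {
    return true;
  }

  // Absolute link: local only if it's http/https and the host matches ours.
  return in_array($scheme, array('http', 'https'))
      && $host === strtolower($myHost);
}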

Jon Cram
be careful with parse_url it fails really easily :P
James Hartig