I'm writing a basic crawler in PHP that simply caches pages.
All it does is use file_get_contents to fetch the contents of a webpage and a regex to pull out all the links of the form <a href="URL">DESCRIPTION</a>.
At the moment it returns, for each link:
Array {
[url] => URL
[desc] => DESCRIPTION
}
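For reference, the extraction step looks roughly like the sketch below. It is a minimal illustration rather than my exact code: the function name extract_links and the sample HTML are made up for the example, and the regex only handles quoted href values.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Returns one ['url' => ..., 'desc' => ...] entry per <a href="..."> link found.
function extract_links(string $html): array
{
    $links = [];
    // Matches <a ... href="URL" ...>DESCRIPTION</a> with single- or double-quoted hrefs.
    // A regex is only an approximation of HTML parsing; unquoted hrefs are not matched.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = [
                'url'  => trim($m[2]),
                // Drop any nested tags (e.g. <b>) from the link text.
                'desc' => trim(strip_tags($m[3])),
            ];
        }
    }
    return $links;
}

// In the crawler, $html would come from file_get_contents($pageUrl);
// a literal string is used here so the sketch runs without network access.
$html = '<p><a href="../folder/folder2/blah/page.html">Deep page</a> '
      . '<a class="x" href="page.html"><b>Sibling</b> page</a></p>';
$links = extract_links($html);
print_r($links);
```

Each entry still holds the href exactly as written in the page, which is why the relative forms need resolving afterwards.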
The problem I'm having is working out the logic for deciding whether a link is local (on the same site) and, if it is, which directory it actually points to, since it may be in a completely different local directory from the current page.
It could be any number of combinations, e.g. href="../folder/folder2/blah/page.html", href="google.com" or href="page.html"; the possibilities are endless.
What would be the correct algorithm for this? I don't want to lose any data that could be important.