ansaurus

Question

How to look up a URL on a page

Answer 1

+12 A:

Patrick Daryll Glandien 2009-03-20 12:49:20

+1 I agree, page scraping is much more fun with DOM parsing as opposed to regexes.

karim79 2009-03-20 12:55:36

+1, although it will only work on well-formed XHTML documents.

vartec 2009-03-20 12:58:36

Thanks, I just learned how useful XPath is :)

Omar Abid 2009-03-20 14:24:10
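
To make the XPath idea and the well-formedness caveat from the comments above concrete, here is a minimal sketch in Python rather than PHP. It uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and accepts only well-formed markup. The XHTML fragment and the function name img_span_texts are assumptions made for illustration; the real page is not shown in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML fragment (the asker's page is not shown).
XHTML = """<div>
  <span class="img">http://www.img.com/img.jpg</span>
  <span alt="yada" class="img">animage.png</span>
  <span>not an img class</span>
</div>"""


def img_span_texts(markup):
    """Return the text of every <span class="img"> in well-formed markup."""
    root = ET.fromstring(markup)
    # ElementTree's XPath subset includes the [@attrib='value'] predicate.
    return [span.text for span in root.findall(".//span[@class='img']")]


print(img_span_texts(XHTML))
```

If the markup is not well-formed (for example, a missing closing tag), ET.fromstring raises ET.ParseError, so this approach only suits pages known to be valid XHTML.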

1- Change the 'img' to "img".
2- This gives plenty of warnings; how do I disable them? (But it works and gives the results needed.)

Omar Abid 2009-03-20 14:32:16

'img' should work just as well. To disable the warnings, prefix the $dom->loadHTML() call with an @, like this: @$dom->loadHTML($html);

Patrick Daryll Glandien 2009-03-20 14:43:00

The 'img' doesn't work; you need "img" (you already used the ''):
Parse error: syntax error, unexpected T_STRING in C:\Program Files\Abyss Web Server\htdocs\grab.php on line 9

Omar Abid 2009-03-20 17:10:59

Answer 2

+1 A:

A pattern like <span.* class="img".*>([^<]*)</span> should work fine, assuming your code looks something like:

<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>


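<?php

// Capture the text inside every <span> that has class="img".
$pattern = '@<span.* class="img".*>([^<]*)</span>@i';

// $subject holds the HTML shown above.
$subject = <<<'HTML'
<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>
HTML;

preg_match_all($pattern, $subject, $matches);

// $matches[0] holds the full matches, $matches[1] the captured text.
print_r($matches);

?>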

David Caunt 2009-03-20 13:09:40

This shows the complete span :) Anyway, that's a good starting point. I chose to work with this because it's safer if my page doesn't load completely :D

Omar Abid 2009-03-20 14:51:27

I think $matches[0] will contain the full match (e.g. <span...</span>) but $matches[1] will contain the first captured expression: the bit inside the <span>

David Caunt 2009-03-20 14:58:57
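
The difference between the full match and the captured group described in the comment above can be sketched in Python with the standard re module. This is an illustration only, not the PHP code from the answer: the pattern and sample HTML come from the answer, while the helper name split_matches is an assumption. Python's re and PHP's PCRE are different engines, though this particular pattern uses only features common to both.

```python
import re

# Same pattern as in the answer; re.IGNORECASE mirrors PHP's trailing "i" flag.
PATTERN = re.compile(r'<span.* class="img".*>([^<]*)</span>', re.IGNORECASE)

SUBJECT = """<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>"""


def split_matches(pattern, subject):
    """Return (full, captured) lists, analogous to $matches[0] and $matches[1]."""
    full, captured = [], []
    for m in pattern.finditer(subject):
        full.append(m.group(0))      # the whole <span ...>...</span>
        captured.append(m.group(1))  # only the text between the tags
    return full, captured


full, captured = split_matches(PATTERN, SUBJECT)
print(full)
print(captured)
```

Because "." does not match a newline by default, each line is matched on its own here. With several spans on one line, the greedy ".*" can make the full match start at an earlier <span>, which is one reason to read the capture group rather than the full match.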

I strongly advise against using regex, read the blog entry for more...

Patrick Daryll Glandien 2009-03-20 15:56:47

Using the DOM definitely represents a better solution, I agree, if the right extensions are available and the markup is valid.

David Caunt 2009-03-20 16:21:35

Your $matches is a jagged array:
$matches[0] = array containing the findings
$matches[1] = empty

Omar Abid 2009-03-20 17:16:55

Answer 3

+1 A:

I'm using PHP, but any other language will do; I'm looking for how to deal with the first step. Does anyone have a suggestion?

We-e-ell...

import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

html = urllib.urlopen(url).read()
sieve = SoupStrainer(name='span', attrs={'class': 'img'})
tag_soup = BeautifulSoup(html, parseOnlyThese=sieve)
for link in tag_soup('a'):
    print link['href']

(That's Python, using BeautifulSoup; it should work on most documents, well-formed or not.)

elo80ka 2009-03-20 14:20:22
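
For readers without BeautifulSoup, the same idea (collect the links found inside <span class="img">, even when the markup is sloppy) can be approximated with Python 3's standard html.parser module. This is a rough sketch, not the answer's code: the class name ImgSpanLinks and the sample markup are assumptions, and it uses an event-based parser rather than a parse tree.

```python
from html.parser import HTMLParser


class ImgSpanLinks(HTMLParser):
    """Collect href values of <a> tags nested inside <span class="img">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a <span class="img">
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            # Count nested spans so the matching </span> closes the right one.
            if self.depth or "img" in (attrs.get("class") or "").split():
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1


# Hypothetical, deliberately unclosed markup.
SAMPLE = ('<span class="img"><a href="http://www.img.com/img.jpg">pic</a></span>'
          '<p><a href="/other">skip</a>')

parser = ImgSpanLinks()
parser.feed(SAMPLE)
print(parser.links)
```

The parser does not raise on the unclosed <p>, so the link inside the span is still found while the one outside it is ignored.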

So you get the links from this 'soup'. I'll look for something similar in PHP; it's very useful.

Omar Abid 2009-03-20 14:48:19
