I'm new to regular expressions and similar tools. I know very little about them, but I think my current problem calls for them.

I have a webpage that contains text. I want to extract only the links that appear inside SPANs with class="img".

I plan to go through these steps:

  1. grab all the SPANs tagged with the "img" class (this is the hard step that I'm looking for)
  2. move those SPANs to a new variable
  3. Parse the variable to get an array with the links (Each SPAN has only 1 link, so this will be easy)

I'm using PHP, but the language doesn't matter; I'm looking for a way to handle the first step. Does anyone have a suggestion? Thanks :D
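The three steps above can be sketched with a small event-driven parser. This is only a minimal illustration in Python using the standard library's html.parser, chosen because the question says the language doesn't matter. The class name ImgSpanLinks, the helper links_in_img_spans, and the sample page are all made up for the example, and it assumes that "links" means the href of <a> tags nested inside the span.

```python
from html.parser import HTMLParser


class ImgSpanLinks(HTMLParser):
    """Collect href values of <a> tags nested inside <span class="img">."""

    def __init__(self):
        super().__init__()
        self.links = []        # step 3: the resulting array of links
        self._open_spans = []  # one bool per open <span>: is it an "img" span?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            # Step 1: recognise spans whose class list contains "img"
            # (assumption: class="img thumb" should count as well).
            classes = (attrs.get("class") or "").split()
            self._open_spans.append("img" in classes)
        elif tag == "a" and any(self._open_spans) and attrs.get("href"):
            # Steps 2-3: keep only links found inside an "img" span.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "span" and self._open_spans:
            self._open_spans.pop()


def links_in_img_spans(html):
    parser = ImgSpanLinks()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, not taken from the question.
page = """
<p><a href="http://example.com/ignored">outside</a></p>
<span class="img"><a href="http://example.com/one.jpg">one</a></span>
<span class="other"><a href="http://example.com/skipped">no</a></span>
<span class="img thumb"><a href="http://example.com/two.png">two</a></span>
"""
print(links_in_img_spans(page))
# ['http://example.com/one.jpg', 'http://example.com/two.png']
```

Because the parser only reacts to start and end tags, it does not need the page to be complete or well formed, which may matter if the download is cut short.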

+12  A: 
Patrick Daryll Glandien
+1 I agree, page scraping is much more fun with DOM parsing as opposed to regexes.
karim79
+1 although it will only work on well formed XHTML documents.
vartec
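To illustrate the well-formedness caveat, here is a rough sketch of the XPath idea in Python with the standard library's xml.etree.ElementTree, which is a strict XML parser. It is not the code from the answer; the function name hrefs_via_xpath and the sample markup are invented for the example. PHP's loadHTML, mentioned in the comments here, appears to be more forgiving and to emit warnings instead of failing, so the strict behaviour shown is specific to this parser.

```python
import xml.etree.ElementTree as ET


def hrefs_via_xpath(xhtml):
    """Return href values of <a> elements inside <span class="img">."""
    root = ET.fromstring(xhtml)  # raises ET.ParseError if not well formed
    return [
        a.get("href")
        for span in root.iterfind(".//span[@class='img']")
        for a in span.iter("a")
        if a.get("href")
    ]


# Hypothetical well-formed snippet.
well_formed = (
    "<div>"
    '<span class="img"><a href="http://example.com/one.jpg">one</a></span>'
    '<span><a href="http://example.com/skipped">no</a></span>'
    "</div>"
)
print(hrefs_via_xpath(well_formed))  # ['http://example.com/one.jpg']

# An unclosed <br> is fine for browsers but not for a strict XML parser.
try:
    hrefs_via_xpath('<div><span class="img"><br><a href="x">x</a></span></div>')
except ET.ParseError as err:
    print("not well formed:", err)
```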
Thanks, I just learned how useful XPath is :)
Omar Abid
1- change the 'img' to "img" 2- this gives plenty of warnings; how do I disable them? (but it works and gives the results needed)
Omar Abid
'img' should work just as well. To disable the warnings, put an @ in front of the $dom->loadHTML() call, like this: @$dom->loadHTML($html);
Patrick Daryll Glandien
The 'img' doesn't work; you need "img" (you already used the ''). Parse error: syntax error, unexpected T_STRING in C:\Program Files\Abyss Web Server\htdocs\grab.php on line 9
Omar Abid
+1  A: 

A pattern like <span.* class="img".*>([^<]*)</span> should work fine, assuming your HTML looks something like:

<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>


<?php

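$pattern = '@<span.* class="img".*>([^<]*)</span>@i';

// The sample HTML above, as a nowdoc string
$subject = <<<'HTML'
<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>
HTML;

preg_match_all($pattern, $subject, $matches);

// $matches[0] holds the full matches; $matches[1] holds the text captured by ([^<]*)
print_r($matches);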

?>
David Caunt
This shows the complete span :) Anyway, that's a good starting point. I chose to work with this because it's safer if my page doesn't load completely :D
Omar Abid
I think $matches[0] will contain the full match (e.g. <span...</span>) but $matches[1] will contain the first captured expression: the bit inside the <span>
David Caunt
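The full-match versus captured-group distinction is easy to check on the sample lines. The sketch below uses Python's re module, which behaves the same way for this pattern, and is not the PHP from the answer. ORIGINAL is the answer's pattern unchanged; TIGHTER, ONE_LINE, and the helper full_and_captured are hypothetical additions that show what can happen when two spans share a line.

```python
import re

# The pattern from the answer, unchanged; re.IGNORECASE plays the role of the "i" flag.
ORIGINAL = re.compile(r'<span.* class="img".*>([^<]*)</span>', re.IGNORECASE)

# Hypothetical tighter variant: [^>]* cannot run past the end of the opening tag.
TIGHTER = re.compile(r'<span\b[^>]*\bclass="img"[^>]*>([^<]*)</span>', re.IGNORECASE)

SAMPLE = """<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>"""


def full_and_captured(pattern, text):
    """Return (full match, first group) pairs, like $matches[0] and $matches[1]."""
    return [(m.group(0), m.group(1)) for m in pattern.finditer(text)]


pairs = full_and_captured(ORIGINAL, SAMPLE)
for full, inner in pairs:
    print(inner, "<-", full)

# Two spans on one line: the greedy ".*" in the original pattern runs on to the last ">".
ONE_LINE = '<span class="img">a.png</span><span>other</span>'
print(ORIGINAL.findall(ONE_LINE))  # ['other']
print(TIGHTER.findall(ONE_LINE))   # ['a.png']
```

On the four sample lines both patterns capture the same three strings; they only differ once several tags share a line, which is one reason the DOM-based suggestions in this thread tend to be more robust.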
I strongly advise against using regex, read the blog entry for more...
Patrick Daryll Glandien
Using the DOM definitely represents a better solution, I agree, if the right extensions are available and the markup is valid.
David Caunt
Your $matches is a jagged array: $matches[0] = array containing the findings, $matches[1] = empty
Omar Abid
+1  A: 

I'm using PHP, but the language doesn't matter; I'm looking for a way to handle the first step. Does anyone have a suggestion?

We-e-ell...

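from urllib.request import urlopen

from bs4 import BeautifulSoup, SoupStrainer  # BeautifulSoup 4

url = 'http://example.com/'  # placeholder: the page to scrape
html = urlopen(url).read()
# Parse only <span class="img"> elements and their contents
sieve = SoupStrainer(name='span', attrs={'class': 'img'})
tag_soup = BeautifulSoup(html, 'html.parser', parse_only=sieve)
for link in tag_soup('a'):
    print(link['href'])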

(That's Python, using BeautifulSoup; it should work on most documents, well-formed or not.)

elo80ka
So you get the links from this 'soup'. I'll check whether there's something similar in PHP; it's very useful.
Omar Abid