ansaurus

Question

Answer 1

A:

$subject = "<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span>";
$pattern = '/<div id=currency_converter_result>.*?<span.*?>(.*?)<\/span>/';
preg_match($pattern, $subject, $matches);
print_r($matches);

Am 2009-11-28 12:00:49

Hi Am, that worked great, thanks a lot for your support

Kamal Challa 2009-11-28 12:24:47

I recommend you check the DOM option as well. It is better and less likely to break on HTML changes.

Am 2009-11-28 12:43:33

Answer 2

+1 A:

Why would you use regular expressions? I think you should read your X/HTML document into SimpleXML and use XPath to retrieve the desired value. Of course you can use regular expressions, but an XPath solution would be nicer, imo.

$xml = simplexml_load_file("/path/to/document.html");
$node = $xml->xpath("/path/in/doc/to/span[@class='bld']"); // attributes need the @ prefix and a quoted value
...

Björn 2009-11-28 12:02:13
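
A minimal sketch of the SimpleXML + XPath idea, assuming the markup from the question. The unquoted attributes are not well-formed XML, so this sketch parses the string with DOMDocument's HTML parser first and then hands the tree to SimpleXML. The variable names are illustrative only.

```php
<?php
// Sample markup from the question (unquoted attributes, so not valid XML).
$html = '<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span></div>';

// Parse as HTML first, then convert the resulting tree to SimpleXML.
libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xml   = simplexml_import_dom($dom);
$nodes = $xml->xpath('//span[@class="bld"]');

// xpath() returns an array of SimpleXMLElement objects; cast the first one to a string.
$rate = $nodes ? trim((string) $nodes[0]) : null;
echo $rate, "\n";
```

The `//span[...]` form searches the whole document, so it does not depend on where the span sits in the hierarchy.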

Thanks Björn, but looking for regular expressions

Kamal Challa 2009-11-28 12:26:42

Answer 3

+7 A:

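// Requires the Simple HTML DOM library (see the manual linked below)
include 'simple_html_dom.php';

// Create a DOM object from a URL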
$html = file_get_html('http://www.example.com/');

echo $html->find('span.bld', 0)->innertext;

http://simplehtmldom.sourceforge.net/manual.htm

karim79 2009-11-28 12:05:00

+1 for not linking to the most voted answer on SO and providing the alternative instead

Amarghosh 2009-11-28 12:20:33

Awesome, I've never seen this before.

Gary Willoughby 2009-11-28 12:58:22

Answer 4

A:

DOM+Xpath > Regex:

<?php
$str = '
<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span>
<input type=submit value="Convert">
</div>';

$d = new DOMDocument();
$d->loadHTML( $str );
$x = new DOMXpath($d);
$xpr = $x->evaluate('//span[contains(@class, "bld")]');
if ( $xpr->length ) { // DOMNodeList exposes ->length; count() is unreliable on older PHP versions
    foreach ( $xpr as $el ) {
        echo $el->nodeValue;
    }
}
?>

Of course feel free to use simplexml or other similar libraries that involve less code.

Example of the chosen answer breaking, if the HTML was altered as Milan suggested:

<?php
$subject = '
<div>
<div id=currency_converter_result/><b/>1 USD = <span class=bld one>46.4400 INR</span>
<input type=submit value="Convert">
</div></div><span/>';

$pattern = '/<div id=currency_converter_result>.*?<span.*?>(.*?)<\/span>/';
preg_match($pattern, $subject, $matches);
print_r($matches); // output is Array ( )

Other regex answer breaking:

<?php
$subject = '
<div>
<div id=currency_converter_result/><b/>1 USD = <span class=bld one>46.4400 INR</span>
<input type=submit value="Convert">
</div></div><span/>';

preg_match('#<span class=bld>(.*?)</span>#', $subject, $match);
$value = $match[1];
var_dump($value); // outputs NULL

My DOM/Xpath solution works perfectly with the altered markup:

<?php
$subject = '
<div>
<div id=currency_converter_result/><b/>1 USD = <span class=bld one>46.4400 INR</span>
<input type=submit value="Convert">
</div></div><span/>';

$d = new DOMDocument();
$d->loadHTML( $subject );
$x = new DOMXpath($d);
$xpr = $x->evaluate('//span[contains(@class, "bld")]');
if ( $xpr->length ) { // DOMNodeList exposes ->length; count() is unreliable on older PHP versions
    foreach ( $xpr as $el ) {
        echo $el->nodeValue; // output 46.4400 INR
    }
}
?>

meder 2009-11-28 12:05:02

Thanks meder, but looking for regular expressions

Kamal Challa 2009-11-28 12:25:37

Regular Expressions aren't capable of fully parsing HTML, which is why sane developers use solutions like relying on DOM.

meder 2009-11-28 12:30:05

@meder: if you have a subtle error in the HTML, like a tag that is not closed, most browsers will switch to quirks mode and ignore it, while most DOM parsers will choke. Experienced developers do use regex instead of DOM.

Milan Babuškov 2009-11-28 13:08:24
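
For what it's worth, the claim about unclosed tags can be checked against PHP's own DOMDocument. A rough sketch, assuming a hypothetical broken fragment where the `<div>` and `<b>` are never closed; `loadHTML()` uses libxml's recovering HTML parser, and `libxml_use_internal_errors()` keeps the warnings out of the output:

```php
<?php
// Hypothetical broken markup: the <div> and the <b> are never closed.
$broken = '<div id=currency_converter_result><b>1 USD = <span class=bld>46.4400 INR</span>';

libxml_use_internal_errors(true);   // buffer parse warnings instead of emitting them
$doc      = new DOMDocument();
$loaded   = $doc->loadHTML($broken);
$warnings = libxml_get_errors();    // inspect these if markup quality matters to you
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$spans = $xpath->query('//span[contains(@class, "bld")]');

// Take the first matching span, if any.
$rate = $spans->length > 0 ? $spans->item(0)->nodeValue : null;
var_dump($loaded, $rate);
```

Other DOM parsers may be stricter; this only shows how the parser used in the answer above behaves.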

@Milan - DOM is far more reliable, see my updated example of the HTML being altered and the chosen solution breaking entirely while mine stays perfectly fine.

meder 2009-11-28 13:23:59

You'd have to update the regex every single time to take into account the altered markup. And regarding your comment on DOM parsers choking - you should be validating your markup anyway as it's a necessity and best practice.

meder 2009-11-28 13:30:26

@meder: it's just a matter of picking your examples -- `<span>` becomes `<em>` and they all break. Maybe there is no need to account for a secondary class, maybe spans with a secondary class must not be considered at all (in this case DOM breaks, regex doesn't). No solution is **always** good; it depends on the specifications.

kemp 2009-11-28 13:32:32

But the fact is that DOM is more reliable, as clearly demonstrated by the examples. Changing it to an entirely new element would of course break it, but the simple fact that the regular expressions break and the DOM solution stays fine means it's more reliable.

meder 2009-11-28 13:34:50

What if the OP does **NOT** want `<span class=bld one>` but **only** `<span class=bld>`?

kemp 2009-11-28 13:35:55

I'm simply demonstrating that it's more reliable, because I didn't have to update my original solution whereas if your regex solution was in use, it would break, hence the more reliable part. Would you agree that it's more reliable in that sense?

meder 2009-11-28 13:39:40

No, it is not more reliable because - as I said - the double class span might very well **not** be the one needed. Maybe the required data is only contained in `span`s with `class=bld` while multi-class `span`s contain other things of no interest. It is more reliable just because you picked a specific example to make it so.

kemp 2009-11-28 14:12:03
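
The "only `class=bld`" case raised here can also be expressed in XPath, so it is not necessarily a DOM-versus-regex distinction. A small sketch, assuming hypothetical markup with quoted attributes and an illustrative helper named `texts()`; it compares the `contains()` expression from the answer above with an exact attribute match:

```php
<?php
// Hypothetical markup: one multi-class span and one single-class span.
$html = '<div><span class="bld one">ignore me</span> <span class="bld">46.4400 INR</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Illustrative helper: collect the text of every node an expression selects.
function texts(DOMXPath $xpath, string $expr): array
{
    $out = [];
    foreach ($xpath->query($expr) as $node) {
        $out[] = $node->nodeValue;
    }
    return $out;
}

$loose = texts($xpath, '//span[contains(@class, "bld")]'); // substring match on the attribute
$exact = texts($xpath, '//span[@class="bld"]');            // whole attribute must equal "bld"

print_r($loose); // both spans
print_r($exact); // only the single-class span
```

Which of the two is right depends on the specification, which is the point being argued in this thread.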

Answer 5

+2 A:

I think people are going too far in this "can't use regex to parse html" holy war. There is a difference between parsing (X|HT)ML and parsing a simple string which happens to contain a few HTML tags.

According to the specifications in the question this should do:

preg_match('#<span class=bld>(.*?)</span>#', $string, $match);
$value = $match[1];

kemp 2009-11-28 12:24:28

If a secondary class value gets added, boom it's broken.

meder 2009-11-28 12:28:55
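
If a secondary class is a realistic change, the pattern can be loosened somewhat. A hedged sketch, not a general HTML parser: it assumes `class` is the first attribute of the span and that `bld` is the first class name, and the second sample string is hypothetical.

```php
<?php
// Markup variants: the original, and one with a hypothetical second class added.
$samples = [
    '<span class=bld>46.4400 INR</span>',
    '<span class="bld one">46.4400 INR</span>',
];

// Optional quote, then "bld" as a whole word, then anything up to the closing ">".
$pattern = '#<span\s+class=["\']?bld\b[^>]*>(.*?)</span>#';

$values = [];
foreach ($samples as $s) {
    // Store the captured text, or null when the pattern does not match.
    $values[] = preg_match($pattern, $s, $m) ? $m[1] : null;
}
print_r($values);
```

Note that `\b` also treats a hyphen as a word boundary, so a class such as `bld-x` would match too.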

That's why I said **according to the specifications in the question**. Things are not always bound to change, and the best solution strictly depends on the problem: if the problem is specific you don't need a general solution.

kemp 2009-11-28 12:52:39

@meder: and if the designer puts it all into another div, XPath also gets broken. Regex is much easier to maintain because it allows you to focus on the part of the page you want to extract, not the whole DOM hierarchy.

Milan Babuškov 2009-11-28 13:06:47

@Milan - If you're referring to my xpath solution then no, it would still work because it only fetches span elements.

meder 2009-11-28 13:19:07

you can even get away with the pattern `([0-9.]+\s*INR)`

Amarghosh 2009-11-28 14:05:00
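
A quick sketch of that last pattern applied to the string from the question. It ignores the markup entirely and only assumes the value is a number followed by `INR`; splitting the result into amount and currency is an extra step added here for illustration.

```php
<?php
$subject = '<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span>';

// Pattern from the comment above: digits and dots, optional whitespace, then "INR".
if (preg_match('/([0-9.]+\s*INR)/', $subject, $m)) {
    // Split the captured text into a numeric amount and a currency code.
    list($amount, $currency) = preg_split('/\s+/', $m[1]);
    $amount = (float) $amount;
    var_dump($amount, $currency);
}
```

The leading `1` in `1 USD` is skipped because it is not followed by `INR`.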


Regular Expression In PHP
