Hi

I would like to parse the following string to get the value "46.4400 INR":

<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span>
<input type=submit value="Convert">
</div>

What regular expression do I need to use for this?

Please help me out with this.

Thanks in advance, KAMAL CHALLA

A: 
$subject = "<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span>";
$pattern = '/<div id=currency_converter_result>.*?<span.*?>(.*?)<\/span>/';
preg_match($pattern, $subject, $matches);
print_r($matches);
Am
Hi Am, that worked great, thanks a lot for your support
Kamal Challa
I recommend you check the DOM option as well. It is better and less likely to break on HTML changes.
Am
+1  A: 

Why would you use regular expressions? I think you should read your X/HTML document into SimpleXML and use XPath to retrieve the desired value. Of course you can use regular expressions, but an XPath solution would be nicer, imo.

$xml = simplexml_load_file("/path/to/document.html");
$node = $xml->xpath("/path/in/doc/to/span[@class='bld']");
...
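
A minimal sketch of the XPath idea above, applied to the markup from the question. It assumes the fragment is not well-formed XML (unquoted attributes, unclosed `<input>`), so it is first parsed with `DOMDocument::loadHTML` and then handed to SimpleXML through `simplexml_import_dom`. The variable names are illustrative only.

```php
<?php
// Markup from the question; it is HTML rather than valid XML,
// so it goes through the HTML parser first.
$html = '<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span>
<input type=submit value="Convert">
</div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Hand the parsed tree to SimpleXML and query it with XPath.
$xml   = simplexml_import_dom($dom);
$nodes = $xml->xpath('//span[@class="bld"]');

// xpath() returns an array of SimpleXMLElement objects; take the first one, if any.
$value = $nodes ? (string) $nodes[0] : null;
echo $value, "\n";
```

The `//span[...]` form searches the whole document, so no full path to the element is needed.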
Björn
Thanks Björn, but looking for regular expressions
Kamal Challa
+7  A: 
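// Load the Simple HTML DOM library (the path is an assumption; adjust it to your setup)
include 'simple_html_dom.php';

// Create a DOM object from a URL
$html = file_get_html('http://www.example.com/');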

echo $html->find('span.bld', 0)->innertext;

http://simplehtmldom.sourceforge.net/manual.htm

karim79
+1 for not linking to the most voted answer on SO and providing the alternative instead
Amarghosh
Awesome, I've never seen this before.
Gary Willoughby
A: 

DOM+Xpath > Regex:

<?php
$str = '
<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span>
<input type=submit value="Convert">
</div>';

$d = new DOMDocument();
$d->loadHTML( $str );
$x = new DOMXpath($d);
$xpr = $x->evaluate('//span[contains(@class, "bld")]');
// DOMNodeList::length works on every PHP version; count() on a node list needs PHP 7.2+
if ( $xpr->length ) {
    foreach ( $xpr as $el ) {
        echo $el->nodeValue;
    }
}
?>

Of course feel free to use simplexml or other similar libraries that involve less code.

Example of the chosen answer breaking, if the HTML was altered as Milan suggested:

<?php
$subject = '
<div>
<div id=currency_converter_result/><b/>1 USD = <span class=bld one>46.4400 INR</span>
<input type=submit value="Convert">
</div></div><span/>';

$pattern = '/<div id=currency_converter_result>.*?<span.*?>(.*?)<\/span>/';
preg_match($pattern, $subject, $matches);
print_r($matches); // output is Array ( )

Other regex answer breaking:

<?php
$subject = '
<div>
<div id=currency_converter_result/><b/>1 USD = <span class=bld one>46.4400 INR</span>
<input type=submit value="Convert">
</div></div><span/>';

preg_match('#<span class=bld>(.*?)</span>#', $subject, $match);
$value = $match[1] ?? null; // no match, so index 1 is not set
var_dump($value); // outputs NULL

My DOM/Xpath solution works perfectly with the altered markup:

<?php
$subject = '
<div>
<div id=currency_converter_result/><b/>1 USD = <span class=bld one>46.4400 INR</span>
<input type=submit value="Convert">
</div></div><span/>';

$d = new DOMDocument();
$d->loadHTML( $subject );
$x = new DOMXpath($d);
$xpr = $x->evaluate('//span[contains(@class, "bld")]');
// DOMNodeList::length works on every PHP version; count() on a node list needs PHP 7.2+
if ( $xpr->length ) {
    foreach ( $xpr as $el ) {
        echo $el->nodeValue; // output 46.4400 INR
    }
}
?>
meder
Thanks meder, but looking for regular expressions
Kamal Challa
Regular expressions aren't capable of fully parsing HTML, which is why sane developers use solutions that rely on the DOM.
meder
@meder: if you have a subtle error in the HTML, like a tag that is not closed, most browsers will switch to quirks mode and ignore it, while most DOM parsers will choke. Experienced developers do use regex instead of DOM.
Milan Babuškov
@Milan - DOM is far more reliable, see my updated example of the HTML being altered and the chosen solution breaking entirely while mine stays perfectly fine.
meder
You'd have to update the regex every single time to take into account the altered markup. And regarding your comment on DOM parsers choking - you should be validating your markup anyway as it's a necessity and best practice.
meder
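
A small sketch for checking the unclosed-tag scenario discussed above against PHP's own `DOMDocument`. The markup is hypothetical (a `<b>` that is never closed), and the sketch only shows how this one libxml-based parser behaves; it says nothing about other DOM parsers.

```php
<?php
// Hypothetical broken markup: the <b> tag is never closed.
$html = '<div id=currency_converter_result><b>1 USD = '
      . '<span class=bld>46.4400 INR</span></div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom      = new DOMDocument();
$loaded   = $dom->loadHTML($html);  // the HTML parser tries to recover from the error
$problems = libxml_get_errors();    // whatever the parser complained about, if anything
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$spans = $xpath->query('//span[contains(@class, "bld")]');
$value = $spans->length ? $spans->item(0)->nodeValue : null;

echo $value, "\n";
```

Inspecting `$problems` is one way to log markup errors without aborting the extraction.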
@meder: it's just a matter of picking your examples -- `<span>` becomes `<em>` and everything breaks. Maybe there is no need to account for a secondary class, maybe spans with a secondary class must not be considered at all (in this case DOM breaks, regex doesn't). There is no solution that is **always** good; it depends on the specifications.
kemp
But the fact is that DOM is more reliable, as clearly demonstrated by the examples. Changing it to an entirely new element would of course break it, but the simple fact that the regular expressions break and the DOM solution stays fine means it's more reliable.
meder
What if the OP does **NOT** want `<span class=bld one>` but **only** `<span class=bld>`?
kemp
I'm simply demonstrating that it's more reliable, because I didn't have to update my original solution whereas if your regex solution was in use, it would break, hence the more reliable part. Would you agree that it's more reliable in that sense?
meder
No, it is not more reliable because - as I said - the double class span might very well **not** be the one needed. Maybe the required data is contained only in `span`s with `class=bld`, while multi-class `span`s contain other things of no interest. It is more reliable just because you picked a specific example to make it so.
kemp
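
Both readings of the requirement in this thread can be written as XPath, so the choice comes down to the specification. The sketch below uses hypothetical markup and assumes the second class is written as a quoted attribute (`class="bld one"`); the helper `texts()` is made up for illustration.

```php
<?php
// Hypothetical markup: one span with exactly class "bld", one with two classes.
$html = '<div id="currency_converter_result">'
      . '<span class="bld">46.4400 INR</span>'
      . '<span class="bld one">something else</span>'
      . '</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Illustrative helper: collect the text of every node in a node list.
function texts(DOMNodeList $nodes): array
{
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $node->nodeValue;
    }
    return $out;
}

// Strict: the class attribute must be exactly "bld".
$exact = texts($xpath->query('//span[@class="bld"]'));

// Loose: the class attribute only has to contain the substring "bld".
$loose = texts($xpath->query('//span[contains(@class, "bld")]'));

print_r($exact);
print_r($loose);
```

With this markup the strict query returns only the first span and the loose query returns both. Note that in the unquoted form `<span class=bld one>`, `one` is parsed as a separate attribute, so the class value is still exactly `bld`.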
+2  A: 

I think people are going too far in this "can't use regex to parse html" holy war. There is a difference between parsing (X|HT)ML and parsing a simple string which happens to contain a few HTML tags.

According to the specifications in the question this should do:

preg_match('#<span class=bld>(.*?)</span>#', $string, $match);
$value = $match[1];
kemp
If a secondary class value gets added, boom it's broken.
meder
That's why I said **according to the specifications in the question**. Things are not always bound to change; the best solution strictly depends on the problem: if the problem is specific, you don't need a general solution.
kemp
@meder: and if a designer puts it all into another div, the XPath also breaks. Regex is much easier to maintain because it allows you to focus on the part of the page you want to extract, not the whole DOM hierarchy.
Milan Babuškov
@Milan - If you're referring to my xpath solution then no, it would still work because it only fetches span elements.
meder
You can even get away with the pattern `([0-9.]+\s*INR)`
Amarghosh
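
A quick sketch of that last pattern with `preg_match`, run against the string from the question. It assumes the target currency code is always `INR`; the variable names are illustrative.

```php
<?php
$subject = '<div id=currency_converter_result>1 USD = <span class=bld>46.4400 INR</span>';

// Digits and dots, optional whitespace, then the literal currency code.
// The leading "1" is skipped because it is followed by "USD", not "INR".
if (preg_match('/([0-9.]+\s*INR)/', $subject, $m)) {
    $value = $m[1];
} else {
    $value = null; // nothing that looks like an INR amount
}

echo $value, "\n";
```

This ignores the markup entirely, so it survives tag and class changes but breaks as soon as the currency or number format changes.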