ansaurus

Question

Is regex the right tool to find a line of HTML?

Answer 1

+3 A:

According to Jeff Atwood, you should never parse HTML using regex.

Asaph 2009-11-19 03:35:57

Ugh. That post gives absolutely no architectural rationale for why regexes are a bad idea. And the article itself even says "It's considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine."

rascher 2009-11-19 03:42:53

@rascher - The point is that it is a solved problem. Do you make raw HTTP posts in code by opening up a socket and passing hand-rolled byte arrays? No... not even for a "trivial" example, because there are libraries that do this for you. The chance that you introduce a bug because of some input you didn't count on is high, because regex is a poor way to parse HTML. It will probably take you less time to download some library and use it than it will take you to form a proper regex pattern to do what you want.

Josh 2009-11-19 03:58:08

If what you want to get from the HTML page is very simple and generally unique, it's OK to use regex, or even simple string functions. "Should never" is a strong phrase.

ghostdog74 2009-11-19 03:59:28

Regexps are perfectly good for matching a SINGLE tag `<nodename\b[^<>]*>` **after loading the whole file in memory, obviously**. Finding the closing tag and getting the whole element is a different matter.

ZJR 2009-11-19 04:05:32
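
A minimal PHP sketch of the single-tag pattern from the comment above, assuming the whole document is already in a string. The helper name findOpeningTags and the sample HTML are made up for illustration. It only finds opening tags; it makes no attempt to find the matching close.

```php
<?php
// Hypothetical helper: returns every opening tag named $name found in $html.
function findOpeningTags(string $html, string $name): array
{
    // \b stops "div" from matching "divider"; [^<>]* also spans line breaks,
    // so a tag whose attributes wrap onto the next line is still matched.
    $regex = '/<' . preg_quote($name, '/') . '\b[^<>]*>/i';
    preg_match_all($regex, $html, $matches);
    return $matches[0];
}

$html = "<p>x</p>\n<div id=\"Alpha\"\n     class=\"box\">text</div>\n<divider>";

// Only the opening <div ...> tag is returned; </div> and <divider> are skipped.
print_r(findOpeningTags($html, 'div'));
```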

Show me a regex that parses HTML and I'll show you how to break it.

Asaph 2009-11-19 04:07:15

@Asaph - Fedor Emelianenko can make unbreakable regexes that parse HTML :)

meder 2009-11-19 04:20:53

It's not about whether or not the regex can be broken, it's whether or not it's applied in a way in which it would be. For example, if you have some known HTML that you need to extract data from, it works just fine. I in fact did this just yesterday -- it doesn't matter if something absurdly malformed would break it, because it's not for sifting through user input or unknown data sets where that would be a problem. Never is rarely absolutely never.

Zurahn 2009-11-19 04:31:10

@Zurahn: You don't even need malformed HTML to break a regex HTML parser. I could do it with a simple CDATA tag.

Asaph 2009-11-19 04:35:10

Oh, and for the record: I believe certain tags are safe to parse using regex - the ones that you can't nest; i.e. <a> <form> <body> <img> etc. Just so long as the HTML is well-formed. If you can rely on it being well-formed (i.e. you have control over it) I'd say you'd be fine.

Iain Fraser 2009-11-19 06:32:24

@Iain Fraser: The tags you mentioned are most certainly *not* safe to parse with regex even on valid, well formed HTML. A script tag, CDATA, or HTML comment inside a <form> or <body> tag can validly contain </form> or </body> and trip up a naive regex. The HTML parsing problem just contains too many edge cases to be elegantly handled by a regex. And let's not even talk about malformed HTML which is all too real. For the love of God, people, just use an HTML parser!

Asaph 2009-11-19 16:28:30
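
To make the objection above concrete, here is a small PHP sketch with made-up, valid markup in which a script inside a <form> contains the text </form>. The function name naiveFormBody is hypothetical, and the pattern is deliberately the naive one.

```php
<?php
// Hypothetical helper: the naive "everything up to the first </form>" approach.
function naiveFormBody(string $html): ?string
{
    return preg_match('/<form\b[^>]*>(.*?)<\/form>/s', $html, $m) ? $m[1] : null;
}

// The </form> inside the script string is ordinary script text, not a closing tag.
$html = '<form id="f"><script>var s = "</form>";</script><input name="q"></form>';

// The lazy match stops inside the script, so the <input> is lost:
// the captured body is just: <script>var s = "
var_dump(naiveFormBody($html));
```

An HTML parser treats the script content as opaque text and would still report the <input> as a child of the form.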

@Asaph this is true, but a fairly outside possibility if you know the structure of the content you're expecting to see. If you're simply parsing a single string out of a few lines of HTML once or twice, you'd be relatively safe using the regex method. You need to remember that generally we're going to know something about the string we're parsing. There's no need to account for every possible variance of valid or invalid html. In these cases you're not so much parsing the html as grabbing a segment of a regular string.

Iain Fraser 2009-11-23 05:55:53

Asaph 2009-11-23 06:39:00

Answer 2

+1 A:

Instead of RegEx, use a parser that is made especially to handle (messy) HTML. This will make your application less brittle in case the HTML changes slightly, and you don't have to hand-craft custom RegEx each time you want to pull out a new piece of data.

See this Stack Overflow page: Mature HTML Parsers for PHP

philfreo 2009-11-19 03:36:29
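
For comparison, a rough sketch of the parser route using PHP's bundled DOMDocument and DOMXPath classes (this assumes the DOM extension is enabled). The helper name divTextById and the sample markup are made up for illustration, and the id is assumed not to contain a double quote.

```php
<?php
// Hypothetical helper: returns the trimmed text of the <div> with the given id,
// or null if the document has no such div.
function divTextById(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings about messy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quote, so it can be embedded directly.
    $nodes = $xpath->query('//div[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$html = "<html><body>\n<div id=\"Alpha\"> Blah <div>nested</div> blah </div>\n</body></html>";
echo divTextById($html, 'Alpha'), "\n";
```

Nested divs, attribute order, and line breaks inside the element need no special handling here, which is the main argument for this approach when the HTML may change.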

Answer 3

+3 A:

At the risk of providing more up-votes for Jeff who has already crossed the mountains of madness... see here

The argument rages back and forth, but... if it's a simple one-off or little-used script you are writing, then sure, use regex. If it's more complex and needs to be reliable with little future tweaking, then I'd suggest using an HTML parser. HTML is a nasty, often non-regular beast to tame. Use the right tool for the job... maybe in your case it's regex, or maybe it's a full-blown parser.

beggs 2009-11-19 03:44:20

@beggs I totally understand your point. There is no way you can create a full HTML parser with RegEx alone. However, there are certain situations and conditions where part of the HTML can be treated as just a piece of text, and for those RegEx is sufficient.

NawaMan 2009-11-19 03:59:33

@NawaMan, agreed. I'm a fan of "the right tool for the job", not "to a man with a hammer everything looks like a nail". Deciding on the right tool requires more knowledge of the project than we are likely to get here on SO; we can only help point the way.

beggs 2009-11-19 04:21:07

Answer 4

+1 A:

The fact that a unique id is involved sounds promising, but since it will be a DIV, and not necessarily a single line of HTML, it will be difficult to construct a regular expression, and the usual objections to parsing HTML with regexes apply.

Not recommended.

pavium 2009-11-19 03:45:45

Answer 5

+3 A:

Generally, NO. But if you are sure that the div will always be on one line and that there is no other div inside it, you can use regex without a problem. Something like /<div id=\"mydivid\">(.*?)<\/div>/ or similar.

Otherwise, DOMDocument would be a more sane way.

EDIT: Having seen your HTML example, my answer would be "YES". RegEx is a very good tool for this.

I assume that you have the HTML as one continuous text, not as separate lines (which would be slightly different). I also assume that you want the line number more than the line content.

Here is some rough PHP code to extract it (just to give you an idea).

$HTML =
"<html><head><title>Example</title></head>
<body>
<div id=\"Alpha\"> Blah blah blah </div>
<div id=\"Beta\"> Blah Blah Blah </div>
</body>
</html>";

$ID = "Alpha";

// Returns the 1-based line number of the one-line DIV with the given id, or -1 if not found.
function GetLineOfDIV($HTML, $ID) {
    // preg_quote() keeps special characters in $ID from being read as regex syntax.
    // With /m, ^ and $ match at the start and end of each line.
    $RegEx = '/^<div id="' . preg_quote($ID, '/') . '">.*?<\/div>$/m';
    if (!preg_match($RegEx, $HTML, $Match, PREG_OFFSET_CAPTURE))
        return -1;

    //$MatchStr  = $Match[0][0]; Since you do not want it, we comment it out.
    $MatchOffset = $Match[0][1]; // Byte offset at which the match starts

    // The line number is one more than the number of newlines before the match.
    return substr_count(substr($HTML, 0, $MatchOffset), "\n") + 1;
}

echo GetLineOfDIV($HTML, $ID);

I hope this gives you some idea.

NawaMan 2009-11-19 03:52:08

+1 for DOMDocument reference

Darren Newton 2009-11-19 04:26:19

Answer 6

+1 A:

Since the line number is important to you here and not the actual contents of the div, I'd be inclined not to use regex at all. I'd probably explode() the string into an array and loop through that array looking for your marker. Like so:

<?php
$myContent = "[your string of html here]";
$myArray = explode("\n", $myContent);
$arraylen = count($myArray); // So you don't waste time counting the array at every loop
$lineNo = 0;
for($i = 0; $i < $arraylen; $i++)
{
     $pos = strpos($myArray[$i], 'id="Alpha"');
     if($pos !== false)
     {
          $lineNo = $i+1;
          break;
     }
}
?>

Disclaimer: I haven't got a PHP installation readily available to test this, so some debugging may be required.

Hope this helps as I think it's probably just going to be a waste of time for you to implement a parsing engine just to do something so simple - especially if it's a one-off.

Edit: if the content is important to you at this stage too, then you can use this in combination with the other answers, which provide an adequate regex for the job.

Edit #2: Oh what the hey... here's my two cents:

"/<div.*?id=\"Alpha\".*?>.*?(<div.*<\/div>)*.*?<\/div>/s"

The (<div.*<\/div>) group is meant to tell the regex engine that it may find nested div tags and to just incorporate them into the match if it finds them, rather than stopping at the first </div>. However, this only addresses the problem if there is only one level of nesting. If there's more, then regex is not for you, sorry :(.

The /s modifier makes the dot match line breaks as well, so you don't have to dirty up your expressions with [\S\s] everywhere. (The /m modifier only changes where ^ and $ match.)

Again, sorry, I've no environment to test this in at the moment so you may need to debug.

Cheers Iain

Iain Fraser 2009-11-19 05:40:48

Seems to be the correct approach.

Georg 2009-11-19 05:47:17
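
For what it's worth, one way the "one level of nesting" idea could be spelled out more strictly is sketched below. This is an illustrative variant, not the pattern from the answer: the helper name matchDivOneLevel is made up, and it assumes the id attribute is written with double quotes.

```php
<?php
// Hypothetical helper: returns the whole <div id="$id">...</div> element,
// tolerating at most one level of nested <div>, or null if nothing matches.
function matchDivOneLevel(string $html, string $id): ?string
{
    // Any run of characters that does not begin an opening or closing div tag.
    $plain  = '(?:(?!<\/?div\b).)*';
    // One inner div that itself contains no further divs.
    $nested = '<div\b[^>]*>' . $plain . '<\/div>';

    $regex = '/<div\b[^>]*\bid="' . preg_quote($id, '/') . '"[^>]*>'
           . $plain . '(?:' . $nested . $plain . ')*<\/div>/s';

    return preg_match($regex, $html, $m) ? $m[0] : null;
}

$html = "<div id=\"Alpha\">a\n<div>inner</div>\nb</div>\n<div id=\"Beta\">c</div>";
echo matchDivOneLevel($html, 'Alpha'), "\n";

// A second level of nesting is beyond this pattern: the call returns null.
var_dump(matchDivOneLevel('<div id="Alpha"><div><div>x</div></div></div>', 'Alpha'));
```

Each extra level of nesting needs another hand-written layer in the pattern, which is where a real parser starts to pay for itself.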

Answer 7

A:

@OP since your requirement is that easy, you can just use string methods

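$f = fopen("file","r");
if($f){
    $i = 0; // current line number
    // fgets() returns false at end of file; without a length limit it reads whole lines
    while( ($line = fgets($f)) !== false ){
        $i += 1;
        if (stripos($line,'<div id="Alpha">')!==FALSE){
            print "line number: $i\n";
        }
    }
    fclose($f);
}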

ghostdog74 2009-11-19 06:28:48
