ansaurus

Question

How to determine if an html tag splits across multiple lines

Answer 1

+1 A:

Well, this doesn't answer the question and is more of an opinion, but...

I think that the best scraping strategy (and one that eliminates this problem as a side effect) is not to analyze HTML line by line, which is unnatural for HTML, but to analyze it by its natural delimiters: <> pairs.

There are, of course, two types:

Tag elements that are immediately closed, e.g., <br />
Tag elements that need a separate closing tag, e.g., <p>text</p>

You can immediately see the advantage of this strategy in the case of paragraph (p) tags: it is easier to parse multiline paragraphs when you don't have to track which line the closing tag is on.
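As a rough illustration of scanning by <> pairs rather than by lines, here is a minimal sketch. The function name tokenize_by_angle_brackets and the token labels are made up for this example, and it assumes simple markup: it does not handle comments, script contents, or a ">" inside an attribute value.

```php
<?php
// Hypothetical helper: split markup into tag and text tokens by scanning
// for '<' ... '>' pairs instead of reading line by line.
function tokenize_by_angle_brackets($html)
{
    $tokens = array();
    $pos = 0;
    $len = strlen($html);

    while ($pos < $len) {
        $open = strpos($html, '<', $pos);
        if ($open === false) {
            // No more tags: the rest is text.
            $tokens[] = array('text', substr($html, $pos));
            break;
        }
        if ($open > $pos) {
            // Text between the previous tag and this one.
            $tokens[] = array('text', substr($html, $pos, $open - $pos));
        }
        $close = strpos($html, '>', $open);
        if ($close === false) {
            // Unterminated tag: keep the remainder as text and stop.
            $tokens[] = array('text', substr($html, $open));
            break;
        }
        $tag = substr($html, $open, $close - $open + 1);
        if (substr($tag, 0, 2) === '</') {
            $type = 'close';
        } elseif (substr($tag, -2) === '/>') {
            $type = 'self-closing';
        } else {
            $type = 'open';
        }
        $tokens[] = array($type, $tag);
        $pos = $close + 1;
    }
    return $tokens;
}

// The <a> tag below is split across two lines; it still comes out as one token.
$html = "<p>first line\nsecond line<br />\n<a\n  href=\"x.html\">link</a></p>";
foreach (tokenize_by_angle_brackets($html) as $token) {
    echo $token[0], ': ', json_encode($token[1]), "\n";
}
```

Because the scanner never looks at line boundaries, a tag that wraps onto the next line needs no special handling.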

Jon Limjap 2008-08-29 02:16:53

Answer 2

+6 A:

This is one of my pet peeves: never parse HTML by hand. Never parse HTML with regexps. Never parse HTML with string comparisons. Always use an HTML parser to parse HTML – that's what they're there for.

It's been a long time since I've done any PHP, but a quick search turned up this PHP5 HTML parser.

Jörg W Mittag 2008-08-29 02:19:03

Answer 3

+2 A:

Don't write a parser; use someone else's. DOMDocument::loadHTML is just one, and I think there are a lot of others.
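For illustration, a minimal sketch of what that might look like, assuming PHP's DOM extension is available. The sample markup and variable names are made up for this example; the <a> tag is deliberately split across three lines.

```php
<?php
// Sample markup in which the <a> tag is split across three lines.
$html = "<p>See <a\n   href=\"http://example.com/\"\n   class=\"ext\">the site</a>.</p>";

$doc = new DOMDocument();
// Collect parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the href and link text of every <a> element.
$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = array(
        'href' => $a->getAttribute('href'),
        'text' => $a->textContent,
    );
}
print_r($links);
```

The parser works on the document as a whole, so where the line breaks fall inside a tag makes no difference to the result.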

Josh 2008-08-29 02:21:59

Answer 4

A:

Why don't you read in a line and set it to a string, then check the string for tag openings and closings? If a tag spans more than one line, add the next line to the string and move the part before the opening angle bracket to your processed string. Then just go through the entire file doing this. It's not beautiful, but it should work.
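A simplified sketch of this idea follows. It buffers lines until no tag is left open and emits the whole buffer, rather than moving the part before the opening bracket as described. The name join_split_tags is hypothetical, and the check assumes that "<" and ">" only ever appear as tag delimiters.

```php
<?php
// Hypothetical helper: merge physical lines until no tag is left open at
// the end of the buffered text, then emit it as one logical line.
function join_split_tags(array $lines)
{
    $result = array();
    $buffer = '';

    foreach ($lines as $line) {
        $buffer = ($buffer === '') ? trim($line) : $buffer . ' ' . trim($line);

        // A tag is still open if the last '<' has no '>' after it.
        $open = strrpos($buffer, '<');
        $close = strrpos($buffer, '>');
        $tagStillOpen = ($open !== false) && ($close === false || $open > $close);

        if (!$tagStillOpen) {
            $result[] = $buffer;
            $buffer = '';
        }
    }
    if ($buffer !== '') {
        // Input ended inside a tag; keep what was collected.
        $result[] = $buffer;
    }
    return $result;
}

$lines = array(
    '<p>Some text <a',
    '   href="page.html"',
    '   title="Page">link</a></p>',
    '<br />',
);
print_r(join_split_tags($lines));
```

Here the first three input lines are merged into one logical line and the fourth is passed through unchanged.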

corymathews 2008-08-29 02:42:35

Answer 5

A:

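If you've gotta stick to your current method of parsing, and it's a regex, run it against the whole text rather than one line at a time, and use the "s" (dot-all) modifier so that "." also matches newlines. Note that the multi-line modifier "m" only changes where ^ and $ match; it does not by itself let a match span multiple lines.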
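As a hedged sketch of the regex route, the snippet below matches tags in a whole string and picks out those that contain a line break. The sample markup and variable names are assumptions for the example. The pattern is deliberately naive and will misfire on comments or on a ">" inside an attribute value.

```php
<?php
// Run the pattern against the whole document, not one line at a time
// (for a file, read it with file_get_contents() rather than file()).
$html = "<div\n  id=\"main\"\n  class=\"wide\">\n<p>text</p>\n</div>";

// [^>] is a negated character class, so it matches newlines with no
// modifier; the "s" modifier is only needed when "." must match them.
preg_match_all('/<[^>]+>/', $html, $matches);

// Keep only the tags that contain a line break.
$multiline = array();
foreach ($matches[0] as $tag) {
    if (strpos($tag, "\n") !== false) {
        $multiline[] = $tag;
    }
}
print_r($multiline);
```

Of the four tags matched in the sample, only the opening div ends up in $multiline.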

ceejayoz 2008-08-29 16:18:16

Answer 6

A:

Perhaps for future projects I'll use a parsing library, but that's kind of aside from the question at hand. This is my current solution. rstrpos is strpos, but from the reverse direction. Example use:

for ($i = 0; $i < count($lines); $i++)
{
    // $i is taken by reference and advanced past any lines that get merged in
    $line = handle_multiline_tags($i, $lines[$i], $lines);
}

And here's that implementation:

function rstrpos($string, $charToFind, $relativePos)
{
    // Scan backwards, starting just before $relativePos, for $charToFind.
    // Returns the position of the match, or FALSE if there is none.
    $len = strlen($charToFind);
    for ($pos = $relativePos - $len; $pos >= 0; $pos--)
    {
        if (substr($string, $pos, $len) === $charToFind)
        {
            return $pos;
        }
    }
    return FALSE;
}

function handle_multiline_tags(&$i, $line, $lines)
{
    // If the last '<' on the line has no '>' after it, the tag is opened but
    // not closed before the line break: append the next line and check again.
    $open = rstrpos($line, '<', strlen($line));
    $close = rstrpos($line, '>', strlen($line));
    $unclosed = ($open !== FALSE) && ($close === FALSE || $open > $close);

    if ($unclosed && isset($lines[$i + 1]))
    {
        $i++;
        // Join with a space so attributes on separate lines don't run together.
        return handle_multiline_tags($i, trim($line).' '.trim($lines[$i]), $lines);
    }
    else
    {
        return trim($line);
    }
}

This could probably be optimized in some way, but for my purposes, it's sufficient.

Factor Mystic 2008-08-29 16:20:57
