I'm writing a PHP script that scrapes web pages. Currently, the script analyzes each page line by line, but it breaks when a tag spans multiple lines, like this:

<img src="example.jpg"
alt="example">

If worst comes to worst, I could preprocess the page by removing all line breaks and then re-inserting them at the nearest >, but this seems like a kludge.

Ideally, I'd be able to detect a tag that spans lines, join only those lines, and continue processing. So what's the best method to detect this?

+1  A: 

Well, this doesn't answer the question and is more of an opinion, but...

I think the best scraping strategy (and one that eliminates this problem) is not to analyze HTML line by line, which is unnatural to HTML, but to analyze it by its natural delimiters: < > pairs.

There will be two types, of course:

  • Tag elements that are immediately closed, e.g., <br />
  • Tag elements that need a separate closing tag, e.g., <p>text</p>

You can immediately see the advantage of this strategy in the case of paragraph (p) tags: it is easier to parse multiline paragraphs when you don't have to track which line the closing tag is on.
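
As a rough illustration of scanning by < > pairs rather than by lines, here is a minimal sketch. The helper name split_into_tokens is made up for this example, and the pattern assumes no > appears inside attribute values, comments, or scripts, so treat it as a starting point rather than a robust tokenizer:

```php
<?php
// Hypothetical helper: split markup into tag tokens and text tokens.
function split_into_tokens($html)
{
    // [^>]* also matches newlines, so a tag may span several lines.
    // Assumption: no '>' occurs inside attribute values, comments, or scripts.
    return preg_split(
        '/(<[^>]*>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

$html = "<p>An image:\n<img src=\"example.jpg\"\nalt=\"example\">\n</p>";

foreach (split_into_tokens($html) as $token) {
    // A token that starts with '<' is a tag; anything else is text.
    $kind = ($token[0] === '<') ? 'tag ' : 'text';
    echo $kind, ': ', json_encode($token), "\n";
}
```

Because the whole page is handled as one string, the img tag comes back as a single token even though it contains a line break.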

Jon Limjap
+6  A: 

This is one of my pet peeves: never parse HTML by hand. Never parse HTML with regexps. Never parse HTML with string comparisons. Always use an HTML parser to parse HTML – that's what they're there for.

It's been a long time since I've done any PHP, but a quick search turned up this PHP5 HTML parser.

Jörg W Mittag
+2  A: 

Don't write a parser, use someone else's: DOMDocument::loadHTML - that's just one, I think there are a lot of others.
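
For what it's worth, a minimal sketch of the DOMDocument route might look like the following. The sample markup is invented for illustration, and it assumes PHP's DOM extension is enabled:

```php
<?php
// Sample markup with a tag that spans two lines, as in the question.
$html = "<html><body><p>An image:\n<img src=\"example.jpg\"\nalt=\"example\"></p></body></html>";

$doc = new DOMDocument();

// Real-world pages are often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the attributes of every <img>, wherever its line breaks fall.
$images = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $images[] = [
        'src' => $img->getAttribute('src'),
        'alt' => $img->getAttribute('alt'),
    ];
}

print_r($images);
```

The parser does not care how the source is broken into lines, so the multi-line tag needs no special handling.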

Josh
A: 

Why don't you read in a line, store it in a string, and then check the string for tag openings and closings? If a tag spans more than one line, append the next line to the string and move the part before the opening brace to your processed string. Then just go through the entire file doing this. It's not beautiful, but it should work.

corymathews
A: 

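If you've gotta stick to your current method of parsing, and it's a regex, you can use a pattern modifier to match across multiple lines. In PHP's PCRE functions that is "s", which lets . match line breaks; the multi-line flag "m" only changes where ^ and $ match. Either way, the pattern has to be run against the whole page as one string rather than against individual lines.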
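
A minimal sketch of the regex approach, assuming the page has already been read into one string (for example with file_get_contents) instead of being split into lines. In PCRE, the s modifier is what allows . to cross line breaks; the sample markup and the img pattern are only illustrations:

```php
<?php
$html = "<p>An image:\n<img src=\"example.jpg\"\nalt=\"example\">\n</p>";

// 's' lets '.' match newlines; 'i' ignores case; '.*?' stops at the first '>'.
preg_match_all('/<img\b.*?>/is', $html, $matches);

// Optionally collapse the line breaks inside each matched tag into single spaces.
$tags = array_map(function ($tag) {
    return preg_replace('/\s+/', ' ', $tag);
}, $matches[0]);

print_r($tags);
```

This still shares the usual weaknesses of regex-based HTML handling, such as a > inside an attribute value ending the match early.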

ceejayoz
A: 

Perhaps for future projects I'll use a parsing library, but that's kind of aside from the question at hand. This is my current solution. rstrpos is strpos, but from the reverse direction. Example use:

for ($i = 0; $i < count($lines); $i++)
{
    // $i is passed by reference and is advanced when a tag spans several lines.
    $line = handle_multiline_tags($i, $lines[$i], $lines);
}

And here's that implementation:

function rstrpos($string, $charToFind, $relativePos)
{
    // Search backwards for $charToFind, looking only at the text before $relativePos.
    $pos = strrpos(substr($string, 0, $relativePos), $charToFind);

    // Return -1 rather than FALSE when nothing is found,
    // so callers can compare positions numerically.
    return ($pos === false) ? -1 : $pos;
}

function handle_multiline_tags(&$i, $line, $lines)
{
    $line = trim($line);

    // If the last '<' comes after the last '>', a tag was opened but not closed
    // before the line break, so keep appending lines until it is closed.
    while ((rstrpos($line, '<', strlen($line)) > rstrpos($line, '>', strlen($line)))
        && ($i + 1 < count($lines)))
    {
        $i++;
        // Join with a space so attributes on separate lines do not run together.
        $line .= ' '.trim($lines[$i]);
    }

    return $line;
}

This could probably be optimized in some way, but for my purposes, it's sufficient.

Factor Mystic