Let's say I have this string:

      $string = "<html>
                 <body>
                 <h1>
                 <b>aaa</b> bbbb
                 ";

I want the result to be "h1", because it is the last unclosed tag.

Another example: if the string is

     $string = "<body>
                <img src='' alt=
               ";

the result should be "img", because it is the last unclosed tag.

I know it could probably be done with regular expressions, but I am not good at using them.

+3  A: 

I doubt that it's possible to do this with just a few regular expressions, since you are not searching for a fixed pattern: which tag is still open depends on how the tags before it are nested.

I'd go through the string using a stack: every time you see an opening tag, you push it onto the stack, and every time you find the matching closing tag, you remove it from the stack.

So if you went through the first part of example1:

<html>
  <body>
    <h1>
      <b>

Your stack should be:

html, body, h1, b

Next b closes and you remove it from the stack, so your stack looks like this:

html, body, h1

Now the tag that's on top of your stack (h1) is always the one you're looking for.
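
A minimal sketch of this stack idea in PHP might look like the following. The function name `lastUnclosedTag`, the list of void elements, and the regexes are my own assumptions for illustration, and the sketch assumes PHP 7.1+ and attribute values that do not contain "<" or ">". It also treats a tag that never reached its ">" (like the img example) as the answer right away.

```php
<?php
// Hypothetical helper: returns the name of the innermost tag that is still
// open, or null if every tag has been closed.
function lastUnclosedTag(string $html): ?string
{
    // Assumption: these void elements never get a closing tag, so a complete
    // one is never pushed onto the stack.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Drop complete comments so that tags inside them are ignored.
    $html = preg_replace('/<!--.*?-->/s', '', $html) ?? $html;

    // A tag that was opened but never reached its ">" is the answer directly.
    if (preg_match('/<\s*(\w+)[^<>]*$/', $html, $m)) {
        return strtolower($m[1]);
    }

    $stack = [];
    preg_match_all('/<\s*(\/?)\s*(\w+)[^<>]*>/', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[1] === '/') {
            // Closing tag: find the nearest matching opening tag and remove
            // it together with anything that was opened after it.
            for ($i = count($stack) - 1; $i >= 0; $i--) {
                if ($stack[$i] === $name) {
                    array_splice($stack, $i);
                    break;
                }
            }
        } elseif (substr($tag[0], -2) !== '/>' && !in_array($name, $void, true)) {
            // Opening tag that is neither self-closing nor a void element.
            $stack[] = $name;
        }
    }

    // The top of the stack is the last tag that is still open.
    return $stack ? end($stack) : null;
}

echo lastUnclosedTag("<html>\n<body>\n<h1>\n<b>aaa</b> bbbb\n"), "\n"; // h1
echo lastUnclosedTag("<body>\n<img src='' alt=\n"), "\n";              // img
```

Popping everything above the matching opening tag is a design choice for sloppy input; you could also ignore closing tags that do not match the top of the stack.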

I hope you get what I mean; if not, let me know.

André Hoffmann
A: 

I almost started to write a regular expression, but I gave up after realizing that I would also have to ignore comments and strings (such as attribute values) containing text that could potentially be evaluated as a closing tag:

 $string = "<html>
                 <body>
                 <h1>
                 <b>aaa</b> bbbb
                 <!--</h1> maybe it's silly to have such a comment but who knows-->
                 ";
presario
Well, that's not really a problem: you can remove the comments, store the new string in a new variable, and then run the regex on that.
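
Something along these lines could do the comment removal. This is only a sketch: the helper name `stripHtmlComments` is made up, it only handles complete `<!-- ... -->` comments, and it does nothing about closing tags hidden inside attribute values.

```php
<?php
// Hypothetical helper: removes complete <!-- ... --> comments from a string.
function stripHtmlComments(string $html): string
{
    // The /s modifier lets "." span newlines; ".*?" stops at the first "-->".
    return preg_replace('/<!--.*?-->/s', '', $html) ?? $html;
}

$string = "<h1>\n<b>aaa</b> bbbb\n<!--</h1> a comment containing a closing tag-->\n";

// Keep the original and work on the cleaned copy.
$clean = stripHtmlComments($string);
```

After this, `$clean` no longer contains the `</h1>` from the comment, so a later regex cannot mistake it for a real closing tag.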
usoban
+3  A: 

My advice is to use a real parser, not a regex.

Max Kosyakov
I thought of that too, but the thing is that his inputs seem to be mostly invalid source code. So parsing it wouldn't be a very smart move, if you ask me.
André Hoffmann
We use htmlpurifier in our software to fix invalid code submitted by users. That is the main task of htmlpurifier.
Max Kosyakov
Yes, but he doesn't want to fix corrupted code. His input is corrupted code and he wants to work with it. A parser would fix the issues and thereby alter the result. That's why I'd vote against using a parser.
André Hoffmann
André, we use HTML purifier exactly to fix the corrupted code.
Max Kosyakov
Yes I know. Did you even read my comment? He does _NOT_ want to fix the code. He wants to know what the last element is and if you fix the code the last element is different from what it was before. That's why you can't use htmlpurifier.
André Hoffmann
A: 

The code below uses a couple of regexes to do the parsing. Beware, though, that real-world HTML might easily break it, for example when random spaces or tabs appear inside tags and code. The code includes an array of test cases to run problem code through.

The idea here is to first clean up the HTML, then remove tags that have closing tags, and finally return the last tag that remains.

<html>

<head><title>Last Open HTML Tag</title>

<body>

<h1>Last Open HTML Tag</h1>
<?php

$htmlstrings[] ="<html>
                 <body>
                 <h1>
                 <b>aaa</b> bbbb
                 ";

$htmlstrings[] ="<html>
                 <body>
                 <h3>test</h3>
                 <h1>
                 <b>aaa <i>test2</i></b> <i>test</i> bbbb
                 ";

$htmlstrings[] = "<body>
                <img src='' alt=
               ";

$htmlstrings[] = "<body>
                < img src='' alt=
               ";

$num = 1;              
foreach( $htmlstrings as $rawstring){
    // First remove whitespace in tags
    $string = preg_replace ( "/<\s*(\w)/", "<$1", $rawstring);
    $string = preg_replace ( "/<\s*\/\s*(\w)/", "</$1", $string);

    $real_matches = array();

    // Find open html tag (<a ...)
    if( preg_match( "/<(\w*)\W[^><]*$/", $string, $matches) > 0){
        $real_matches = $matches;
    // Find html tag with no end tag (<h1>...)
    } else {
        $newstring = $string;
        while( true){
            $newstring = preg_replace( "/<(\\w*)>[^<>]*<\\/\\1>/s", "", $string);
            if( $newstring == $string){
                break;
            }
            $string = $newstring;
        }
        preg_match( "/<(\\w*)>[^<>]*$/", $newstring, $matches);
        $real_matches = $matches;
    }

    echo "<p>Parse $num\n";
    $rawstring = preg_replace ( "/</is", "&lt;", $rawstring);
    $rawstring = preg_replace ( "/>/is", "&gt;", $rawstring);
    echo "<br>$rawstring\n";
    foreach( $real_matches as $match){
        $result = preg_replace ( "/</is", "&lt;", $match);
        $result = preg_replace ( "/>/is", "&gt;", $result);
         echo "<br>" . $result . "\n";
    }
    $num++;

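    // Guard against the case where neither regex found an open tag
    echo "<br>LAST OPEN TAG: " . (isset($real_matches[1]) ? $real_matches[1] : "(none)") . "\n";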
} 

?>
</body>
</html>
Simon Byholm