ansaurus

Question

How to find the last HTML tag which has not closed using regular expressions?

Answer 1

+3 A:

I doubt that it's possible to do this with just a few regular expressions, since what you are searching for is not a fixed pattern: it depends on how the tags are nested.

I'd go through the string using a stack. Every time you see an opening tag, you push it onto the stack, and every time you find the matching closing tag, you pop it off the stack.

So if you went through the first part of example1:

<html>
  <body>
    <h1>
      <b>

Your stack should be:

html,body,h1,b

Next b closes and you remove it from the stack, so your stack looks like this:

html, body, h1

Once you reach the end of the string, the tag on top of your stack (h1) is the one you're looking for.
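
A minimal sketch of the stack idea in PHP, under a few assumptions: the function name lastOpenTag and the short list of void elements are hypothetical, only complete tags (those that reach their ">") are counted, and comments or attribute values containing "<" or ">" are not handled at all.

```php
<?php
// Hypothetical helper: returns the name of the innermost tag that is still
// open at the end of $html, or null if every tag has been closed.
function lastOpenTag(string $html): ?string
{
    // Assumed (incomplete) list of void elements that never get a closing tag.
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $stack = [];

    // Collect complete tags only: optional slash, tag name, anything up to ">".
    preg_match_all('#<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>#', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if (in_array($name, $void, true)) {
            continue;                      // void elements never stay open
        }
        if ($tag[1] === '') {
            $stack[] = $name;              // opening tag: push
        } elseif (end($stack) === $name) {
            array_pop($stack);             // matching closing tag: pop
        }
        // A closing tag that does not match the top of the stack is ignored.
    }

    return $stack ? end($stack) : null;
}

echo lastOpenTag("<html>\n  <body>\n    <h1>\n      <b>aaa</b> bbbb\n"); // h1
```

Because a tag that is cut off before its ">" is skipped, input ending in an unfinished tag such as "<img src='' alt=" reports the enclosing element instead of the unfinished one.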

I hope you get what I mean, if not let me know.

André Hoffmann 2009-08-05 10:27:04

Answer 2

A:

I almost started to write a regular expression, but I gave up after realizing that I would also have to ignore comments and strings (such as attribute values) containing text that could be mistaken for a closing tag:

 $string = "<html>
                 <body>
                 <h1>
                 <b>aaa</b> bbbb
                 <!--</h1> maybe it's silly to have such a comment but who knows-->
                 ";

presario 2009-08-05 10:36:57

Well, that's not really a problem: you can remove the comments, store the result in a new variable, and then run the regex on that.
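
A rough sketch of that comment-stripping step, reusing the example string from the answer; the variable name $withoutComments is only illustrative. It assumes every comment is properly terminated with "-->" and does nothing about attribute values.

```php
<?php
$string = "<html>
<body>
<h1>
<b>aaa</b> bbbb
<!--</h1> maybe it's silly to have such a comment but who knows-->
";

// Remove complete comments. The "s" modifier lets "." span line breaks,
// and the lazy ".*?" stops at the first "-->".
$withoutComments = preg_replace('/<!--.*?-->/s', '', $string);

// The commented-out </h1> is gone, so it can no longer be mistaken
// for a real closing tag by a later regex.
var_dump(strpos($withoutComments, '</h1>')); // bool(false)
```

A comment that is itself cut off before its "-->" would be left in place, so truncated input may still need extra handling.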

usoban 2009-08-05 10:59:00

Answer 3

+3 A:

My advice is to use a real parser, not a regex.

Max Kosyakov 2009-08-05 12:42:06

I thought of that too, but the thing is that his inputs seem to be mostly invalid source code. So parsing it wouldn't be a very smart move, if you ask me.

André Hoffmann 2009-08-05 13:37:50

We use htmlpurifier in our software to fix the invalid code submitted by users. That is the main task of htmlpurifier.

Max Kosyakov 2009-08-12 05:00:22

Yes, but he doesn't want to fix corrupted code. His input is corrupted code and he wants to work with it. A parser would fix the issues and thereby alter the result. That's why I'd vote against using a parser.

André Hoffmann 2009-08-25 21:47:33

André, we use HTML purifier exactly to fix the corrupted code.

Max Kosyakov 2009-09-02 07:55:48

Yes I know. Did you even read my comment? He does _NOT_ want to fix the code. He wants to know what the last element is and if you fix the code the last element is different from what it was before. That's why you can't use htmlpurifier.

André Hoffmann 2009-09-05 11:58:26

Answer 4

A:

The code below uses a couple of regexes to do the parsing. Beware, though, that real-world HTML can easily break it, for example when random spaces, tabs, etc. appear inside tags and code. The code includes an array of test cases to run problem input through.

The idea is to first clean up the HTML, then remove every tag that has a matching closing tag, and finally return the last tag that remains.

<html>

<head><title>Last Open HTML Tag</title></head>

<body>

<h1>Last Open HTML Tag</h1>
<?php

$htmlstrings[] ="<html>
                 <body>
                 <h1>
                 <b>aaa</b> bbbb
                 ";

$htmlstrings[] ="<html>
                 <body>
                 <h3>test</h3>
                 <h1>
                 <b>aaa <i>test2</i></b> <i>test</i> bbbb
                 ";

$htmlstrings[] = "<body>
                <img src='' alt=
               ";

$htmlstrings[] = "<body>
                < img src='' alt=
               ";

$num = 1;              
foreach( $htmlstrings as $rawstring){
    // First remove whitespace inside tag openers ("< img" -> "<img", "< / b" -> "</b")
    $string = preg_replace( "/<\s*(\w)/", "<$1", $rawstring);
    // "#" is used as the delimiter so the "/" inside the pattern needs no escaping
    $string = preg_replace( "#<\s*/\s*(\w)#", "</$1", $string);

    $real_matches = array();
    $matches = array();

    // Case 1: the input ends inside an unfinished tag (<a ...)
    if( preg_match( "/<(\w*)\W[^><]*$/", $string, $matches) > 0){
        $real_matches = $matches;
    // Case 2: find the last html tag with no end tag (<h1>...)
    } else {
        $newstring = $string;
        // Repeatedly strip innermost <tag>text</tag> pairs until nothing changes
        while( true){
            $newstring = preg_replace( "/<(\\w*)>[^<>]*<\\/\\1>/s", "", $string);
            if( $newstring == $string){
                break;
            }
            $string = $newstring;
        }
        // Whatever tag is left closest to the end of the string is still open
        preg_match( "/<(\\w*)>[^<>]*$/", $newstring, $matches);
        $real_matches = $matches;
    }

    // Escape the markup so the browser shows it instead of rendering it
    echo "<p>Parse $num\n";
    echo "<br>" . htmlspecialchars( $rawstring) . "\n";
    foreach( $real_matches as $match){
        echo "<br>" . htmlspecialchars( $match) . "\n";
    }
    $num++;

    // $matches is empty when no open tag was found, so guard the index
    echo "<br>LAST OPEN TAG: " . (isset( $matches[1]) ? $matches[1] : "(none)") . "\n";
}

?>
</body>
</html>

Simon Byholm 2009-08-09 09:16:08
