Hi. I'm looking for a regular expression to extract all adjacent lines that contain no HTML block tags; the lines may still contain HTML inline tags.

For example, if I have the following text ...

bla bla bla bla
bla <code>bla bla</code> bla
bla bla bla bla
<img src="" alt="" />
bla bla bla bla
<div> bla bla bla
bla bla bla

... I would like to extract only the following lines ...

bla bla bla bla
bla <code>bla bla</code> bla
bla bla bla bla
<img src="" alt="" />
bla bla bla bla

Is this possible to do with a regular expression?

Update: I'm working with PHP and I also have a variable containing the names of those block tags. It doesn't matter if the block tag is an open tag or a close tag.

$blockTags = "h1|h2|h3|h4|h5|h6|hr|ol|ul|li|pre|blockquote|p|table|tr|td|div";
+2  A: 

Stop looking. Your task requires a parser that can understand when HTML tags open and close, and this is something that classical regular expressions can't do.

Modern regexes might be able to pull off a trick like this, but you will construct the most hideously unreadable regex the world has ever seen (well, not quite, but close), and if you ever need to change the behavior, you'll probably end up rewriting the whole thing. So write a relatively simple parser to do it for you, and don't spend hours trying to concoct some regex that someone else will later spend hours trying to understand.

By the way, if you ask a regex question, specify what language you're using. They work slightly differently in different languages.

Chris Lutz
Yes, I'll specify a language. But it doesn't matter if the tag is an open tag or a close tag.
Kiewic
When you said modern regexes might be able to pull off a trick like this, were you referring to backreferences? E.g. grouping subexpressions with parentheses and recalling the value they match in the same expression? Allowing unlimited backreferences actually makes pattern matching NP-complete.
Sean A.O. Harney
He doesn't care about open/close tags. He just wants lines that don't contain HTML divs or other tags. Don't really need a 'parser' for this, because it doesn't matter if it is in or outside of a block element.
Chacha102
Okay, there is a mix-up - do you want to get rid of lines that _contain_ HTML block tags, or get rid of lines _inside_ HTML block tags? Because the example in your post shows _inside_, but you seem to keep asking _contain_.
Chris Lutz
+1  A: 

Well, you could first pick out the lines that don't contain any HTML tags at all, with something like

^[^<>]*$

and then check if the line has any html inline-tags:

<(/?)(code|img|...)(/?)>

The remaining lines would then be assumed to contain block tags.
I don't know if this is accurate enough for you, though.
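
As a rough sketch of this two-step idea in PHP (not tested against real-world HTML): the helper name isInlineOnly and the $inlineTags list are assumptions made for illustration, and the tag pattern is loosened to allow attributes such as those on the <img> line in the question.

```php
<?php
// Hypothetical helper: decide whether a line contains only text and inline tags.
// $inlineTags is an assumed alternation such as 'code|img'; it is not from the question.
function isInlineOnly(string $line, string $inlineTags): bool
{
    // Step 1: a line without any angle brackets contains no tags at all.
    if (preg_match('/^[^<>]*$/', $line)) {
        return true;
    }

    // Step 2: remove every inline tag (opening, closing, or self-closing,
    // with optional attributes), then see whether any tag is left over.
    $stripped = preg_replace('#</?(?:' . $inlineTags . ')\b[^>]*>#i', '', $line);

    // Anything tag-like that survived is assumed to be a block tag.
    return !preg_match('/<[^>]*>/', $stripped);
}
```

Any tag that is missing from $inlineTags is treated as a block tag, so the result is only as accurate as that list.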

x3ro
+1  A: 

Hi,

This is not "only one regex", but it should do the job, assuming your input string is in $str:

$lines = explode(PHP_EOL, $str);
$linesToKeep = array();

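foreach ($lines as $line) {
    // Keep the line unless it contains an opening or closing block tag
    // (attributes allowed, case-insensitive).
    if (!preg_match('#</?(' . $blockTags . ')\b[^>]*>#i', $line)) {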
        $linesToKeep[] = $line;
    }
}

// Et voila ;-)
$strOK = implode(PHP_EOL, $linesToKeep);
var_dump($strOK);

In a few words:

  • it explodes the string so it can work line by line (as you want to keep or reject whole lines)
  • it loops over the lines
  • if the line doesn't contain <TAG> or </TAG>, it is put in the $linesToKeep array
  • in the end, the output string is built from what's in that array

Maybe there are shorter ways to do it, though... But that one is easy enough to understand, I guess (not some kinda "regex hell" or whatever that no one would be able to maintain ^^ )
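
One shorter variant, as a sketch: preg_grep() with the PREG_GREP_INVERT flag returns the array entries that do not match a pattern, which replaces the explicit loop. The sample $str is made up for illustration, and the pattern is slightly looser than the one above (it allows attributes and ignores case).

```php
<?php
// Assumed inputs: $blockTags mirrors the question, $str is a made-up sample.
$blockTags = 'h1|h2|h3|h4|h5|h6|hr|ol|ul|li|pre|blockquote|p|table|tr|td|div';
$str = "bla bla\nbla <code>bla</code> bla\n<div> bla\nbla";

// \R matches any line ending, so the input does not have to use PHP_EOL.
$lines = preg_split('/\R/', $str);

// PREG_GREP_INVERT keeps the lines that do NOT contain a block tag.
$linesToKeep = preg_grep('#</?(?:' . $blockTags . ')\b[^>]*>#i', $lines, PREG_GREP_INVERT);

$strOK = implode(PHP_EOL, $linesToKeep);
```

Note that preg_grep() preserves the original array keys, so wrap the result in array_values() if consecutive indexes are needed.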

Edit: As I was re-reading the OP, I noticed the last line was excluded there, while my code keeps it... If you want to exclude a line with an opening tag and the one just after it, here's another proposal:

$lines = explode(PHP_EOL, $str);
$linesToKeep = array();
$i = 0;
$numLines = count($lines);

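for ($i=0 ; $i<$numLines ; $i++) {
    $line = $lines[$i];
    // Keep the line unless it contains an opening or closing block tag
    // (attributes allowed, case-insensitive).
    if (!preg_match('#</?(' . $blockTags . ')\b[^>]*>#i', $line)) {
        $linesToKeep[] = $line;
    } else {
        if (preg_match('#<(' . $blockTags . ')\b[^>]*>#i', $line)) {
            // Opening tag, skip next line too ?
            $i++;
        }
    }
}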

$strOK = implode(PHP_EOL, $linesToKeep);
var_dump($strOK);

And if you want to skip lines until the closing tag, you can do that where I put $i++ -- but it's starting to become harder to read / understand ^^ (And "parsing" HTML by hand might not be such a good idea if you want to get to something complicated ^^ )
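
A minimal sketch of the "skip until the closing tag" idea, using a depth counter instead of $i++. It assumes every block tag except hr is explicitly closed (unclosed <p> or <li> tags would leave the counter stuck above zero), and the sample $str is made up for illustration.

```php
<?php
// Assumed inputs: $blockTags mirrors the question, $str is a made-up sample.
$blockTags = 'h1|h2|h3|h4|h5|h6|hr|ol|ul|li|pre|blockquote|p|table|tr|td|div';
$str = "keep 1\n<div> drop\ndrop\n</div>\n<hr />\nkeep 2";

$any   = '#</?(?:' . $blockTags . ')\b[^>]*>#i';      // any block tag at all
$open  = '#<(?!hr\b)(?:' . $blockTags . ')\b[^>]*>#i'; // opening tags; hr never needs a close
$close = '#</(?:' . $blockTags . ')\s*>#i';            // closing tags

$depth = 0; // number of block elements currently open
$linesToKeep = array();

foreach (preg_split('/\R/', $str) as $line) {
    // Keep a line only when we are outside every block element
    // and the line itself carries no block tag.
    if ($depth === 0 && !preg_match($any, $line)) {
        $linesToKeep[] = $line;
    }
    // Update the nesting depth, never letting it drop below zero.
    $depth = max(0, $depth + preg_match_all($open, $line) - preg_match_all($close, $line));
}

$strOK = implode(PHP_EOL, $linesToKeep);
```

This is still line-based guessing rather than real HTML parsing, so it will misjudge markup that a proper parser would handle.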

Pascal MARTIN
Hey, your first attempt is better, it just needs a break in an else branch. To simulate a preg_replace() call, I just join $linesToKeep, make the changes, and join the result with the rest of the lines.
Kiewic
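
The early-exit variant described in the comment above could look roughly like this. It is a sketch, not the commenter's actual code: it stops at the first line that contains a block tag, so only the leading run of adjacent lines is kept. $str reproduces the example from the question, and the pattern additionally allows attributes and ignores case.

```php
<?php
// $blockTags is taken from the question; $str rebuilds the question's example.
$blockTags = 'h1|h2|h3|h4|h5|h6|hr|ol|ul|li|pre|blockquote|p|table|tr|td|div';
$str = implode("\n", array(
    'bla bla bla bla',
    'bla <code>bla bla</code> bla',
    'bla bla bla bla',
    '<img src="" alt="" />',
    'bla bla bla bla',
    '<div> bla bla bla',
    'bla bla bla',
));

$linesToKeep = array();
foreach (preg_split('/\R/', $str) as $line) {
    if (preg_match('#</?(?:' . $blockTags . ')\b[^>]*>#i', $line)) {
        // The first block tag ends the run of adjacent lines.
        break;
    }
    $linesToKeep[] = $line;
}

$strOK = implode(PHP_EOL, $linesToKeep);
```

On the question's example this keeps the first five lines and drops everything from the <div> line onward.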