Given a string like this:

<a href="http://blah.com/foo/blah">This is the foo link</a>

... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:

<a href="http://blah.com/foo/blah">This is the <b>foo</b> link</a>

However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
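To make the failure concrete, here is a minimal sketch of the naive approach. The helper name naive_highlight is made up for illustration; it is not part of the actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps every occurrence of $key in <b>...</b>,
# with no awareness of where tags begin or end.
sub naive_highlight {
    my ($html, $key) = @_;
    my $pat = quotemeta $key;    # treat the search term literally
    $html =~ s/($pat)/<b>$1<\/b>/g;
    return $html;
}

print naive_highlight('<a href="http://blah.com/foo/blah">This is the foo link</a>', 'foo'), "\n";
# The "foo" inside the href is wrapped as well, which corrupts the URL.
```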

So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?

Note: I promise that the HTML in question will never be anything pathological like:

<img title="Haha! Here are some angle brackets to screw you up: ><" />

Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.

Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."

Now multiply the above across a great many users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is that the search term is now in boldface.

So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.

+5  A: 

In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:

#!/usr/bin/env perl

use strict;
use warnings;
use feature ':5.10';

use Template::Refine::Fragment;
use Template::Refine::Utils qw(simple_replace);

my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world.  <a href="http://foo.com/">This is a test of foo finding.</a>  Here is another foo.');

say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;

This outputs:

<p>Hello, world.  <a href="http://foo.com/">This is a test of &lt;foo&gt; finding.</a>  Here is another &lt;foo&gt;.</p>

Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".

Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)

jrockway
Okay, but I'm autogenerating all the HTML. It's extremely simple HTML. I can't in good faith justify bringing in an entire heavyweight library just to slap bold tags around a few strings.
raldi
Do what you want. My time is not wasted when you reinvent a square wheel. (Parsing HTML with regexes is very difficult. As your examples show, it's hard to get right.)
jrockway
Regexes fail when considering comments and CDATA sections. (Regex-based parsers are fine, but you need to store more state than regexes can store alone. That's why you have a parser instead of a random regex.)
jrockway
I generate the HTML myself. It doesn't have comments or CDATA sections. The script is 25 lines. I'm not going to add a dependency on an external file -- what you're proposing is the definition of overengineering.
raldi
But you see, I already did the engineering for you. Reuse... have you heard of it?
jrockway
Bloat: Have you heard of it?
raldi
Stupid back and forth: have either of you heard of it?
Dave Rolsky
http://xkcd.com/386/ :)
raldi
True, that comic _is_ jrockway, but you _are_ wrong.
Dave Rolsky
It takes a lot of hubris (and not the good kind) to presume that you know more about the nature of my project than I do. In this case, the convenience of being able to copy a single file around -- no CPAN requirements -- far outweighs the benefits of being able to parse CDATA and other such
raldi
...forms of complex HTML that it'll never actually be asked to parse.
raldi
@raldi: Two things. First, the risk here is that even though you generate the HTML yourself, your requirements may change someday. Using a pre-built solution eases the burden of making it work in the face of changing requirements. Of course, only you can judge whether that's reasonable here.
Adam Bellaire
@raldi: Second, even if this isn't the best solution for you, it's good for this solution to be here so that if someone else with a similar (but not the same) problem finds your question here, jrockway's answer might work when the one for your exact requirements won't.
Adam Bellaire
Exactly. I am explaining how to do this generically. If you want specific help for your exact needs, my usual consulting rate applies :)
jrockway
@Adam: If my requirements change someday, and it turns out this was a huge mistake, I can undo it by removing a single line of code. It's not a big concern here.
raldi
+6  A: 

If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:

s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
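A minimal sketch of how that substitution might be wrapped for use in a script. The helper name highlight and the quotemeta call are assumptions added for illustration, and the same restriction applies: angle brackets may only ever delimit tags.

```perl
use strict;
use warnings;

# Hypothetical helper: bolds each occurrence of $key that appears in text,
# skipping anything between '<' and '>'.
sub highlight {
    my ($html, $key) = @_;
    my $pat = quotemeta $key;    # assumption: the key is a literal string, not a pattern
    # A match starts either just after a '>' or where the previous match ended (\G),
    # then runs through non-'<' characters until it reaches the key.
    $html =~ s%(>|\G)([^<]*?)($pat)%$1$2<b>$3</b>%g;
    return $html;
}

print highlight('<a href="http://blah.com/foo/blah">This is the foo link</a>', 'foo'), "\n";
# prints: <a href="http://blah.com/foo/blah">This is the <b>foo</b> link</a>
```

Since [^<] stops at the first '<', a match can never extend into a tag, and it can only begin at the start of the string, after a '>', or where the previous match left off.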
David Zaslavsky
True... this is part of why others are saying that you really should be using an HTML parser rather than a simple regex. And I actually agree with them, but if you really want to use s/// then knock yourself out ;-)
David Zaslavsky
These are all broken. Try to highlight "foo" in "foo<blafoo>foo blabla foo\n fooo</foo>"
vladr
Reinventing the wheel is so fun!
jrockway
Now this is interesting, an accepted answer with -3 votes... I should have deleted it :-(
David Zaslavsky
@Vlad: Thanks for the test case -- but again, I generate the HTML myself. It can only take one of a small number of possible forms, and that's not one of them. Still, I've updated the regex to handle your test case.
raldi
@raldi: I stand corrected.
vladr
Voted up: it's bullshit to vote someone down for trying to answer the OP's question as the OP wants. Yes, you might think it's reinventing the wheel, and everyone knows you can't write a full, proper HTML parser with a regex. But the OP wants what he wants (and has reasons). No sense kicking David.
Telemachus
Thanks for the support ;-) (I gave you a random upvote for that comment)
David Zaslavsky
(\G|>) is sufficient; pos() is reset when starting a s///g.
ysth
+1  A: 

The following regex will match all text between tags or outside of tags:

<.*?>(.*?)<.*?>|>(.*?)<

Then you can operate on that as desired.
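As a rough sketch of what "operate on that" could look like, the loop below collects whatever the regex captures. The helper name text_segments is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text captured by each match of the regex.
sub text_segments {
    my ($html) = @_;
    my @segments;
    while ($html =~ /<.*?>(.*?)<.*?>|>(.*?)</g) {
        # Only one of the two alternatives matches, so only one group is defined.
        push @segments, defined $1 ? $1 : $2;
    }
    return @segments;
}

print "$_\n" for text_segments('<a href="http://blah.com/foo/blah">This is the foo link</a>');
# prints: This is the foo link
```

Note that each match consumes the tags on both sides, so text sitting between two consecutive matches can be skipped, and text before the first tag is never captured. It is worth checking this against the actual markup before relying on it.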

mletterle