ansaurus

Question

How do I match text in HTML that's not inside tags?

Answer 1

+5 A:

In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:

#!/usr/bin/env perl

use strict;
use warnings;
use feature ':5.10';

use Template::Refine::Fragment;
use Template::Refine::Utils qw(simple_replace);  # provides simple_replace

my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world.  <a href="http://foo.com/">This is a test of foo finding.</a>  Here is another foo.');

say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;

This outputs:

<p>Hello, world.  <a href="http://foo.com/">This is a test of &lt;foo&gt; finding.</a>  Here is another &lt;foo&gt;.</p>

Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".

Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)

jrockway 2009-02-22 04:15:06

Okay, but I'm autogenerating all the HTML. It's extremely simple HTML. I can't in good faith justify bringing in an entire heavyweight library just to slap bold tags around a few strings.

raldi 2009-02-22 04:17:59

Do what you want. My time is not wasted when you reinvent a square wheel. (Parsing HTML with regexes is very difficult. As your examples show, it's hard to get right.)

jrockway 2009-02-22 04:24:23

Regexes fail when considering comments and CDATA sections. (Regex-based parsers are fine, but you need to store more state than regexes can store alone. That's why you have a parser instead of a random regex.)

jrockway 2009-02-22 04:42:36

I generate the HTML myself. It doesn't have comments or CDATA sections. The script is 25 lines. I'm not going to add a dependency on an external file -- what you're proposing is the definition of overengineering.

raldi 2009-02-22 05:26:59

But you see, I already did the engineering for you. Reuse... have you heard of it?

jrockway 2009-02-22 05:30:13

Bloat: Have you heard of it?

raldi 2009-02-22 05:32:18

Stupid back and forth: have either of you heard of it?

Dave Rolsky 2009-02-22 05:34:40

http://xkcd.com/386/ :)

raldi 2009-02-22 05:43:13

True, that comic _is_ jrockway, but you _are_ wrong.

Dave Rolsky 2009-02-22 05:44:40

It takes a lot of hubris (and not the good kind) to presume that you know more about the nature of my project than I do. In this case, the convenience of being able to copy a single file around -- no CPAN requirements -- far outweighs the benefits of being able to parse CDATA and other such

raldi 2009-02-22 05:58:00

...forms of complex HTML that it'll never actually be asked to parse.

raldi 2009-02-22 05:58:44

@raldi: Two things. First, the risk here is that even though you generate the HTML yourself, your requirements may change someday. Using a pre-built solution eases the burden of making it work in the face of changing requirements. Of course, only you can judge whether that's reasonable here.

Adam Bellaire 2009-02-22 15:21:07

@raldi: Second, even if this isn't the best solution for you, it's good for this solution to be here so that if someone else with a similar (but not the same) problem finds your question here, jrockway's answer might work when the one for your exact requirements won't.

Adam Bellaire 2009-02-22 15:22:06

Exactly. I am explaining how to do this generically. If you want specific help for your exact needs, my usual consulting rate applies :)

jrockway 2009-02-22 16:39:19

@Adam: If my requirements change someday, and it turns out this was a huge mistake, I can undo it by removing a single line of code. It's not a big concern here.

raldi 2009-02-24 03:03:45

Answer 2

+6 A:

If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:

s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
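
To see what that substitution does, here is a minimal sketch using core Perl only. The function name highlight and the sample HTML are made up for illustration, and the \Q...\E around $key is an addition that is not in the one-liner above (it makes the key match literally instead of being interpreted as a pattern). It still assumes that < and > appear only as tag delimiters:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Wrap each occurrence of $key that sits in text (not inside a tag) in <b>...</b>.
# Assumption: '<' and '>' occur only as tag delimiters, so no comments,
# no CDATA, and no '>' inside attribute values.
sub highlight {
    my ($html, $key) = @_;
    # (>|\G)   start right after a tag, or where the previous match ended
    # [^<]*?   skip text, but never cross into the next tag
    # \Q..\E   treat $key as a literal string (added for this sketch)
    $html =~ s%(>|\G)([^<]*?)(\Q$key\E)%$1$2<b>$3</b>%g;
    return $html;
}

my $html = '<p>Hello foo, <a href="http://foo.com/">a foo link</a> and foo.</p>';
print highlight($html, 'foo'), "\n";
# prints: <p>Hello <b>foo</b>, <a href="http://foo.com/">a <b>foo</b> link</a> and <b>foo</b>.</p>
```

The foo inside the href is left alone because, once the scan is inside a tag, neither > nor \G can match until that tag closes.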

David Zaslavsky 2009-02-22 04:26:05

True... this is part of why others are saying that you really should be using an HTML parser rather than a simple regex. And I actually agree with them, but if you really want to use s/// then knock yourself out ;-)

David Zaslavsky 2009-02-22 04:30:22

These are all broken. Try to highlight "foo" in "foo<blafoo>foo blabla foo\n fooo</foo>"

vladr 2009-02-22 04:36:53

Reinventing the wheel is so fun!

jrockway 2009-02-22 04:43:42

Now this is interesting, an accepted answer with -3 votes... I should have deleted it :-(

David Zaslavsky 2009-02-22 05:20:30

@Vlad: Thanks for the test case -- but again, I generate the HTML myself. It can only have one of a small number of possible forms, and that's not one of them. Still, I've updated the regex to handle your test case.

raldi 2009-02-22 05:37:50

@raldi: I stand corrected.

vladr 2009-02-22 06:05:02

Voted up: it's bullshit to vote someone down for trying to answer the OP's question as the OP wants. Yes, you might think it's reinventing the wheel, and everyone knows you can't write a full, proper HTML parser with a regex. But the OP wants what he wants (and has reasons). No sense kicking David.

Telemachus 2009-02-22 12:29:53

Thanks for the support ;-) (I gave you a random upvote for that comment)

David Zaslavsky 2009-02-22 12:54:40

(\G|>) is sufficient; pos() is reset when starting a s///g.

ysth 2009-02-22 18:55:19

Answer 3

+1 A:

The following regex will match all text between tags or outside of tags:

<.*?>(.*?)<.*?>|>(.*?)<

Then you can operate on that as desired.
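
For the "operate on that" step, one possible sketch follows. Note that it swaps in a different technique from the regex above: it uses split with a capturing group so that tags and text come back as alternating list elements. The helper name map_text_outside_tags and the sample input are hypothetical, and it makes the same assumption of simple HTML with no comments, CDATA, or > inside attribute values:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Apply $callback to every run of text that lies outside <...> tags.
# Assumption: every '<' opens a tag and the next '>' closes it.
sub map_text_outside_tags {
    my ($html, $callback) = @_;
    # Because the separator pattern is captured, split returns the tags too,
    # so the list alternates: text, tag, text, tag, ...
    my @parts = split /(<[^>]*>)/, $html;
    for my $i (0 .. $#parts) {
        next if $i % 2;    # odd indexes hold tags; leave them untouched
        $parts[$i] = $callback->($parts[$i]);
    }
    return join '', @parts;
}

my $out = map_text_outside_tags(
    '<p>Hello <a href="x">foo</a> bar</p>',
    sub { uc shift },      # any text transformation would do here
);
print "$out\n";
# prints: <p>HELLO <a href="x">FOO</a> BAR</p>
```

The callback only ever sees text, so a plain s/// inside it cannot touch tag names or attributes.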

mletterle 2009-02-22 04:29:38
