ansaurus

Question

Search and replace text contents of a tag

Answer 1

A:

What language? In Perl, try s/\// \/ /g.

Majd Taby 2009-03-04 06:04:12

This would screw up his URLs. I don't think that's what he wants.

Chris Lutz 2009-03-04 06:06:04

Answer 2

+3 A:

This isn't really the kind of thing regular expressions are good at doing. You'll probably be better off using an HTML or XML parser - it creates a tree of nodes out of the document, and then you can just step through all the text nodes that are inside of tags and add spaces as needed.
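As a rough illustration of the "only touch the text nodes" idea, here is a minimal JavaScript sketch. It is not a real parser and assumes no parser library: it stands in for one by splitting the markup into tag tokens and text tokens, then editing only the text tokens. The function name `spaceSlashesInText` and the sample markup are made up for the example. A proper HTML or XML parser would also cope with comments, CDATA sections, and `>` inside attribute values, which this sketch does not.

```javascript
// Hypothetical helper: a tag/text tokenizer standing in for a real parser.
function spaceSlashesInText(markup) {
  // split() with a capturing group keeps the separators (the tags) in the
  // result, so odd indexes hold tags and even indexes hold the text between them.
  const tokens = markup.split(/(<[^>]*>)/);
  return tokens
    .map((token, i) =>
      // Leave tags untouched; normalise the spacing around "/" in text only.
      i % 2 === 1 ? token : token.replace(/\s*\/\s*/g, " / ")
    )
    .join("");
}

// Assumed sample input, not taken from the question.
const sample = '<p><a href="http://example.com/a/b">cats/dogs</a> and/or birds</p>';
console.log(spaceSlashesInText(sample));
// → <p><a href="http://example.com/a/b">cats / dogs</a> and / or birds</p>
```

The URL inside the `href` attribute is part of a tag token, so its slashes are never seen by the replacement.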

David Zaslavsky 2009-03-04 06:06:18

Answer 3

+2 A:

This Regex should do the trick:

(\s*/\s*(?=[^<>]+<))

It will only replace the '/' within tags and not URLs.

In C#:

 myHtml = Regex.Replace(myHtml, @"(\s*/\s*(?=[^<>]+<))", " / ");

In Perl:

$myHtml =~ s!(\s*/\s*(?=[^<>]+<))! / !g;

In JavaScript:

myHtml = myHtml.replace(/(\s*\/\s*(?=[^<>]+<))/g, " / ");

Note:

in these examples, the whole document must be loaded in the myHtml string.
If you work on a single line at a time, it obviously won't work if there are newlines inside the tags or in-between tag pairs.
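To make the note concrete, here is a small JavaScript sketch comparing a whole-document replace with a line-by-line replace. The sample markup and variable names are assumptions for illustration only.

```javascript
// Same pattern as in the answer above.
const slashPattern = /(\s*\/\s*(?=[^<>]+<))/g;

// Assumed sample: the link text is split across two lines right after the "/".
const doc = '<a href="page.html">cats/\ndogs</a>';

// Whole document in one string: the lookahead can see past the newline.
const whole = doc.replace(slashPattern, " / ");

// One line at a time: the "/" is the last character of its line,
// so the lookahead never finds the "<" of the closing tag.
const perLine = doc
  .split("\n")
  .map((line) => line.replace(slashPattern, " / "))
  .join("\n");

console.log(JSON.stringify(whole));   // → "<a href=\"page.html\">cats / dogs</a>"
console.log(perLine === doc);         // → true (nothing was replaced)
```

Note that in the whole-document case the `\s*` after the slash also swallows the newline, so the two lines are joined.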

Renaud Bompuis 2009-03-04 06:08:37

This works perfectly, thanks! Still trying to wrap my head round it though! ;)

Block 2009-03-04 06:17:35

The Regex has a positive lookahead to only match those / that are followed by an opening tag bracket. So if the / is in a URL, it won't match because it is followed by a closing tag bracket.
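A short JavaScript sketch of that behaviour, using a made-up sample link: the slashes in the `href` URL are followed by `>` before any `<`, so the lookahead fails there, while the slash in the link text is followed by the `<` of the closing tag.

```javascript
// The pattern from the answer, unchanged.
const slashInText = /(\s*\/\s*(?=[^<>]+<))/g;

// Assumed sample input, not taken from the question.
const html = '<a href="http://example.com/a/b">cats/dogs</a>';

const result = html.replace(slashInText, " / ");
console.log(result);
// → <a href="http://example.com/a/b">cats / dogs</a>
```

The slash in `</a>` is left alone as well, because what follows it is `a>` rather than text ending in `<`.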

Renaud Bompuis 2009-03-04 06:22:51

It will not work if the closing tag is on a different line though. It may or may not be a problem, but at least you should document it.

mirod 2009-03-04 06:28:25

@mirod: not sure what you mean, because in my tests it works regardless of how many newlines are in the input string, even if they split the text within the tags, or if they split the content of the HTML tags themselves.

Renaud Bompuis 2009-03-04 06:40:03

It doesn't work when you use the regexp to loop on lines (as in perl -p).

mirod 2009-03-04 06:53:34

Of course it can't work if you're working on a single line at a time. It will work if, like in my examples, you work on the whole document only. I thought it was obvious but thanks for clearing that up.

Renaud Bompuis 2009-03-04 07:05:05

Added note to clarify usage.

Renaud Bompuis 2009-03-04 07:08:37

@mirod, for what I'm doing (formatting HTML in a text editor) Renaud's regex works well even across line breaks. Note I have the multiline flag enabled.

Block 2009-03-04 07:09:04

OK, one last thing that can go wrong: > is allowed in the text, so you could (at least theoretically!) have the text of a link that's '3x/2 > 4', in which case the / would not be expanded.
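The case can be checked with a few lines of JavaScript. The sample links below are made up for the illustration; the pattern is the one from the answer.

```javascript
// The pattern from the answer, unchanged.
const pattern = /(\s*\/\s*(?=[^<>]+<))/g;

// Link text containing a literal ">" after the slash.
const rawLink = '<a href="page.html">3x/2 > 4</a>';
// Same link with the ">" written as an entity.
const escapedLink = '<a href="page.html">3x/2 &gt; 4</a>';

// The literal ">" stops the lookahead before it reaches "<", so nothing changes.
console.log(rawLink.replace(pattern, " / ") === rawLink);
// → true

// With the entity there is no ">" in the way and the slash is expanded.
console.log(escapedLink.replace(pattern, " / "));
// → <a href="page.html">3x / 2 &gt; 4</a>
```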

mirod 2009-03-04 08:18:15

> in text should be escaped to &sup;; otherwise it's not HTML, even though browsers are in general tolerant of it. If the HTML is malformed, then there is no simple answer to the question, because you can always construct HTML that would fail a simple regex; even most parsers expect some compliance.

Renaud Bompuis 2009-03-04 08:41:04

No, > is allowed both in XML and in HTML. Neither tidy nor xmlwf complain about it, even though they escape it by default.

mirod 2009-03-04 09:15:26

I meant &gt;, not &sup;. The HTML spec says that you should avoid use of < and > literals to avoid confusion in text fragments. http://www.w3.org/TR/html401/charset.html#entities Anyway, it's beside the point. I never said my Regex could pass all tests, it simply answers the original question.

Renaud Bompuis 2009-03-04 09:32:07

I understand that in practice, in the specific case of reformatting HTML in a text editor, the regexp probably works well enough. I just want to avoid someone with a similar problem in a different context being bitten by its limitations. (and _should_ in the W3C TR means that > is indeed allowed)

mirod 2009-03-04 09:59:44

Answer 4

A:

I think we're lacking a bit of context here. Is the data HTML, XML, or just fragments of text with tags?

If it is HTML or XML, as mentioned often, regexps are not safe, unless you control exactly the format of the data, and you know that you will always control it. And you document it.

I would use an appropriate parser if I were you. If you have Perl and XML::Twig installed, the following one-liner will do:

perl -MXML::Twig -e'XML::Twig->parse( keep_spaces => 1, "my_file.xml")->subs_text( "/", " / ")->print'

If you're dealing with well-formed XML with no comments and no CDATA sections, then a more efficient way would be to use PYX (you need to install XML::PYX):

pyx my_file.xml | perl -p -e's{/}{ / }g if m{^-}' | pyxw

mirod 2009-03-04 06:20:33

Thanks for the tip on using TWIG!

Block 2009-03-04 07:10:01

No problem, considering I wrote XML::Twig, it might even be considered a shameless plug ;--)

mirod 2009-03-04 07:38:53

Answer 5

A:

If you need to, you could try using a regex to extract the text between two tags, and then process that, and then re-insert it, but this task is probably more complicated than a single regex due to your constraints.

Here's something in Perl that works (but doesn't use regexes):

my (@a, $in_tag);
foreach(split //, $string) { # assuming $string holds our string
  $in_tag = 1 if $_ eq "<";
  $in_tag = 0 if $_ eq ">";
  if($_ eq "/" and not $in_tag) {
    push @a, " ", "/", " ";
  }
  else {
    push @a, $_;
  }
}
$string = join "", @a;

This, however, is not a regex, but a very simple parser.

Chris Lutz 2009-03-04 06:28:21
