I am trying specifically to pad out the /'s in the text of an a tag.

1234/1234/ABCDE => 1234 / 1234 / ABCDE

In context; if I have an a tag:

<a href="http://www.domain.com/path/to/page.html">12 34/1234A/BC DEFG</a>

I would like to get:

<a href="http://www.domain.com/path/to/page.html">12 34 / 1234A / BC DEFG</a>
A: 

What language? In Perl, try s/\// \/ /g.

Majd Taby
This would screw up his URLs. I don't think that's what he wants.
Chris Lutz
+3  A: 

This isn't really the kind of thing regular expressions are good at doing. You'll probably be better off using an HTML or XML parser - it creates a tree of nodes out of the document, and then you can just step through all the text nodes that are inside of tags and add spaces as needed.
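
As a rough illustration of the text-node approach, here is a minimal JavaScript sketch. The function name padSlashesInTextNodes is made up for this example, and it assumes a DOM-style tree (nodeType, nodeValue, childNodes) such as the one a browser's DOMParser produces. Attribute values like href are not child nodes, so URLs are never visited.

```javascript
// Hypothetical helper, not from any library: pads "/" in every text node
// below `node`. Assumes DOM-style nodes (nodeType, nodeValue, childNodes).
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM

function padSlashesInTextNodes(node) {
  if (node.nodeType === TEXT_NODE) {
    // Collapse any whitespace around each slash into one space per side.
    node.nodeValue = node.nodeValue.replace(/\s*\/\s*/g, " / ");
    return;
  }
  // Copy the child list first so the walk is unaffected by live collections.
  for (const child of Array.from(node.childNodes || [])) {
    padSlashesInTextNodes(child);
  }
}

// In a browser (assumption: DOMParser is available):
//   const doc = new DOMParser().parseFromString(myHtml, "text/html");
//   doc.querySelectorAll("a").forEach(padSlashesInTextNodes);
```

Restricting the walk to a elements, as in the commented usage, keeps slashes elsewhere in the page untouched.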

David Zaslavsky
+2  A: 

This Regex should do the trick:

(\s*/\s*(?=[^<>]+<))

It will only replace the '/' within tags and not URLs.

In C#:

 myHtml = Regex.Replace(myHtml, @"(\s*/\s*(?=[^<>]+<))", " / ");

In Perl:

$myHtml =~ s!(\s*/\s*(?=[^<>]+<))! / !g;

In JavaScript:

myHtml = myHtml.replace(/(\s*\/\s*(?=[^<>]+<))/g, " / ");
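
As a quick check, the JavaScript version can be run against the sample link from the question. This is only a sketch; the padSlashes wrapper and the constant name are invented for the example.

```javascript
// Same pattern as above: a slash (with optional surrounding whitespace)
// that is followed by non-bracket characters and then "<".
const SLASH_IN_TEXT = /(\s*\/\s*(?=[^<>]+<))/g;

// Hypothetical wrapper around the replace call shown above.
function padSlashes(html) {
  return html.replace(SLASH_IN_TEXT, " / ");
}

const sample =
  '<a href="http://www.domain.com/path/to/page.html">12 34/1234A/BC DEFG</a>';

console.log(padSlashes(sample));
// → <a href="http://www.domain.com/path/to/page.html">12 34 / 1234A / BC DEFG</a>
// Slashes in the href are followed by ">" before any "<", so the lookahead
// fails there; the slash in "</a>" is also followed by ">" and is skipped.
```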

Note:

In these examples, the whole document must be loaded into the myHtml string.
If you process one line at a time, the regex will miss any / whose next tag starts on a later line, because the lookahead cannot see past the end of the current line.
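
The difference can be seen in a small JavaScript sketch; the input string and variable names are invented for illustration.

```javascript
const slashPattern = /(\s*\/\s*(?=[^<>]+<))/g;

// Link text split over two lines.
const multiLine = '<a href="/p">12/34\n56</a>';

// Whole document in one string: [^<>]+ can cross the newline and reach "<".
const wholeDoc = multiLine.replace(slashPattern, " / ");

// One line at a time: on the first line nothing follows the slash except
// "34", so the lookahead never finds "<" and the slash is left alone.
const perLine = multiLine
  .split("\n")
  .map((line) => line.replace(slashPattern, " / "))
  .join("\n");

console.log(JSON.stringify(wholeDoc)); // → "<a href=\"/p\">12 / 34\n56</a>"
console.log(perLine === multiLine);    // → true (unchanged)
```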

Renaud Bompuis
This works perfectly, thanks! Still trying to wrap my head round it though! ;)
Block
The Regex has a positive lookahead so that it only matches a / that is followed by an opening tag bracket. If the / is in a URL, it won't match, because there the next bracket is a closing tag bracket.
Renaud Bompuis
It will not work if the closing tag is on a different line though. It may or may not be a problem, but at least you should document it.
mirod
@mirod: not sure what you mean, because in my tests it works regardless of how many newlines are in the input string, even if they split the text within the tags, or if they split the content of the HTML tags themselves.
Renaud Bompuis
It doesn't work when you use the regexp to loop over lines (as in perl -p).
mirod
Of course it can't work if you're working on a single line at a time. It will work if, as in my examples, you work on the whole document at once. I thought it was obvious, but thanks for clearing that up.
Renaud Bompuis
Added note to clarify usage.
Renaud Bompuis
@mirod, for what I'm doing (formatting HTML in a text editor), Renaud's regex works well even across line breaks. Note that I have the multiline flag enabled.
Block
OK, one last thing that can go wrong: > is allowed in the text, so you could (at least theoretically!) have the text of a link that's '3x/2 > 4', in which case the / would not be expanded.
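
To make this limitation concrete, here is a small JavaScript check. The sample strings and variable names are invented for illustration.

```javascript
const slashBeforeTagOpen = /(\s*\/\s*(?=[^<>]+<))/g;

// Literal ">" in the link text: the lookahead hits ">" before "<" and fails.
const rawGt = '<a href="x.html">3x/2 > 4</a>'.replace(slashBeforeTagOpen, " / ");

// Escaped "&gt;": no bracket sits between the slash and "</a>", so it matches.
const escapedGt = '<a href="x.html">3x/2 &gt; 4</a>'.replace(slashBeforeTagOpen, " / ");

console.log(rawGt);     // → <a href="x.html">3x/2 > 4</a>
console.log(escapedGt); // → <a href="x.html">3x / 2 &gt; 4</a>
```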
mirod
> in text should be escaped to &sup; otherwise it's not HTML, even though browsers are in general tolerant of it. If the HTML is malformed, then there is no simple answer to the question, because you can always construct HTML that would fail a simple regex; even most parsers expect some compliance.
Renaud Bompuis
No, > is allowed both in XML and in HTML. Neither tidy nor xmlwf complains about it, even though they escape it by default.
mirod
I meant &gt;, not &sup;. The HTML spec says that you should avoid use of < and > literals to avoid confusion in text fragments. http://www.w3.org/TR/html401/charset.html#entities Anyway, it's beside the point. I never said my Regex could pass all tests, it simply answers the original question.
Renaud Bompuis
I understand that in practice, in the specific case of reformatting HTML in a text editor, the regexp probably works well enough. I just want to avoid someone with a similar problem in a different context being bitten by its limitations. (and _should_ in the W3C TR means that > is indeed allowed)
mirod
A: 

I think we're lacking a bit of context here. Is the data HTML, XML, or just fragments of text with tags?

If it is HTML or XML, as mentioned often, regexps are not safe, unless you control exactly the format of the data, and you know that you will always control it. And you document it.

I would use an appropriate parser if I were you. If you have Perl and XML::Twig installed, the following one-liner will do:

perl -MXML::Twig -e'XML::Twig->parse( keep_spaces => 1, "my_file.xml")->subs_text( "/", " / ")->print'

If you're dealing with well-formed XML with no comments and no CDATA sections, then a more efficient way would be to use PYX (you need to install XML::PYX):

pyx my_file.xml | perl -p -e's{/}{ / }g if m{^-}' | pyxw
mirod
Thanks for the tip on using TWIG!
Block
No problem, considering I wrote XML::Twig, it might even be considered a shameless plug ;--)
mirod
A: 

If you need to, you could try using a regex to extract the text between two tags, and then process that, and then re-insert it, but this task is probably more complicated than a single regex due to your constraints.
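
A minimal JavaScript sketch of that extract, process and re-insert idea follows. The helper name padSlashesInLinks is hypothetical, and it assumes each link contains plain text only (no nested tags) and that attribute values contain no ">" character.

```javascript
// Hypothetical helper: pads "/" only in the text of <a> elements.
// Assumes link text has no nested tags and attributes contain no ">".
function padSlashesInLinks(html) {
  return html.replace(
    // Group 1: opening <a ...> tag, group 2: link text, group 3: closing tag.
    /(<a\b[^>]*>)([^<]*)(<\/a>)/gi,
    (match, openTag, text, closeTag) =>
      // Only the captured text is rewritten; the tags are re-inserted as-is.
      openTag + text.replace(/\s*\/\s*/g, " / ") + closeTag
  );
}

console.log(padSlashesInLinks(
  '<a href="http://www.domain.com/path/to/page.html">12 34/1234A/BC DEFG</a>'
));
// → <a href="http://www.domain.com/path/to/page.html">12 34 / 1234A / BC DEFG</a>
```

Because the text is captured up to the next "<", a literal ">" inside the link text does not stop the padding, but text outside a elements is left alone.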

Here's something in Perl that works (but doesn't use regexes):

my (@a, $in_tag);
foreach(split //, $string) { # assuming $string holds our string
  $in_tag = 1 if $_ eq "<";
  $in_tag = 0 if $_ eq ">";
  if($_ eq "/" and not $in_tag) {
    push @a, " ", "/", " ";
  }
  else {
    push @a, $_;
  }
}
$string = join "", @a;

This, however, is not a regex, but a very simple parser.

Chris Lutz