I have a bunch of html with lines like this:

<a href="#" rel="this is a test">

I need to replace the spaces in the rel-attribute with underscores, but I'm sort of a regex-noob!

I'm using TextMate.

Can anyone help me?

/Jakob

A: 

Suppose you have already looked up the element that carries the rel attribute:

var el = document.getElementById(id);  // id: the element's id, assumed known
var rel = el.getAttribute("rel").replace(/\s/g, "_");  // every whitespace character becomes _
el.setAttribute("rel", rel);
Artem Barger
He is using TextMate, which is a text editor.
Norbert Hartl
Oops, somehow I missed that. o_O
Artem Barger
A: 

I don't think you can do this properly. Though I wonder why you need to do it in one go?

I can think of a really poor way of doing it, and although I don't recommend it, here goes:

You could sort of do it with the regex below. However, you would have to repeat the capture in the search, and the matching \N_ in the replacement, once for every space that could appear in the rel value. I suspect that requirement rules this solution out.

Search:

{\<a *href\=\"[^\"]*" *rel\=\"}{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*

Replace:

\1\2_\3_\4_\5_\6_\7_\8_

This approach has two downsides: first, TextMate may limit the number of captures you can have; second, you'll end up with a run of extra _'s at the end of each line.

With your current test, with the regex above, you would end up with:

<a href="#" rel="this_is_a_test">____

PS: This regex is in the format used by the Visual Studio search/replace box. You'll probably need to change some characters to make it work in TextMate.

  {} => capturing group

  () => grouping

  [^A] => anything but A

  ( |\")* => any number of spaces or "

  \1 => the first capture
Rune Sundling
Hey thanks! You gave me something to think about. You're absolutely right. I don't need to do it in one go. I found a way to match the first space, although it looks a bit like a joke:
(?<=rel="[\w+][\w+][\w+][\w+])\s+
(-:
Anyway, then I get:
<a href="#" rel="this_is a test">
I'm thinking I should be able to run the search/replace a few times until it stops getting matches. Basically replacing the spaces one at a time:
<a href="#" rel="this_is a test">
<a href="#" rel="this_is_a test">
<a href="#" rel="this_is_a_test">
Q's:
How do I avoid the repeated [\w+]?
Will it match the _'s?
Wow, the comment ate my newlines... Hope it's still readable!
In Visual Studio syntax, this would work as you describe: Search: {\<a *href\=\"[^\"]*" *rel\=\"([^ ]*|[^\"]*)} Replace: \1_ (note that there is a space after the last visible character in the search regex, to match a space)
Rune Sundling
But yes, \w will match _.
Rune Sundling
Thanks! But I get this result for the first run: rel="this _is a test" The space character is matched and inserted in the replacement string. It should be easy to remove the spaces afterwards, but the problem means I keep targeting the same location for insertion: rel="this _________is a test"
Sounds like you put the space inside the capturing group rather than outside it. It should be }space and not space}.
Rune Sundling
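
The one-space-at-a-time idea from the comments above can be sketched in Python. This is only an illustration under a few assumptions: it uses Python's re syntax rather than TextMate's or Visual Studio's, the names FIRST_SPACE and underscore_rel_iteratively are made up for the example, and rel values are assumed to be double-quoted. A capturing group is used instead of the lookbehind, which sidesteps the repeated [\w+].

```python
import re

# rel=" followed by the leading run of characters that are neither a space
# nor a quote, and then exactly one space. The space sits OUTSIDE the group,
# so it is dropped when the group is written back with an underscore.
FIRST_SPACE = re.compile(r'(rel="[^" ]*) ')


def underscore_rel_iteratively(text):
    """Replace one space per rel value per pass until nothing matches.

    Returns the rewritten text and the number of passes that changed it.
    """
    passes = 0
    while True:
        text, count = FIRST_SPACE.subn(r'\1_', text)
        if count == 0:
            return text, passes
        passes += 1


if __name__ == "__main__":
    result, passes = underscore_rel_iteratively('<a href="#" rel="this is a test">')
    print(result)
    print(passes)
```

Because [^" ]* cannot cross the closing quote, spaces after the rel value (for example before a following attribute) are left alone, and the loop stops on its own once a pass makes no replacement.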
A: 

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

Chas. Owens
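
As a rough idea of what the parser route might look like, here is a minimal sketch using html.parser from Python's standard library. It is not taken from the linked questions; the class and function names (RelUnderscorer, underscore_rel_with_parser) are hypothetical. Note that it re-emits the markup rather than editing it in place, so tag names are lowercased, attribute quoting is normalised to double quotes, and processing instructions are not handled.

```python
from html import escape
from html.parser import HTMLParser


class RelUnderscorer(HTMLParser):
    """Re-emit HTML, replacing spaces in every rel value with underscores."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # valueless attribute, e.g. <input disabled>
                parts.append(name)
                continue
            if name == "rel":
                value = value.replace(" ", "_")
            # The parser hands over unescaped values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def underscore_rel_with_parser(html_text):
    parser = RelUnderscorer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(underscore_rel_with_parser('<a href="#" rel="this is a test">link</a>'))
```

The attraction of this route is that only attributes actually named rel are touched, whatever order the attributes come in and however the tag is spread across lines.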
A: 

I have to get on board the "you're using the wrong tool for the job" train here. You have TextMate, which means OS X, which means you have sed, awk, ruby and perl, all of which can do this much better and more easily.

Learning how to use one of these tools for text manipulation will pay off many times over in the future. Here is a URL that will ease you into sed: http://www.grymoire.com/Unix/Sed.html

Adam Luter
A: 

If you're using TextMate, then you're on a Mac, and therefore have Python.

Try this:

#!/usr/bin/env python3

import re

# Match each rel="..." attribute; group 1 is the value between the quotes.
p_rel = re.compile(r'rel="([^"]*)"')

def underscore_rel(match):
    # Rebuild the attribute with the spaces in its value turned into underscores.
    return 'rel="' + match.group(1).replace(' ', '_') + '"'

with open('test.html') as infile:
    for line in infile:
        # Only the rel values are rewritten; the rest of the line is untouched.
        print(p_rel.sub(underscore_rel, line), end='')

Sample output:

 $ cat test.html
testing, testing, 1, 2, 3
<a href="#" rel="this is a test">
<unrelated line>
Stuff
<a href="#" rel="this is not a test">
<a href="#" rel="this is not a test" rel="this is invalid syntax (two rels)">
aoseuaoeua

 $ ./test.py
testing, testing, 1, 2, 3
<a href="#" rel="this_is_a_test">
<unrelated line>
Stuff
<a href="#" rel="this_is_not_a_test">
<a href="#" rel="this_is_not_a_test" rel="this_is_invalid_syntax_(two_rels)">
aoseuaoeua
ShawnMilo