We've got a large amount of static HTML that has links like, e.g.:

<a href="link.html#glossary">Link</a>

However, some of them contain spaces in the anchor, e.g.:

 <a href="link.html#this is the glossary">Link</a>

Any ideas on what kind of regular expression I'd need to use to find the spaces after the # and replace them with a - or _?

Update: I just need to find them using TextMate, hence no need for an HTML parsing lib.

+2  A: 

Have you considered using an HTML parsing library like BeautifulSoup? It would make finding all the hrefs much easier!
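To give a rough idea of the parser route, here is a minimal sketch. Note that it swaps BeautifulSoup for Python's standard-library html.parser so it runs without installing anything; the class and function names (SpacedFragmentFinder, find_spaced_fragments) are made up for illustration.

```python
from html.parser import HTMLParser


class SpacedFragmentFinder(HTMLParser):
    """Collects href values of <a> tags whose #fragment contains whitespace."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and "#" in value:
                fragment = value.split("#", 1)[1]
                if any(ch.isspace() for ch in fragment):
                    self.hits.append(value)


def find_spaced_fragments(html):
    """Return every href in `html` whose anchor part contains whitespace."""
    finder = SpacedFragmentFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

Feeding it the two links from the question should report only the second one, since the first has no space after the #.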

ראובן
+1 — parse HTML with an HTML parser, not regular expressions.
David Dorward
Ah yes, I should have mentioned that I just need to find them all within TextMate; I've updated my question.
Tom
+2  A: 

This regex should do it:

#[a-zA-Z]+\s+[a-zA-Z\s]+

Three caveats:

First, if you are afraid that the page text itself (and not just the links) might contain information like "#hashtag more words", then you could make the regex more restrictive, like this:

#[a-zA-Z]+\s+[a-zA-Z\s]+\">

Second, if your anchors contain characters beyond A-Z, just add them to the second set of brackets. So, if you have '-' as well, you would modify it as follows (the hyphen goes last in the brackets so that it is read as a literal character rather than as part of a range):

#[a-zA-Z]+\s+[a-zA-Z\s-]+\">

Finally, this assumes that every anchor you are trying to match starts with a word made of letters followed by a space, so, in its current form, it would not match "Anchor-tags-galore" but would match "Anchor tags galore".
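If you want to sanity-check the caveats outside TextMate, a small sketch with Python's re module could look like this. The constant names and sample strings are made up for illustration; the `\"` from the patterns above is written as a plain `"` here, and the hyphen sits last in the character class, which matches the same characters.

```python
import re

# The three variants from this answer, as Python raw strings.
BASIC = re.compile(r'#[a-zA-Z]+\s+[a-zA-Z\s]+')
IN_ATTRIBUTE = re.compile(r'#[a-zA-Z]+\s+[a-zA-Z\s]+">')
WITH_HYPHEN = re.compile(r'#[a-zA-Z]+\s+[a-zA-Z\s-]+">')

clean_link = '<a href="link.html#glossary">Link</a>'
spaced_link = '<a href="link.html#this is the glossary">Link</a>'
body_text = 'See #hashtag more words in the text'
mixed_link = '<a href="link.html#Anchor tags-galore">Link</a>'
hyphen_only = '<a href="link.html#Anchor-tags-galore">Link</a>'

# An anchor without spaces is left alone; one with spaces is found.
print(BASIC.search(clean_link))            # None
print(BASIC.search(spaced_link).group())   # #this is the glossary

# Caveat 1: BASIC also fires on ordinary page text, IN_ATTRIBUTE does not.
print(BASIC.search(body_text) is not None)   # True
print(IN_ATTRIBUTE.search(body_text))        # None

# Caveat 2: a hyphen after the first space needs the extended class.
print(IN_ATTRIBUTE.search(mixed_link))              # None
print(WITH_HYPHEN.search(mixed_link) is not None)   # True

# Caveat 3: nothing matches unless the first word is followed by a space.
print(WITH_HYPHEN.search(hyphen_only))       # None
```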

Mark Hammonds
Thanks muchly, the links only contain A-Z so one of these is bound to do the trick :)
Tom
+1  A: 

Here, this regex matches the hash and all the words and spaces in between:

#(\w+\s)+\w+

http://dl.getdropbox.com/u/5912/Jing/2009-08-12_1651.png
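For the "replace them with a - or _" part, here is a rough sketch of how the same pattern could drive the replacement if you ever script it in Python rather than TextMate. The function name hyphenate_fragments and its replacement parameter are made up for illustration. Keep in mind the pattern is not tied to href attributes, so it would also rewrite something like "#hashtag more words" in ordinary page text.

```python
import re

# Same pattern as above: a hash, then words separated by single whitespace characters.
FRAGMENT_WITH_SPACES = re.compile(r'#(\w+\s)+\w+')


def hyphenate_fragments(text, replacement="-"):
    """Replace each whitespace character inside a matched #fragment."""

    def fix(match):
        # Only the matched fragment is touched; the rest of the text stays as-is.
        return re.sub(r'\s', replacement, match.group(0))

    return FRAGMENT_WITH_SPACES.sub(fix, text)


print(hyphenate_fragments('<a href="link.html#this is the glossary">Link</a>'))
# <a href="link.html#this-is-the-glossary">Link</a>
```

Passing replacement="_" gives underscores instead, and links without spaces in the anchor come back unchanged.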

When you have some time, you should download "The Regex Coach", which is an awesome tool to develop your own regexes. You get instant feedback and you learn very fast. Plus it comes at no cost!

Visit the homepage

Sebastian Hoitz
Looks awesome, but there isn't a mac version :(
Tom
Maybe you can try this one: http://www.rustyrazorblade.com/2007/12/02/regex-coach-mac-substitute/
Sebastian Hoitz