Ok, so I know this question has been asked in different forms several times, but I am having trouble with specific syntax. I have a large string which contains html snippets. I need to find every link tag that does not already have a target= attribute (so that I can add one as needed).

^((?!target).)* will give me the text leading up to 'target', and <a.+?>[\w\W]+?</a> will give me a link, but that's where I'm stuck. An example:

<a href="http://www.someSite.com">Link</a> (this should be a match)
<a href="SomeLink.whatever" target="_blank">Link</a> (this should not be a match)

Any suggestions? Using DOM or XPath is not really an option, since this snippet is not well-formed HTML.
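
To make the starting point concrete, here is a small sketch of what the two patterns in the question match on their own. It is written in JavaScript purely for illustration (the thread itself is about PHP/Java regex syntax), and the variable names are made up for this example.

```javascript
// Sample inputs modelled on the two links in the question.
const withTarget = '<a href="SomeLink.whatever" target="_blank">Link</a>';
const withoutTarget = '<a href="http://www.someSite.com">Link</a>';

// Consumes one character at a time, but only while "target" does not start there.
const upToTarget = /^((?!target).)*/;
// Matches any link, whether or not it has a target attribute.
const anyLink = /<a.+?>[\w\W]+?<\/a>/;

// Stops right before the word "target".
const prefix = withTarget.match(upToTarget)[0];
// No "target" anywhere, so the whole string is consumed.
const wholeLine = withoutTarget.match(upToTarget)[0];
// The link pattern alone cannot tell the two cases apart.
const linkMatchesBoth = anyLink.test(withTarget) && anyLink.test(withoutTarget);

console.log(JSON.stringify(prefix)); // "<a href=\"SomeLink.whatever\" "
console.log(wholeLine === withoutTarget, linkMatchesBoth); // true true
```

Neither pattern excludes anything by itself: the first only says where 'target' begins, and the second accepts every link.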

A: 

If you insist on doing it with Regex a pattern such as this should help...

<a(?![^>]*target=) [^>]*>.*?</a>

It's by no means 100% perfect: technically speaking, a tag can contain a > in places other than the end, so it won't work for all HTML tags.

NB: I work with PHP, so you may have to make slight syntax adjustments for Java.
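
As a rough check of how this pattern behaves, here is a sketch in JavaScript (not the PHP or Java discussed here, so treat the syntax as an adaptation). The `addTarget` helper and its default `"_blank"` value are assumptions made for the example, not part of the answer.

```javascript
// The pattern from the answer: an <a ...> element whose opening tag has no "target=".
const linkWithoutTarget = /<a(?![^>]*target=) [^>]*>.*?<\/a>/g;

const html = [
  '<a href="http://www.someSite.com">Link</a>',
  '<a href="SomeLink.whatever" target="_blank">Link</a>',
].join("\n");

// Only the first link is reported, because the lookahead rejects the second.
const matches = html.match(linkWithoutTarget) || [];

// Hypothetical helper: reuse the same lookahead to append a target attribute
// to every opening <a ...> tag that lacks one.
function addTarget(input, target = "_blank") {
  return input.replace(/<a(?![^>]*target=)( [^>]*)>/g, `<a$1 target="${target}">`);
}

console.log(matches); // [ '<a href="http://www.someSite.com">Link</a>' ]
console.log(addTarget(html));
```

The lookahead is bounded by `[^>]*`, so it only inspects the current opening tag. That is also why a `>` inside an attribute value would confuse it, as noted above.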

Cags
That works perfectly! Thanks for your help.
Quad64Bit
A: 

You could try a negative lookahead like this: <a(?!.*?target.*?).*?>[\w\W]+?</a>
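
One thing worth checking with a lookahead like this is how far it is allowed to scan. The sketch below, in JavaScript for illustration, compares a lookahead that uses `.*?` with a variant restricted to `[^>]*`. The `bounded` variant and the sample line are assumptions added for the comparison.

```javascript
// Lookahead written with ".*?": it may look past the end of the current tag.
const unbounded = /<a(?!.*?target.*?).*?>[\w\W]+?<\/a>/g;
// Variant (an assumption for comparison): the lookahead stops at the tag's ">".
const bounded = /<a(?![^>]*target)[^>]*>[\w\W]+?<\/a>/g;

// Two links on the same line; only the second one has a target attribute.
const sameLine = '<a href="one.html">One</a> <a href="two.html" target="_blank">Two</a>';

const unboundedMatches = sameLine.match(unbounded) || [];
const boundedMatches = sameLine.match(bounded) || [];

console.log(unboundedMatches); // []
console.log(boundedMatches);   // [ '<a href="one.html">One</a>' ]
```

With one link per line, both versions agree. The difference only shows up when a later tag on the same line contains 'target'.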

burningstar4
+2  A: 

You are being wilfully evil by trying to parse HTML with Regexes. Don't.

That said, you are being extra evil by trying to do everything in one regexp. There is no need for that; it makes your code regex-engine-dependent, unreadable, and quite possibly slow. Instead, simply match tags and then check your first-stage hits again with the trivial regex /target=/. Of course, that character string might occur elsewhere in an HTML tag, but see (1)... you have already thrown good practice out of the window, so why not at least make things un-obfuscated so everyone can see what you're doing?
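
A minimal sketch of that two-stage idea, in JavaScript for illustration. The function names, the `<a\s[^>]*>` first-stage pattern, and the `"_blank"` default are assumptions for this example, not something prescribed above.

```javascript
// Stage 1: collect every opening <a ...> tag.
// Stage 2: keep only the hits that fail the trivial /target=/ check.
function findOpeningTagsWithoutTarget(html) {
  const openingTags = html.match(/<a\s[^>]*>/gi) || [];
  return openingTags.filter((tag) => !/target=/i.test(tag));
}

// Same two stages, used to rewrite: tags that already have a target are
// returned unchanged, the rest get one appended before the closing ">".
function addMissingTargets(html, target = "_blank") {
  return html.replace(/<a\s[^>]*>/gi, (tag) =>
    /target=/i.test(tag) ? tag : `${tag.slice(0, -1)} target="${target}">`
  );
}

const sample =
  '<a href="http://www.someSite.com">Link</a>\n' +
  '<a href="SomeLink.whatever" target="_blank">Link</a>';

console.log(findOpeningTagsWithoutTarget(sample)); // [ '<a href="http://www.someSite.com">' ]
console.log(addMissingTargets(sample));
```

Each regex stays simple enough to read at a glance, and neither relies on lookahead support in the engine.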

Kilian Foth
OK, the purpose of this post was to look for a way to write a match that takes exclusions into account. This has many applications. I am not parsing HTML with regex; I have already done that with XPath and DOM. I am looking to add something simple to several lines. If the only solution is to do it with a nasty multi-tiered match, then I will do it. I was hoping someone could answer my real question, which had to do with the exclusion itself. Apparently regex has no such ability (and it should). What a pain to have to do nested inverse matches.
Quad64Bit
A: 

I didn't test this and spent about a minute writing it, but for your specific example if you can do it on the client-side, try this via the DOM:

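var links = document.getElementsByTagName("a");

// Declare the loop counter with var so it does not leak as an implicit global.
for (var linkIndex = 0; linkIndex < links.length; linkIndex++) {
    var link = links[linkIndex];

    // Only touch real links (those with an href) that have no target yet.
    if (link.href && !link.target) {
        link.target = "someTarget";
        // or link.setAttribute("target", "someTarget");
    }
}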
nickyt
You can also do this via jQuery, but I thought it would be better to use plain old JS in case you weren't using jQuery.
nickyt
Ok, I'll give that a shot too. I was looking for a way to do this in DOM, perhaps this will work. Thanks!
Quad64Bit