views:

129

answers:

5

How can I match HTML "a" tags, but only the ones whose href doesn't start with http, using a regular expression?

i.e. match:

blahblah... <a href="something"> ...blahblah

but not

blahblah... <a href="http://something"> ...blahblah
A: 
var html = 'Some text with a <a href="http://example.com/">link</a> and an <a href="#anchor">anchor</a>.';
var re = /<a href="(?!http:\/\/)[^"]*">/i;
var match = html.match(re);
// match contains <a href="#anchor">

Note: this won't work if the tag has additional attributes, since the pattern expects `href` to be the only attribute.
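
If additional attributes are a concern, the pattern can be loosened a little. The following is only a rough sketch, not a robust HTML parser: it assumes double-quoted `href` values and no `>` inside any attribute value, and the sample markup is made up for illustration.

```javascript
// Sketch: find <a> tags whose href does not start with "http://",
// even when other attributes appear before or after href.
var html = 'A <a class="ext" href="http://example.com/">link</a>, ' +
           'an <a id="top" href="#anchor" title="Jump">anchor</a> and ' +
           'a <a href="http.html">page</a>.';

// [^>]*? skips any attributes before href; the trailing [^>]* skips any after it.
// The \s before href keeps attributes such as data-href from matching.
var re = /<a\b[^>]*?\shref="(?!http:\/\/)([^"]*)"[^>]*>/gi;

var hrefs = [];
var m;
while ((m = re.exec(html)) !== null) {
    hrefs.push(m[1]); // capture group 1 is the href value
}
console.log(hrefs); // ["#anchor", "http.html"]
```

Single-quoted or unquoted attribute values would still slip through, which is one reason the parser-based answers are safer.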

Lekensteyn
Won't work for `<a href="http.html">` or `<a href="http:foo.html">` (yes, `http:...` does not explicitly imply the HTTP protocol, as all browsers will ignore the "http:" part if there aren't two slashes; `http:/` is equivalent to `/`).
Eli Grey
Updated it to match `http://` literally. Note that browsers (at least Firefox) expand `//example.com/` to `http://example.com/` (or https, depending on the current protocol).
Lekensteyn
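
The edge cases raised in these comments can be checked with the standard `URL` constructor, which resolves an href against a base address the way a browser would. This is a small sketch; the base URL is a made-up page address, and the `http:` cases only resolve as relative because the base uses the same `http` scheme.

```javascript
// Sketch: resolve the hrefs discussed above against a hypothetical page URL.
// The URL constructor is available in browsers and in Node.js.
var base = "http://example.com/dir/page.html"; // assumed page address

function resolve(href) {
    return new URL(href, base).href;
}

console.log(resolve("http.html"));        // http://example.com/dir/http.html
console.log(resolve("http:foo.html"));    // http://example.com/dir/foo.html
console.log(resolve("http:/foo.html"));   // http://example.com/foo.html
console.log(resolve("//other.example/")); // http://other.example/
```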
+6  A: 

It's easier to use a DOMParser and XPath than a regex.

See my response in jsfiddle.

HTML

<body>
    <div>
        <a href='index.php'>1. index</a>
        <a href='http://www.bar.com'>2. bar</a>
        <a href='http://www.foo.com'>3. foo</a>
        <a href='hello.php'>4. hello</a>        
    </div>
</body>

JS

$(document).ready(function() {
    var type = XPathResult.ANY_TYPE;
    var page = $("body").html();
    var doc = new DOMParser().parseFromString(page, "text/xml");
    var xpath = "//a[not(starts-with(@href,'http://'))]";
    var result = doc.evaluate(xpath, doc, null, type, null);

    var node = result.iterateNext();
    while (node) {
        console.log(node); // returns links 1 and 4
        node  = result.iterateNext();        
    }

});

NOTES

  1. I'm using jQuery to keep the code short, but you can do it without jQuery.
  2. This code must be adapted to work with IE (I've only tested it in Firefox).
Topera
If you use jQuery, then you might as well use `$("a:not([href^='http://'])")`, which works in IE.
Peter Ajtai
+4  A: 

You should use an XML parser instead of regexes.


On the same topic:

Colin Hebert
+2  A: 

With jQuery, you can do something very simple:

var links_that_doesnt_start_with_http = $("a:not([href^='http://'])");

edit: Added the ://

Nicolas Viennot
+1 for an alternative that may do what the OP wants (they were quite vague as to the purpose).
Blair McMillan
`<a href="http.html">Nope.</a>`
Eli Grey
@Eli The `://` part can be added in easily - the technique is essentially correct.
Yi Jiang
A: 

I'm interpreting your question to mean any (mostly) absolute URI with a protocol, and not just HTTP. To add to everyone else's incorrect solutions: you should be doing this check on the href:

if (href.slice(0, 2) !== "//" && !/^[\w-]+:\/\//.test(href)) {
    // href is a relative URI without http://
}
Eli Grey
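
For illustration, the check above can be wrapped in a helper and run over a few sample hrefs. This is just a sketch: the function name `isRelativeHref` and the sample values are made up here, not part of the original answer.

```javascript
// Sketch: the same test as above, wrapped in a hypothetical helper.
// Returns true when href is neither protocol-relative ("//host/...")
// nor prefixed with a "scheme://" part.
function isRelativeHref(href) {
    return href.slice(0, 2) !== "//" && !/^[\w-]+:\/\//.test(href);
}

var samples = [
    "index.php",
    "#anchor",
    "http.html",
    "http://www.foo.com",
    "https://www.foo.com",
    "ftp://files.example/",
    "//example.com/"
];

var relative = samples.filter(isRelativeHref);
console.log(relative); // ["index.php", "#anchor", "http.html"]
```

Note that schemes written without `//`, such as `mailto:`, are not caught by this pattern and would need a separate test if they matter.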