Hello, I have a paragraph of text in a JavaScript variable called 'input_content', and that text contains multiple anchor tags/links. I would like to match all of the anchor tags, extract each one's anchor text and URL, and put them into an array like (or similar to) this:

Array
(
    [0] => Array
        (
            [0] => <a href="http://yahoo.com">Yahoo</a>
            [1] => http://yahoo.com
            [2] => Yahoo
        )
    [1] => Array
        (
            [0] => <a href="http://google.com">Google</a>
            [1] => http://google.com
            [2] => Google
        )
)

I've taken a crack at it (http://pastie.org/339755), but I am stumped beyond this point. Thanks for the help!

+11  A: 
var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

This assumes that your anchors will always be in the form <a href="...">...</a> i.e. it won't work if there are any other attributes (for example, target). The regular expression can be improved to accommodate this.

To break down the regular expression:

/ -> start regular expression
  [^<]* -> skip all characters until the first <
  ( -> start capturing first token
    <a href=" -> capture first bit of anchor
    ( -> start capturing second token
        [^"]+ -> capture all characters until a "
    ) -> end capturing second token
    "> -> capture more of the anchor
    ( -> start capturing third token
        [^<]+ -> capture all characters until a <
    ) -> end capturing third token
    <\/a> -> capture last bit of anchor
  ) -> end capturing first token
/g -> end regular expression, add global flag to match all anchors in string

Each call to our anonymous function will receive three tokens as the second, third and fourth arguments, namely arguments[1], arguments[2], arguments[3]:

  • arguments[1] is the entire anchor
  • arguments[2] is the href part
  • arguments[3] is the text inside

We'll use a hack to push these three arguments as a new array into our main matches array. The arguments built-in variable is not a true JavaScript Array, so we'll have to apply the slice Array method on it to extract the items we want:

Array.prototype.slice.call(arguments, 1, 4)

This will extract items from arguments starting at index 1 and ending (not inclusive) at index 4.
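
To make the above concrete, here is a minimal sketch that records what the callback receives for a single anchor. The helper name inspectCallbackArguments is made up for illustration. It assumes the regex has no named groups, in which case the callback gets the full match, the three captured tokens, the match offset and the whole input string.

```javascript
// Hypothetical helper, for illustration only: records what the
// replace() callback receives for each anchor it matches.
function inspectCallbackArguments(text) {
    var seen = [];
    text.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
        seen.push({
            // full match + 3 tokens + offset + input string
            count: arguments.length,
            // arguments is array-like, but not a real Array
            isArray: Array.isArray(arguments),
            // borrow Array's slice to copy out items 1, 2 and 3
            tokens: Array.prototype.slice.call(arguments, 1, 4)
        });
    });
    return seen;
}

var seen = inspectCallbackArguments('see <a href="http://yahoo.com">Yahoo</a> now');
```

Note that arguments[0] (the full match) also includes the text skipped by [^<]*, which is why the anchor itself is captured separately as the first token.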

var input_content = "blah \
    <a href=\"http://yahoo.com\">Yahoo</a> \
    blah \
    <a href=\"http://google.com\">Google</a> \
    blah";

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

alert(matches.join("\n"));

Gives:

<a href="http://yahoo.com">Yahoo</a>,http://yahoo.com,Yahoo
<a href="http://google.com">Google</a>,http://google.com,Google
Ates Goral
Don't necessarily agree that regex is best for this, but upvote for taking the time to put out the good explanation and what to do once you have your matches.
Joel Coehoorn
Many thanks for providing all the great detail on the regex. Huge help to understand how () works to capture tokens.
chipotle_warrior
I agree that regex is probably overkill for this. And in retrospect, I could have set up the loop to use a `while (tokens = regex.exec(input_content))` instead of the `replace` hack.
Ates Goral
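
For reference, the exec loop mentioned in the comment above might look roughly like the sketch below. The function name extractAnchors is hypothetical, and the looser pattern is an assumption rather than part of the original answer: it tolerates extra attributes such as target, but still expects a double-quoted href and link text with no nested markup.

```javascript
// Sketch only: extractAnchors is a hypothetical name, and the pattern
// is an assumed loosening of the one in the answer.
function extractAnchors(text) {
    // <a, optional other attributes, href="...", optional other
    // attributes, >, link text without "<", then </a>
    var pattern = /<a\s(?:[^>]*?\s)?href="([^"]*)"[^>]*>([^<]*)<\/a>/gi;
    var matches = [];
    var tokens;

    // With the g flag, each exec() call resumes after the previous
    // match and returns null once there are no more matches.
    while ((tokens = pattern.exec(text)) !== null) {
        matches.push([tokens[0], tokens[1], tokens[2]]);
    }
    return matches;
}

var found = extractAnchors(
    'blah <a href="http://yahoo.com">Yahoo</a> blah ' +
    '<a target="_blank" href="http://google.com" class="x">Google</a> blah'
);
```

Each entry of found is [whole anchor, URL, text], the same shape as in the replace version.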
+5  A: 

Since you're presumably running the JavaScript in a web browser, regex seems like a bad idea for this. If the paragraph came from the page in the first place, get a handle for the container, call .getElementsByTagName() to get the anchors, and then extract the values you want that way.

If that's not possible then create a new HTML element object, assign your text to its .innerHTML property, and then call .getElementsByTagName().

Joel Coehoorn
+2  A: 

I think jQuery would be your best bet. This isn't the best script and I'm sure others can give something better. But this creates an array of exactly what you're looking for.

<script type="text/javascript">
    // From http://brandonaaron.net Thanks!
    jQuery.fn.outerHTML = function() {
        return $('<div>').append( this.eq(0).clone() ).html();
    };    

    var items = new Array();
    var i = 0;

    $(document).ready(function(){
        $("a").each(function(){
            items[i] = {el:$(this).outerHTML(),href:this.href,text:this.text};
            i++;
        });
    });

    function showItems(){
        alert(items);
    }

</script>
Thanks. Since I am using jQuery, this seems to work best for my purposes. However, in the JSON notation 'el:el' is returning the URL, not the markup of the anchor tag. Any idea why?
chipotle_warrior
el should be the actual element of the document. I was looking around to see how to get the outerHtml property but didn't see it anywhere. But with the element, el, you could add that back into the DOM if you wanted to.
...the actual element of the anchor tag...
Figured it out! http://brandonaaron.net has an outerHTML() add-on. Include that, update one line, and voilà: jQuery.fn.outerHTML = function() { return $('<div>').append( this.eq(0).clone() ).html(); }; items[i] = {el:$(this).outerHTML(),href:this.href,text:this.text};
I updated my answer. Didn't know I could do that. :)
+2  A: 

I think Joel has the right of it — regexes are notorious for playing poorly with markup, as there are simply too many possibilities to consider. Are there other attributes to the anchor tags? What order are they in? Is the separating whitespace always a single space? Seeing as you already have a browser's HTML parser available, best to put that to work instead.

function getLinks(html) {
    var container = document.createElement("p");
    container.innerHTML = html;

    var anchors = container.getElementsByTagName("a");
    var list = [];

    for (var i = 0; i < anchors.length; i++) {
        var href = anchors[i].href;
        var text = anchors[i].textContent;

        if (text === undefined) text = anchors[i].innerText;

        list.push(['<a href="' + href + '">' + text + '</a>', href, text]);
    }

    return list;
}

This will return an array like the one you describe regardless of how the links are stored. Note that you could change the function to work with a passed element instead of text by changing the parameter name to "container" and removing the first two lines. The textContent/innerText property gets the text displayed for the link, stripped of any markup (bold/italic/font/…). You could replace .textContent with .innerHTML and remove the inner if() statement if you want to preserve the markup.
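
As a rough sketch of the variant described above: the function below takes an element instead of text and can optionally keep the inner markup. The name getLinksFromElement and the keepMarkup flag are made up for illustration and are not part of the original function.

```javascript
// Sketch: getLinksFromElement and keepMarkup are hypothetical names
// for the element-based variant described in the paragraph above.
function getLinksFromElement(container, keepMarkup) {
    var anchors = container.getElementsByTagName("a");
    var list = [];

    for (var i = 0; i < anchors.length; i++) {
        var href = anchors[i].href;
        var text;

        if (keepMarkup) {
            // keep bold/italic/etc. inside the link
            text = anchors[i].innerHTML;
        } else {
            // plain text only, with the same fallback as the original
            text = anchors[i].textContent;
            if (text === undefined) text = anchors[i].innerText;
        }

        list.push(['<a href="' + href + '">' + text + '</a>', href, text]);
    }

    return list;
}
```

In a browser you would call it with a real element, for example getLinksFromElement(document.getElementById("content")), where the "content" id is just a placeholder for whatever container holds your paragraph.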

Ben Blank
A: 
enter code here:

alert(list);

lustness monster