I have a bunch of tweets that are returned as plain text, and I would like to go through them and wrap the appropriate parts in link tags based on RegEx matches.

As an example here is a tweet where I would like @Bundlehunt to become <a href="http://twitter.com/bundlehunt">@Bundlehunt</a> and the http://bundlehunt.com should become <a href="http://bundlehunt.com">http://bundlehunt.com</a>.

Sample Tweet:

joined @BundleHunt for a chance to win the 2010 Mega Bundle! 
http://bundlehunt.com * Only 10 Days Left!

Sounds simple enough, I thought, so I used the excellent http://www.gskinner.com/RegExr/ tool to find the following two RegEx patterns that match those things in my tweets:

@twittername = /@(\w.+?)(?=\s)/gi
@links = /http:\/\/(.*)\.([a-zA-Z\.]){2,3}/gi

Now back in my jQuery document I am trying to go through the text and match the RegEx but that's where I get lost…

How do I actually go about matching the plain text, wrapping each match in an anchor tag, and inserting the matched text into that tag?

Thanks for reading,

Jannis

A: 

If you were to use jQuery's .html() method on untrusted input, your web application would be vulnerable to a cross-site scripting (XSS) attack, which would be exploitable by posting a malicious tweet. The best way to avoid this security problem is to append each part of the tweet individually, using the correct jQuery functions that use the web browser's DOM functions to HTML-escape strings.
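To make the risk concrete, here is a minimal sketch of what HTML-escaping does to a malicious tweet. The escapeHtml function is a hypothetical helper written only for illustration; it is not part of jQuery. In a real page, .text() gives the same protection through the DOM, because the string is treated as literal characters rather than parsed as markup.

```javascript
// Hypothetical helper for illustration only; not a jQuery function.
function escapeHtml(text) {
    var replacements = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;'
    };
    // Replace every HTML-significant character with its entity
    return String(text).replace(/[&<>"']/g, function (ch) {
        return replacements[ch];
    });
}

// The tag is split so that this snippet can itself sit inside a <script> element
var maliciousTweet = '<scr' + 'ipt>alert(document.cookie)</scr' + 'ipt>';
var escapedTweet = escapeHtml(maliciousTweet);
// escapedTweet contains no literal "<" or ">", so a browser would show it as text
```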

  1. First, combine the two regexes into one using regex alternation (the | symbol). For the purposes of my example code, the Twitter username regex is /@\w+/gi and the URL regex is /(?:https?|ftp):\/\/.*?\..*?(?=\W?\s)/gi. These regexes are not the same as those in the original question; the original URL regex did not seem to work correctly, and we need not use capturing groups. The combined regex is therefore /@\w+|(?:https?|ftp):\/\/.*?\..*?(?=\W?\s)/gi.

  2. For each time the regex matches, securely add the text that comes before the match to the container. To do this in jQuery, create an empty "span" element and use the .text() method to insert text inside. Using $('text here') would leave an XSS hole wide open. What if the contents of a tweet are <script>alert(document.cookie)</script>?

  3. Check the first character of the match to determine how it is to be formatted. Twitter usernames begin with "@", but URLs cannot.

  4. Format the match and add it to the container. Again, do not pass untrusted input to the $ or jQuery function; use the .attr() method to add attributes such as href and the .text() method to add link text.

  5. After all matches have been processed, add the last plain text part of the tweet, which comes after the final match and so was not added in step 2.

Example code (also at http://jsfiddle.net/6X6xD/3/):

var tweet = 'joined @BundleHunt for a chance to win the 2010 Mega Bundle! http://bundlehunt.com * Only 10 Days Left! URL containing an at sign: http://www.last.fm/event/1196311+Live+@+Public+Assembly. This should not work: <scr'+'ipt>alert(document.cookie)</scr'+'ipt>';

var combinedRegex = /@\w+|(?:https?|ftp):\/\/.*?\..*?(?=\W?\s)/gi,
    container = $('#tweet-container');

var result, prevLastIndex = 0;
combinedRegex.lastIndex = 0;
while((result = combinedRegex.exec(tweet))) {
    // Append the text coming before the matched entity
    container.append($('<span/>').text(tweet.slice(prevLastIndex, result.index)));
    if(result[0].slice(0, 1) == "@") {
        // Twitter username was matched
        container.append($('<a/>')
            // .slice(1) cuts off the first character (i.e. "@")
            .attr('href', 'http://twitter.com/' + encodeURIComponent(result[0].slice(1)))
            .text(result[0])
        );
    } else {
        // URL was matched
        container.append($('<a/>')
            .attr('href', result[0])
            .text(result[0])
        );
    }
    // prevLastIndex will point to the next plain text character to be added
    prevLastIndex = combinedRegex.lastIndex;
}
// Append last plain text part of tweet
container.append($('<span/>').text(tweet.slice(prevLastIndex)));
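The matching loop can also be separated from the DOM work, which makes it easier to test. The following is a sketch under that assumption: splitTweet and the segment objects with type and value fields are hypothetical names introduced here, while the regex is the combined one from this answer.

```javascript
// splitTweet is a hypothetical helper; it only classifies pieces of the tweet
// and leaves the creation of <span> and <a> elements to the caller.
function splitTweet(tweet) {
    var regex = /@\w+|(?:https?|ftp):\/\/.*?\..*?(?=\W?\s)/gi;
    var segments = [], match, prevLastIndex = 0;
    while ((match = regex.exec(tweet))) {
        // Plain text between the previous match and this one
        if (match.index > prevLastIndex) {
            segments.push({ type: 'text', value: tweet.slice(prevLastIndex, match.index) });
        }
        // Usernames start with "@"; anything else the regex matched is a URL
        segments.push({
            type: match[0].charAt(0) === '@' ? 'mention' : 'url',
            value: match[0]
        });
        prevLastIndex = regex.lastIndex;
    }
    // Trailing plain text after the last match
    if (prevLastIndex < tweet.length) {
        segments.push({ type: 'text', value: tweet.slice(prevLastIndex) });
    }
    return segments;
}

var sampleTweet = 'joined @BundleHunt for a chance! http://bundlehunt.com * Only 10 Days Left!';
var segments = splitTweet(sampleTweet);
```

Each segment can then be appended with .text() (and .attr('href', ...) for links) exactly as in the loop above. One thing to be aware of: the lookahead in the URL part requires whitespace after the URL, so a URL at the very end of a tweet is returned as plain text unless the input ends with a space.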

Note: older versions of this answer did recommend using the .html() method. Because this is a serious security problem as mentioned above, I have used the edit button to post my new answer, removing the old one from view.

idealmachine
this is great. thank you very much!
Jannis
A: 

The easiest approach is to use the replace method of the String object:

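var TWITTER_NAME = /@(\w.+?)(?=\s)/gi,
    LINK = /http:\/\/(.*)\.([a-zA-Z\.]){2,3}/gi;

var string = "joined @BundleHunt for a chance to win the 2010 Mega Bundle! \n http://bundlehunt.com * Only 10 Days Left!";

// replace() returns a new string, so the result has to be assigned.
// Links are handled first so that LINK does not also match the
// twitter.com URLs that the username replacement inserts.
string = string.replace(LINK, "<a href=\"$&\">$&</a>");
string = string.replace(
    TWITTER_NAME,
    function (match, name) {
        // match is the whole match (e.g. "@BundleHunt"), name is the first capturing group
        return "<a href=\"http://www.twitter.com/" + name.toLowerCase() + "\">" + match + "</a>";
    }
);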

See here for documentation: https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/String/replace
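As a small illustration of the callback form of replace, here is a sketch that uses a simplified username pattern. MENTION and linkMentions are assumptions made for this example and are not the TWITTER_NAME regex above. Because the result is an HTML string, it is only appropriate for text you trust.

```javascript
// MENTION is a simplified, assumed pattern: "@" followed by word characters.
var MENTION = /@(\w+)/g;

// linkMentions is a hypothetical helper name used only in this sketch.
function linkMentions(text) {
    // With the g flag the callback runs once per match:
    // `match` is the full match, `name` is the first capturing group.
    return text.replace(MENTION, function (match, name) {
        return '<a href="http://www.twitter.com/' + name.toLowerCase() + '">' + match + '</a>';
    });
}

var linked = linkMentions('thanks @Foo and @Bar!');
```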


On a side note, both patterns carry the g flag, so a single replace call already handles every matching substring and no loop is needed. Be aware, though, that the greedy (.*) in LINK can merge several URLs on the same line into one match.

FK82