Is there a better way to format text from Twitter so that hyperlinks, usernames and hashtags become links? What I have works, but I know it could be done better, and I am interested in alternative techniques. I am setting this up as an HTML helper for ASP.NET MVC.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Web;
using System.Web.Mvc;

namespace Acme.Mvc.Extensions
{

    public static class MvcExtensions
    {
        const string ScreenNamePattern = @"@([A-Za-z0-9\-_&;]+)";
        const string HashTagPattern = @"#([A-Za-z0-9\-_&;]+)";
        const string HyperLinkPattern = @"(http://\S+)\s?";

        public static string TweetText(this HtmlHelper helper, string text)
        {
            return FormatTweetText(text);
        }

        public static string FormatTweetText(string text)
        {
            string result = text;

            if (result.Contains("http://"))
            {
                var links = new List<string>();
                foreach (Match match in Regex.Matches(result, HyperLinkPattern))
                {
                    var url = match.Groups[1].Value;
                    if (!links.Contains(url))
                    {
                        links.Add(url);
                        result = result.Replace(url, String.Format("<a href=\"{0}\">{0}</a>", url));
                    }
                }
            }

            if (result.Contains("@"))
            {
                var names = new List<string>();
                foreach (Match match in Regex.Matches(result, ScreenNamePattern))
                {
                    var screenName = match.Groups[1].Value;
                    if (!names.Contains(screenName))
                    {
                        names.Add(screenName);
                        result = result.Replace("@" + screenName,
                           String.Format("<a href=\"http://twitter.com/{0}\">@{0}</a>", screenName));
                    }
                }
            }

            if (result.Contains("#"))
            {
                var names = new List<string>();
                foreach (Match match in Regex.Matches(result, HashTagPattern))
                {
                    var hashTag = match.Groups[1].Value;
                    if (!names.Contains(hashTag))
                    {
                        names.Add(hashTag);
                        result = result.Replace("#" + hashTag,
                           String.Format("<a href=\"http://twitter.com/search?q={0}\">#{1}</a>",
                           HttpUtility.UrlEncode("#" + hashTag), hashTag));
                    }
                }
            }

            return result;
        }

    }

}
A: 

That is remarkably similar to the code I wrote to display my Twitter status on my blog. The only additional things I do are:

1) looking up @name and replacing it with <a href="http://twitter.com/name">Real Name</a>;

2) multiple @name's in a row get commas, if they don't have them;

3) Tweets that start with @name(s) are formatted "To @name:".
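
As a rough illustration of the three steps above, here is a minimal C# sketch. It is not the parser described in this answer: the class name, the `lookupRealName` delegate (standing in for whatever maps a screen name to a real name), and the simplified `[A-Za-z0-9_]` screen-name pattern are all assumptions, and it only handles @names at the very start of the tweet.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class TweetSalutation
{
    // One or more @names at the very start, separated by spaces and/or commas.
    private static readonly Regex LeadingNames =
        new Regex(@"^(?:@[A-Za-z0-9_]+[,\s]*)+");

    private static readonly Regex Name = new Regex(@"@([A-Za-z0-9_]+)");

    // lookupRealName is a hypothetical callback (API call, cache, dictionary...).
    public static string Format(string tweet, Func<string, string> lookupRealName)
    {
        Match lead = LeadingNames.Match(tweet);
        if (!lead.Success)
        {
            return tweet; // no leading @names, nothing to rewrite
        }

        // Step 1: turn each leading @name into a link showing the real name.
        var links = Name.Matches(lead.Value).Cast<Match>().Select(m =>
            string.Format("<a href=\"http://twitter.com/{0}\">{1}</a>",
                m.Groups[1].Value,
                WebUtility.HtmlEncode(lookupRealName(m.Groups[1].Value))));

        // Steps 2 and 3: join the names with commas and prefix them with "To".
        string rest = tweet.Substring(lead.Length).TrimStart();
        return "To " + string.Join(", ", links) + ": " + rest;
    }
}
```

Calling `TweetSalutation.Format("@user1 @user2 check this out", name => name)` would produce a "To ..., ...: check this out" string with one link per leading name. Names that appear later in the tweet would still need the ordinary @name replacement.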

I don't see any reason this can't be an effective way to parse a tweet: tweets follow a very consistent format (good for regex), and in most situations the speed (milliseconds) is more than acceptable.

Edit:

Here is the code for my Tweet parser. It's a bit too long to put in a Stack Overflow answer. It takes a tweet like:

@user1 @user2 check out this cool link I got from @user3: http://url.com/page.htm#anchor #coollinks

And turns it into:

<span class="salutation">
    To <a href="http://twitter.com/user1">Real Name</a>,
    <a href="http://twitter.com/user2">Real Name</a>:
</span> check out this cool link I got from
<span class="salutation">
    <a href="http://www.twitter.com/user3">Real Name</a>
</span>:
<a href="http://url.com/page.htm#anchor">http://url.com/...</a>
<a href="http://twitter.com/#search?q=%23coollinks">#coollinks</a>

It also wraps all that markup in a little JavaScript:

document.getElementById('twitter').innerHTML = '{markup}';

This lets the tweet fetcher run asynchronously as a script, so if Twitter is down or slow it won't affect my site's page load time.
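
One detail worth getting right when wrapping markup this way is escaping: the generated HTML contains quotes and angle brackets, so it has to be encoded as a JavaScript string literal first. A minimal sketch, assuming the `HttpUtility.JavaScriptStringEncode` method from `System.Web` (.NET 4 or later) is available; the class and method names here are made up for illustration and are not from the parser described above.

```csharp
using System.Web;

public static class TweetScript
{
    // Wraps already-generated tweet markup in a one-line script that
    // injects it into the element with the given id.
    public static string Wrap(string markup, string elementId = "twitter")
    {
        // Passing true adds the surrounding double quotes and escapes
        // quotes, backslashes, and line breaks inside the markup.
        string literal = HttpUtility.JavaScriptStringEncode(markup, true);
        string id = HttpUtility.JavaScriptStringEncode(elementId);

        return "document.getElementById('" + id + "').innerHTML = " + literal + ";";
    }
}
```

The page can then reference the endpoint that returns this text from a script tag placed after the target element, so a slow response only delays the tweet and not the rest of the page.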

Rex M
I have a problem with my code if a URL has a hash character. I tried using \b to define word boundaries but that is not working. I am not sure if the Django example will work for me in C# but I am trying it out.
Brennan
@Brennan as far as I can tell, Hashtags can be alphanumeric. Capture URLs first (that way you catch any URLs with #), then run your hashtag regex on the fragments that weren't picked up by the URL replacer.
Rex M
I am not sure how to do that with Regex in C#. Do you have an example?
Brennan
@Brennan here is an even better way - `(?<!://[^\s]+)#[A-Za-z0-9]+` should match any # followed by one or more alphanumeric characters, unless it is preceded by :// and a run of non-space characters. That should match all hashtags, while any # inside a URL will not match.
Rex M
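
For readers who want to try the lookbehind pattern from the comment above in C#, here is a minimal sketch. It relies on .NET allowing variable-length lookbehind, which many other regex engines do not. The class name and the extra capture group around the tag text are additions for illustration; the search URL is the one used in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashTagLinker
{
    // The negative lookbehind rejects any # that is preceded by "://"
    // and an unbroken run of non-space characters, i.e. a # inside a URL.
    private static readonly Regex HashTag =
        new Regex(@"(?<!://[^\s]+)#([A-Za-z0-9]+)", RegexOptions.Compiled);

    public static string Link(string text)
    {
        return HashTag.Replace(text, m =>
            string.Format("<a href=\"http://twitter.com/search?q={0}\">#{1}</a>",
                Uri.EscapeDataString("#" + m.Groups[1].Value), // "#tag" becomes "%23tag"
                m.Groups[1].Value));
    }
}
```

With the input `see http://url.com/page.htm#anchor #coollinks`, only `#coollinks` is linked and the `#anchor` fragment is left alone. Using `Regex.Replace` with a callback also removes the need to track which tags were already replaced.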
@Brennan please see my edited answer.
Rex M
A: 

Here's how I handle it in a Django app, which works nicely. It preserves whitespace and punctuation without a single false positive (which is a pitfall here, since URLs frequently contain #).

import re
from django.utils import html

# Link URLs first, so the patterns below see anchor markup rather than bare URLs.
post = html.urlize(post)

# Link hashtags; the before/after groups keep the surrounding punctuation intact.
post = re.sub(r'(?P<before>\A|[^a-zA-Z0-9_])#(?P<hash>[a-zA-Z0-9_]+)(?P<after>\Z|[^a-zA-Z0-9_\"\'<]+)', r'\g<before>#<a href="http://search.twitter.com/?q=\g<hash>">\g<hash></a>\g<after>', post)

post = re.sub(r'(?P<before>\A|[^a-zA-Z0-9_])@(?P<username>[a-zA-Z0-9_]+)', r'\g<before>@<a href="http://twitter.com/\g<username>">\g<username></a>', post)
cpharmston
An example in C# would be great.
Brennan
I don't know C#, but the regex should be roughly transferable.
cpharmston
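
As a rough idea of how the two substitutions might carry over, here is an untested-in-production C# sketch. The main syntax differences are that .NET writes named groups as `(?<name>...)` instead of `(?P<name>...)`, references them in the replacement as `${name}` instead of `\g<name>`, and uses `\z` where Python uses `\Z` for the absolute end of the string. The class and method names are made up, and the sketch assumes URLs have already been turned into links by some other step, since .NET has no built-in equivalent of Django's `urlize`.

```csharp
using System.Text.RegularExpressions;

public static class DjangoStylePort
{
    public static string LinkTags(string post)
    {
        // Hashtags: the tag must follow a non-word character (or the start)
        // and must not be followed by a quote or '<', which skips # inside links.
        post = Regex.Replace(post,
            @"(?<before>\A|[^a-zA-Z0-9_])#(?<hash>[a-zA-Z0-9_]+)(?<after>\z|[^a-zA-Z0-9_""'<]+)",
            @"${before}#<a href=""http://search.twitter.com/?q=${hash}"">${hash}</a>${after}");

        // Usernames: same leading-boundary check, no trailing check needed.
        post = Regex.Replace(post,
            @"(?<before>\A|[^a-zA-Z0-9_])@(?<username>[a-zA-Z0-9_]+)",
            @"${before}@<a href=""http://twitter.com/${username}"">${username}</a>");

        return post;
    }
}
```

As in the Django version, the `#` and `@` characters stay outside the generated links.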