I'm producing HTML from Twitter search results. Happily using the Net::Twitter module :-)

One of the rules in Twitter is that all-numeric hashtags are not links. This lets people tweet things like "ur not my #1 anymore" unambiguously, as in this tweet: http://twitter.com/natarias2007/status/11246320622

The solution I came up with looks like:

$tweet =~ s{#([0-9]*[A-Za-z_]+[0-9]*)}{<a href="http://twitter.com/search?q=%23$1">#$1</a>}g;

It seems to work (let's hope), but I'm still curious... how would you do it?

EDIT: the regex I came up with earlier was not correct! See below for a better answer :-)

A: 

Your regexp wouldn't fully capture hashtags in which letters are separated by digits, e.g. #a0a. This version handles them:

my @anchors = ($tweet =~ m/#(\w+)/g);
foreach my $anchor (@anchors)
{
    next unless $anchor =~ m/[a-z]/i;
    $tweet =~ s{#$anchor}{<a href="http://twitter.com/search?q=%23$anchor">#$anchor</a>}g;
}

e.g. consider my $tweet = "hello #123 hello #abc1a hello #a0a";

Your code produces hello #123 hello <a href="http://twitter.com/search?q=%23abc1">#abc1</a>a hello <a href="http://twitter.com/search?q=%23a0">#a0</a>a

and mine produces hello #123 hello <a href="http://twitter.com/search?q=%23abc1a">#abc1a</a> hello <a href="http://twitter.com/search?q=%23a0a">#a0a</a>

Ether
A: 

I didn't realize how complex twitter text is! http://engineering.twitter.com/2010/02/introducing-open-source-twitter-text.html

I found these hashtag-related lines in the Ruby library that's linked in that blog post. Don't know much Ruby -- there may be more...

# Latin accented characters (subtracted 0xD7 from the range, it's a confusable multiplication sign. Looks like "x")
LATIN_ACCENTS = [(0xc0..0xd6).to_a, (0xd8..0xf6).to_a, (0xf8..0xff).to_a].flatten.pack('U*').freeze
REGEXEN[:latin_accents] = /[#{LATIN_ACCENTS}]+/o

# Characters considered valid in a hashtag but not at the beginning, where only a-z and 0-9 are valid.
HASHTAG_CHARACTERS = /[a-z0-9_#{LATIN_ACCENTS}]/io
REGEXEN[:auto_link_hashtags] = /(^|[^0-9A-Z&\/]+)(#|＃)([0-9A-Z_]*[A-Z_]+#{HASHTAG_CHARACTERS}*)/io

I can't see a reason for handling LATIN_ACCENTS separately. If the input is decoded correctly, the \w shortcut should catch all those accented characters. Maybe it's different in Ruby... Maybe they had other reasons...
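As a rough check of the \w claim, here is a minimal Perl sketch. It assumes the tweet arrives as UTF-8 bytes and is decoded with the core Encode module; the sample text and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample tweet as raw UTF-8 bytes
# (\xc3\xa9 is e-acute, \xc3\xaf is i-diaeresis).
my $bytes = "#caf\xc3\xa9 #123 #na\xc3\xafve_1";

# Decode to a character string first; \w then follows Unicode rules.
my $tweet = decode('UTF-8', $bytes);

# Hashtag body: word characters with at least one non-digit among them.
my @tags = $tweet =~ /#(\w*[^\W\d]\w*)/g;

# Same pattern on the undecoded bytes, for comparison.
my @raw_tags = $bytes =~ /#(\w*[^\W\d]\w*)/g;

# Encode again before printing so the terminal receives UTF-8.
print encode('UTF-8', join(', ', @tags)), "\n";
print join(', ', @raw_tags), "\n";
```

On the decoded string the accented tags come back whole and #123 is skipped. On the raw bytes the tags are cut at the first accented letter, which may be what "decoded correctly" comes down to in practice.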

For now, I'm settling for something that looks like this

$tweet =~ s{#([0-9A-Z_]*[A-Z_]+\w*)}{<a href="http://twitter.com/search?q=%23$1">#$1</a>}gi;

Can't say that it's solved yet...
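One way to carry the leading-context group of the Ruby regex over to Perl is sketched below. It is a loose port, not the twitter-text implementation: the sub name link_hashtags is made up, the alternative hash character in the Ruby regex's second group and the Latin accent handling are left out, and the ASCII ranges are spelled out instead of using /i.

```perl
use strict;
use warnings;

# Hypothetical helper; the search URL is the one used earlier in this thread.
sub link_hashtags {
    my ($tweet) = @_;
    # $1: start of string, or a run of characters allowed before a hashtag
    #     (anything except letters, digits, "&" and "/")
    # $2: hashtag body, which must contain at least one letter or underscore
    $tweet =~ s{(^|[^0-9A-Za-z&/]+)#([0-9A-Za-z_]*[A-Za-z_]+\w*)}
               {$1<a href="http://twitter.com/search?q=%23$2">#$2</a>}g;
    return $tweet;
}

print link_hashtags("hello #123 hello #abc1a hello #a0a"), "\n";
print link_hashtags("it&#39;s a #1 day"), "\n";
```

Excluding "&" from the leading group keeps HTML entities such as &#39; from being treated as hashtags, and doing everything in a single s///g pass means text that has already been turned into a link is never scanned again.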

all_numeric_no_hash
A: 

Ether: Thanks a lot for the code, it really helped. There may be a bug in it, though: I don't know if it would handle this tweet correctly: http://twitter.com/KateLeiter/status/12874298805 :-)
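Whether that particular tweet trips it up is not something this sketch can confirm, but two inputs that do seem to confuse the loop approach are a hashtag that is a prefix of another one and a hashtag that appears twice. The tweets below are made-up examples and the sub name link_with_loop is hypothetical; the loop body is the one from the answer above.

```perl
use strict;
use warnings;

# The loop from the answer above, wrapped in a sub for testing.
sub link_with_loop {
    my ($tweet) = @_;
    my @anchors = ($tweet =~ m/#(\w+)/g);
    foreach my $anchor (@anchors) {
        next unless $anchor =~ m/[a-z]/i;
        $tweet =~ s{#$anchor}{<a href="http://twitter.com/search?q=%23$anchor">#$anchor</a>}g;
    }
    return $tweet;
}

# Hypothetical tweets, not the one linked above.
my $prefix   = link_with_loop("#perl vs #perl6");   # "perl" is a prefix of "perl6"
my $repeated = link_with_loop("#perl #perl");       # same tag twice

print "$prefix\n$repeated\n";
```

In the first case the pass for #perl also rewrites the start of #perl6, so the trailing 6 ends up outside the link. In the second case the second pass matches the #perl inside the link text produced by the first pass, which nests one link inside another.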

all_numeric_no_hash