I would like to grab all the hashtags using PHP from http://search.twitter.com/search.atom?q=%23eu-jele%C4%A1%C4%A1i

The hashtags are in the content and title nodes of the feed. Each one is prefixed with #.

The problem I am having is with non-English letters (outside the range a-zA-Z), which appear in the feed as numeric HTML entities.

If you open the feed and view its HTML source, my struggle might be clearer:

    <title>And more: #eu-jele&#289;&#289;i #eu-kiest #ue-wybiera #eu-eleger #ue-alege #eu-vyvolenej #eu-izvoli #eu-elegir #eu-v&#228;lja #eu-elect</title>

Do I need to do something with the title node before I look for my regexp matches?

My ultimate aim is to replace each hashtag with a link to its Twitter search URL, e.g. http://search.twitter.com/search.atom?q=%23eu-jele%C4%A1%C4%A1i

Here is some sample code to help you along.


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>

<body>
<?php 
$title="And more: #eu-jele&#289;&#289;i #eu-kiest #ue-wybiera #eu-eleger #ue-alege #eu-vyvolenej #eu-izvoli #eu-elegir #eu-v&#228;lja #eu-elect";

// this is the regexp that hashtags.org use (http://twitter.pbwiki.com/Hashtags)
// note: the backreference is written \\1, because "\1" in a double-quoted string is an octal escape
$r = preg_replace("/(?:(?:^#|[\s\(\[]#(?!\d\s))(\w+(?:[_\-\.\+\/]\w+)*)+)/"," <a href=\"http://search.twitter.com/search?q=%23\\1\">\\1</a> ", $title);
echo "<p>$r</p>";

$r = preg_replace("/(#.+?)(?:(\s|$))/"," <a href=\"http://search.twitter.com/search?q=\\1\">\\1</a> ", $title);
echo "<p>$r</p>";

// This is my desired end result
echo "<p><a href=\"http://search.twitter.com/search?q=%23eu-jeleġġi\">#eu-jeleġġi</a></p>";
?>

</body>
</html>

Any advice or solution would be greatly appreciated.

+2  A: 

Grab a '#' plus all characters until you hit a whitespace character:

(#.+?)(?:\s)

Or, a little more flexible (also allows end of string):

(#.+?)(?:(\s|$))
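
To see how the two patterns differ, here is a minimal sketch that runs both against a short sample. The helper name find_tags and the sample string are made up for illustration; only the two patterns come from this answer.

```php
<?php
// Hypothetical helper: returns every capture-group-1 match for a pattern.
function find_tags(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $m);
    return $m[1];
}

// Sample text (an assumption; shortened from the title in the question).
$text = "And more: #eu-kiest #ue-wybiera #eu-elect";

// Requires a whitespace character after the tag,
// so a tag at the very end of the string is missed.
$strict = find_tags('/(#.+?)(?:\s)/', $text);

// Also accepts end of string, so the last tag is captured too.
$flexible = find_tags('/(#.+?)(?:(\s|$))/', $text);

print_r($strict);
print_r($flexible);
```

With this sample, the first pattern should return two tags and the second all three.
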
Rex M
+8  A: 

Or just

(#\S+)
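
For the non-English letters in the question, one possible approach is to decode the numeric entities first and only then apply this pattern. The sketch below assumes the title arrives as a string with entities such as &#289; still in it. The helper name link_hashtags, the /u modifier, and the use of html_entity_decode and rawurlencode are my additions, not something stated in the thread.

```php
<?php
// Hypothetical helper, not from the thread: decode entities, then link each tag.
function link_hashtags(string $title): string
{
    // Turn numeric entities such as &#289; into real UTF-8 characters.
    $decoded = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // (#\S+) is the pattern from this answer; the /u modifier is an assumption
    // so that the subject is treated as UTF-8.
    return preg_replace_callback(
        '/(#\S+)/u',
        function (array $m): string {
            // rawurlencode turns "#" into %23 and percent-encodes UTF-8 bytes.
            $url   = 'http://search.twitter.com/search?q=' . rawurlencode($m[1]);
            $label = htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8');
            return '<a href="' . $url . '">' . $label . '</a>';
        },
        $decoded
    );
    // Note: text outside the tags is returned decoded and unescaped;
    // escape it as well before printing it into a real page.
}

echo link_hashtags('And more: #eu-jele&#289;&#289;i #eu-v&#228;lja');
```

Note that (#\S+) also picks up trailing punctuation and mid-word hashes such as test#test, so a stricter pattern may be needed for real tweets.
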
sysrqb
A: 

Why are you using a regexp? Remove anything that's not preceded by a hash, then explode by hash. Regexp seems unnecessarily complicated and ill-suited to the problem.

Perhaps you can explain further why this needs to be done in a regexp?
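
A rough sketch of one reading of that idea, without any regexp: split on '#', drop the text before the first hash, and keep each chunk up to its first whitespace character. The function name tags_by_explode is made up, and this assumes any HTML entities (which themselves contain '#') have already been decoded.

```php
<?php
// Hypothetical helper: one interpretation of "explode by hash", no regexp.
function tags_by_explode(string $text): array
{
    $chunks = explode('#', $text);
    array_shift($chunks); // the text before the first "#" holds no tag

    $tags = [];
    foreach ($chunks as $chunk) {
        // Length of the leading run of non-whitespace characters.
        $len = strcspn($chunk, " \t\r\n");
        if ($len > 0) {
            $tags[] = '#' . substr($chunk, 0, $len);
        }
    }
    return $tags;
}

print_r(tags_by_explode("And more: #eu-kiest #ue-wybiera"));
```

Unlike the stricter regexps in this thread, this version also treats a hash in the middle of a word (test#test) as the start of a tag.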

Adam Davis
Adam, this is a fair comment. I have updated my original post with a more detailed requirement.
Michael
+1  A: 

Here's what I would use :)

(?<![^\s#])(#[^\s#]+)(?=(\s|$))

Example: matching against this string

#test #test#test #test-test test#test
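
A quick way to check what the pattern picks out of that string is the sketch below. Only the pattern and the sample string come from this answer; the variable names are mine.

```php
<?php
// Pattern from this answer: a tag must start at the beginning of the string
// or after whitespace, contain no further "#", and end at whitespace or
// the end of the string.
$pattern = '/(?<![^\s#])(#[^\s#]+)(?=(\s|$))/';
$sample  = '#test #test#test #test-test test#test';

preg_match_all($pattern, $sample, $matches);

// $matches[1] holds the captured tags.
print_r($matches[1]);
```

By my reading this should keep only #test and #test-test: the run #test#test is rejected as a whole, and test#test is skipped because the hash sits in the middle of a word.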

Hope this is helpful.

Chad Scira