I want to do THIS, just a little bit more complicated:

Let's say I have this HTML input:

<a href="http://www.example.com" title="Bla @test blubb">Don't break!</a>
Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon and @blam4c.
You can't reach me at [email protected].

Is there a good regex to replace the Twitter username mentions with links to Twitter, but leave alone both @example (the email address at the bottom) and @test (in the link title, i.e. inside an HTML tag)?

It should probably also avoid adding links inside existing links, i.e. not break this:

<a href="http://www.example.com">Hello @someone there!</a>

My current attempt is to add ">" at the beginning of the string, then use this RegEx:

Search:  '/>([^<]*\s)\@([a-z0-9_]+)([\s,.!?])/i'
Replace: '>\1<a href="http://twitter.com/\2">@\2</a>\3'

Then remove the ">" I added in step 1.

But that won't match anything but the "@blam4c". I know WHY it does so, that's not the problem.
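
For readers who want to reproduce this, here is a minimal Python sketch of the three steps above. The pattern is ported to Python's `re` syntax; the function name `link_mentions_naive` and the address `someone@example.com` are placeholders assumed for illustration. Because `[^<]*` is greedy and each ">" can start only one match, only the last mention in a run of text ends up linked.

```python
import re

# The pattern and replacement from above, ported to Python's re syntax.
PATTERN = re.compile(r'>([^<]*\s)@([a-z0-9_]+)([\s,.!?])', re.IGNORECASE)
REPLACEMENT = r'>\1<a href="http://twitter.com/\2">@\2</a>\3'


def link_mentions_naive(html: str) -> str:
    """Hypothetical helper: prepend ">", substitute, then drop the ">" again."""
    linked = PATTERN.sub(REPLACEMENT, ">" + html)
    return linked[1:]


sample = (
    '<a href="http://www.example.com" title="Bla @test blubb">Don\'t break!</a>\n'
    'Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon and @blam4c.\n'
    # Placeholder address, assumed for illustration only.
    "You can't reach me at someone@example.com.\n"
)

result = link_mentions_naive(sample)
print(result)
```

The greedy group swallows the whole text run and backtracks to the last place where a mention fits, so everything before "@blam4c" lands in group 1 untouched.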

I would like a solution that finds and replaces all Twitter username mentions without destroying the HTML. Might it even be better to code this without a regex?

+3  A: 

First, keep the angle brackets out of your regexps.

Use an HTML parser and XPath to select the text nodes you are interested in processing, then consider a regexp that matches @refs only in those nodes.

I'll leave it to other people to give a specific answer to the regex part.
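
A minimal sketch of that idea, with some assumptions: it uses Python's standard `html.parser` as an event-based stand-in for a DOM plus XPath, the class and function names (`MentionLinker`, `link_mentions`) are hypothetical, and the mention pattern is only one plausible choice. Tags are re-emitted verbatim, so attributes such as `title` are never touched, and text inside an existing `<a>` is skipped.

```python
import re
from html.parser import HTMLParser

# Assumed mention pattern: "@" not directly preceded by a word character,
# so the "@" inside an email address is ignored.
MENTION = re.compile(r'(?<!\w)@(\w+)')


class MentionLinker(HTMLParser):
    """Hypothetical helper: re-emits the HTML, linking mentions in text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # pass entities through as-is
        self.out = []
        self.anchor_depth = 0  # > 0 while inside an existing <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text nodes outside existing links are rewritten.
        if self.anchor_depth == 0:
            data = MENTION.sub(r'<a href="http://twitter.com/\1">@\1</a>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def link_mentions(html: str) -> str:
    parser = MentionLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(link_mentions(
    '<a href="http://www.example.com" title="Bla @test blubb">Don\'t break!</a>\n'
    'Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon and @blam4c.\n'
    '<a href="http://www.example.com">Hello @someone there!</a>'
))
```

This sketch does not special-case `<script>` or `<style>` content, and doctype declarations are dropped; a real DOM parser would handle both more gracefully.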

ddaa
I was afraid you might say this, because it was the same result I came to ;)
BlaM
+2  A: 

I agree with ddaa: there's almost no sane way to attack this without stripping the HTML links out first.

Presumably you'd be starting out with an actual Twitter message, which cannot by definition include any manually entered hyperlinks.

For example, here's how I found this question (the link resolves to this question, so don't bother clicking it!):

Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon and @blam4c. http://bit.ly/2phvZ1

In this case, it's easy:

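// Requires: using System.Text.RegularExpressions;
var msg = "Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon and @blam4c. http://bit.ly/2phvZ1";

// Verbatim string (@"...") so that \w reaches the regex engine unescaped.
// (?<!\w) skips an @ that directly follows a word character.
var html = Regex.Replace(msg, @"(?<!\w)(@(\w+))",
    "<a href=\"http://twitter.com/$2\">$1</a>");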

(this might need some tweaking, I'd like to test it against a corpus, but it seems correct for the average Twitter message)
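
As a quick way to try the pattern on other inputs, here is the same replacement as a Python sketch. The function name `link_plain_tweet` and the address `someone@example.com` are made up for illustration. It shows what the negative lookbehind `(?<!\w)` buys: an "@" that directly follows a word character, as in an email address, is not treated as a mention.

```python
import re

# Same pattern as above: group 1 is "@name", group 2 is the bare name.
MENTION = re.compile(r'(?<!\w)(@(\w+))')


def link_plain_tweet(msg: str) -> str:
    """Hypothetical helper for plain-text tweets (no HTML markup in the input)."""
    return MENTION.sub(r'<a href="http://twitter.com/\2">\1</a>', msg)


tweet = ("Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon "
         "and @blam4c. http://bit.ly/2phvZ1")
print(link_plain_tweet(tweet))

# Placeholder address: the "@" follows "e", so the lookbehind rejects it.
print(link_plain_tweet("You can't reach me at someone@example.com."))
```

Run against HTML input, the same function would also rewrite an "@" inside attribute values such as `title`, which is why it only suits plain-text messages.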

As for your more complicated cases (with HTML markup embedded in the tweets), I have no idea. Way too hard for me.

Jeff Atwood