Hi, everyone. I am working on a school project and have been struggling to strip all links from a feed using Yahoo Pipes.

For instance, removing the anchor from <a href="http://mickey.com">Go to Source</a> in my item.description,

leaving "Go to Source" without the active link.

I am using the Regex module and tried this expression:

#</?a[^>]*>#iu

But no success. Can someone please help me with this?

A: 

HTML is at least a context-free language, and regular expressions cannot correctly parse a context-free language in general, so a regex alone cannot do this reliably. Use a proper HTML parsing library and rework the DOM tree or the event stream (depending on the interface) to get what you want.
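
As a rough illustration of the event-stream approach, here is a minimal sketch using Python's standard html.parser module. Python is only a stand-in here, since Yahoo Pipes itself does not expose such a parser; the names LinkStripper and strip_links_with_parser are made up for this example. It drops the <a> start and end tags and re-emits everything else.

```python
from html import escape
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Hypothetical helper: re-emits HTML but drops <a> start and end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <A> and <a> both arrive as "a".
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped (convert_charrefs=True), so escape it again.
        self.parts.append(escape(data, quote=False))


def strip_links_with_parser(html_text):
    parser = LinkStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(strip_links_with_parser('<a href="http://mickey.com">Go to Source</a>'))
```

This sketch ignores comments, doctypes and processing instructions, which is usually fine for a feed item description but is an assumption about the input.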

Tetha
A: 

Essentially, what you want is:

<a.*?>(.*?)</a>

This will capture the link text in $1. ".*?" is a non-greedy match, meaning that it will match anything, but as few times as possible.

To be extra safe, you may want to accept some spaces in odd places and case options:

<\s*[Aa].*?>(.*?)<\s*/[Aa]\s*>

Even this is not bulletproof, but should handle most cases.

Don't forget the g and s options if you are using the "regex" module rather than the "string regex" one.

Gavin Brock
A: 

HTML is not a regular language, and cannot be matched by regular expressions. You can put something together that might match some of HTML, and will work sometimes, but will unexpectedly fail as soon as something goes a little strange.
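
Two concrete ways this goes wrong, sketched in Python with the expression from the question (minus the # delimiters, with the i flag passed separately). The inputs are made-up examples, not taken from any real feed.

```python
import re

# The question's expression: strip any opening or closing tag starting with "a".
NAIVE = re.compile(r"</?a[^>]*>", re.IGNORECASE)


def naive_strip(html_text):
    return NAIVE.sub("", html_text)


# Failure 1: other tags that start with "a" (abbr, address, ...) are removed too.
print(naive_strip('<abbr title="HyperText">HTML</abbr>'))

# Failure 2: a ">" inside an attribute value ends the match too early,
# leaving the rest of the tag behind as visible garbage.
print(naive_strip('<a title="1 > 0" href="http://mickey.com">Go to Source</a>'))
```

The first call silently eats the abbr markup, and the second leaves part of the tag's attributes in the output.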

Now, sadly, Yahoo Pipes does not appear to include an HTML parser. According to this blog entry, however, you can pipe your data through HTML Tidy, and then use their Fetch Data module which can parse XML to extract your data in a structured format. The tools for dealing with the XML afterwards are not ideal (they don't seem to support anything as useful as XPath or CSS selector queries), but at least you can deal with the data in a structured format that has been parsed by a proper HTML parser.

Brian Campbell