ansaurus

Question

Regex to strip out links using Yahoo Pipes

Answer 1

A:

HTML is at least a context-free language, and a context-free language cannot be parsed correctly with regular expressions, so this is not possible in general. Use a proper HTML parsing library and rework the DOM tree or the event stream (depending on the interface) to fit what you want to do.
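As a rough illustration of the event-stream approach outside Yahoo Pipes, here is a minimal sketch using Python's standard html.parser module. The class name LinkStripper and the helper strip_links are made up for this example. It re-emits everything except the <a> start and end tags, and it assumes you do not need comments or doctype declarations preserved (they are dropped).

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Hypothetical helper: re-emits the input minus <a> start/end tags."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            # Original source text of the tag, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>.
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_links(html):
    """Return html with link tags removed but link text kept."""
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<p>See <a href="http://example.com">the <b>docs</b></a> &amp; more.</p>'
print(strip_links(sample))
```

Because the parser tracks where each tag really starts and ends, markup nested inside the link (the <b> above) survives intact.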

Tetha 2009-12-13 21:41:29

Answer 2

A:

Essentially, what you want is:

<a.*?>(.*?)</a>

This will capture the link text in $1. ".*?" is a non-greedy match, meaning that it will match anything, but as few characters as possible.
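To see what the non-greedy quantifier buys you, here is a small sketch in Python's re module (not Yahoo Pipes itself; the sample string is invented). It compares the pattern above with a greedy variant that uses ".*" instead of ".*?".

```python
import re

html = 'Read <a href="/one">first</a> and <a href="/two">second</a>.'

# Non-greedy: each link is matched on its own.
lazy = re.findall(r"<a.*?>(.*?)</a>", html)

# Greedy: the first ".*" runs on to the last opening tag,
# so only the final link text is captured.
greedy = re.findall(r"<a.*>(.*)</a>", html)

# Replacing each match with group 1 ($1 in Pipes, \1 here) keeps the text.
stripped = re.sub(r"<a.*?>(.*?)</a>", r"\1", html)

print(lazy)      # ['first', 'second']
print(greedy)    # ['second']
print(stripped)  # Read first and second.
```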

To be extra safe, you may want to accept some spaces in odd places and case options:

<\s*[Aa].*?>(.*?)<\s*/[Aa]\s*>

Even this is not bulletproof, but it should handle most cases.

Don't forget the g (global, replace every match) and s (let "." match newlines) options if you are using the "regex" module rather than the "string regex" one.
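For comparison, here is a sketch of what those two options do, again in Python's re module rather than Pipes. In Python, re.sub already replaces every match (the g behaviour), and re.DOTALL plays the role of s. The sample input is invented and the names LINK, with_s and without_s are just for this example.

```python
import re

PATTERN = r"<\s*[Aa].*?>(.*?)<\s*/[Aa]\s*>"

# re.DOTALL is the counterpart of the "s" option: "." also matches newlines.
LINK = re.compile(PATTERN, re.DOTALL)

html = 'Go to <A\n  HREF="/x">the\nsite</A > and < a href="/y">back< /a >.'

# With "s": both links are stripped, including the one spanning lines.
with_s = LINK.sub(r"\1", html)

# Without "s": the link that contains newlines is left untouched.
without_s = re.sub(PATTERN, r"\1", html)

print(repr(with_s))
print(repr(without_s))
```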

Gavin Brock 2010-01-08 16:24:46

Answer 3

A:

HTML is not a regular language, and cannot be matched by regular expressions. You can put something together that might match some of HTML, and will work sometimes, but will unexpectedly fail as soon as something goes a little strange.
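Two concrete ways the simple pattern <a.*?>(.*?)</a> goes wrong, sketched in Python's re module with invented inputs: a different tag whose name starts with "a", and a ">" inside an attribute value.

```python
import re

PATTERN = r"<a.*?>(.*?)</a>"

# 1. "<a" also matches the start of <abbr>, so the match begins too early
#    and the output ends up with unbalanced tags.
abbr_case = '<abbr title="x">HTML</abbr> is <a href="/y">linked</a>'
abbr_result = re.sub(PATTERN, r"\1", abbr_case)

# 2. A ">" inside a quoted attribute ends the "tag" too soon,
#    leaking part of the attribute into the text.
quote_case = '<a title="1 > 0">text</a>'
quote_result = re.sub(PATTERN, r"\1", quote_case)

print(abbr_result)   # HTML</abbr> is <a href="/y">linked
print(quote_result)  #  0">text
```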

Now, sadly, Yahoo Pipes does not appear to include an HTML parser. According to this blog entry, however, you can pipe your data through HTML Tidy, and then use their Fetch Data module which can parse XML to extract your data in a structured format. The tools for dealing with the XML afterwards are not ideal (they don't seem to support anything as useful as XPath or CSS selector queries), but at least you can deal with the data in a structured format that has been parsed by a proper HTML parser.

Brian Campbell 2010-01-08 17:13:36
