I want to clean an HTML page of its tags, using Ruby. I have the raw HTML, and would like to define a list of tags, e.g. ['span', 'li', 'div'], and create an array of regular expressions that I could run sequentially, so that I have

clean_text = raw.gsub(first_regex,' ').gsub(second_regex,' ')...

with two regular expressions per tag (start and end).

Is there a way to do this programmatically (i.e. pre-build the regex array from a tag array and then run the regexes in a fluent chain)?

EDIT: I realize I actually asked two questions at once: the first about transforming a list of tags into a list of regular expressions, and the second about applying a list of regular expressions as a fluent chain. Thanks for answering both questions. I will try to make my next questions single-themed.

+1  A: 

Assuming you have a build_regex method to turn a tag into a regex, this should do it:

tags = %w(span div li)
clean_text = tags.inject(raw) {|text, tag| text.gsub build_regex(tag), ' ' }

The inject call passes the result of each substitution into the next iteration of the block, giving the effect of running each gsub on the string one by one.
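
build_regex is left undefined above, so here is a minimal sketch of one way it could look. Both build_regex and strip_tags are hypothetical helpers written for illustration, and this version only matches bare tags without attributes:

```ruby
# Hypothetical helper (not defined in the answer above): builds a regex
# matching the bare opening or closing form of one tag, e.g. <span> or </span>.
def build_regex(tag)
  /<\/?#{Regexp.escape(tag)}>/
end

# Hypothetical wrapper: threads the text through one gsub per tag via inject.
def strip_tags(raw, tags)
  tags.inject(raw) { |text, tag| text.gsub(build_regex(tag), ' ') }
end

raw = '<div><span>Hello</span> world</div>'
clean_text = strip_tags(raw, %w(span div li))
```

Since every tag is replaced with a space, the result will contain runs of whitespace; something like clean_text.squeeze(' ').strip can tidy that up if needed.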

Daniel Lucraft
+1  A: 

This should produce a single regexp to remove all your tags.

clean_text = raw.gsub(/<\/?(#{tags.join("|")})>/, '')

However, you will have to improve it to support tags with attributes (e.g. <a href="...">); currently only simple tags (e.g. <a>) are removed.
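
As a rough illustration of that limitation, here is a small self-contained sketch. The sample input is made up, and Regexp.union is used here only as a convenient way to escape and join the tag names (it is not part of the one-liner above):

```ruby
tags = %w(span div li)

# Regexp.union escapes each string and joins them with "|";
# .source gives the pattern text so it can be embedded in a larger regex.
tag_pattern = /<\/?(?:#{Regexp.union(tags).source})>/

raw = '<div class="box"><span>Hello</span></div>'
clean_text = raw.gsub(tag_pattern, '')
# The bare <span>, </span> and </div> are removed, but the opening
# <div class="box"> survives because it carries an attribute.
```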

Ropez
This will naively improve it: /<\/?(#{tags.join("|")})[^>]*>/ (it will break if any attribute value contains a '>').
glenn jackman
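
A quick sketch of how the attribute-tolerant pattern behaves, using made-up input. One thing to watch for: without a word boundary, a short tag name such as li also matches the start of a longer one such as link. The \b variant below is my own addition, not something from the comment above, and neither version copes with a '>' inside an attribute value:

```ruby
tags = %w(span div li)

# Pattern from the comment above (with a non-capturing group).
naive   = /<\/?(?:#{tags.join('|')})[^>]*>/
# Assumed refinement: \b stops "li" from matching the start of "link".
bounded = /<\/?(?:#{tags.join('|')})\b[^>]*>/

raw = '<div class="box"><link rel="x"><li>item</li></div>'
naive_result   = raw.gsub(naive, '')    # also strips <link ...>
bounded_result = raw.gsub(bounded, '')  # leaves <link ...> in place

# Known failure case: a '>' inside an attribute value ends the match early.
broken = '<div title="a > b">text</div>'.gsub(bounded, '')
```

For anything beyond simple, well-behaved markup, a real HTML parser is likely to be more reliable than regular expressions.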