Hi all,

I'm trying to build a Sanitize transformer that accepts potentially malformed HTML input containing text that sits outside of any tags at all, such as in this example:

out of a tag<p>in a tag</p>out again!

I want the transformer to wrap any untagged text in <p> tags so that the above transforms into:

<p>out of a tag</p><p>in a tag</p><p>out again!</p>

Unfortunately, I can't figure out how to select the untagged element because it's not a node. I'm sure I'm missing something here. Can someone give me a nudge in the right direction?

A:
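require 'nokogiri'

html = 'out of a tag<p>in a tag</p>out again!'

# Walk every direct child of <body> (text nodes and elements alike)
# and wrap each one's text content in a <p> tag.
Nokogiri::HTML(html).at_css('body').children.
  map { |x| "<p>#{x.text}</p>" }.join
#=> "<p>out of a tag</p><p>in a tag</p><p>out again!</p>"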

Text is stored in text nodes. Because CSS cannot select text nodes, you will have to reach them some other way, such as Nokogiri::XML::Node#children.

Adrian
Awesome, I'll give this a go. Thank you. :)
Aaron B. Russell
@Aaron: If this has answered your question, could you click the check icon near the top of my answer? It will mark this question as answered. Thanks!
Adrian
Hi Adrian, your example works great in an irb session, but in the context of a transformer lambda, the lambda only gets called once, for the `<p>in a tag</p>` segment. :( It turned out to be best to drop the idea of doing this inside a Sanitize transformer; it was simpler to run Nokogiri on its own over the whole HTML document. Points to you for putting me on the right path, though! :)
Aaron B. Russell