cyberneko

Cleaning mixed type <script> tags

I'm cleaning HTML using cyberneko and xerces. However, some $#@@!@@ websites still use BOTH <script>...</script> and <script.../>. So what happens is this: given <script..../> <div> Some Text </div> <script> scripting stuff </script>, neko parses the whole line above as a script, so I get <script..../> &lt div &gt Some Text...
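One possible workaround for the situation described above, sketched under assumptions: expand self-closing `<script .../>` tags into `<script ...></script>` pairs before the markup reaches the parser. The class and method names here (`ScriptTagNormalizer`, `expandSelfClosingScripts`) are hypothetical, and the regex is a simplification that does not handle attribute values containing `>`.

```java
import java.util.regex.Pattern;

public class ScriptTagNormalizer {
    // Matches <script ... /> (case-insensitive); group 1 captures the attributes.
    // Assumption: no attribute value contains a '>' character.
    private static final Pattern SELF_CLOSING_SCRIPT =
            Pattern.compile("<script\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    // Rewrites every self-closing script tag as an explicit open/close pair.
    public static String expandSelfClosingScripts(String html) {
        return SELF_CLOSING_SCRIPT.matcher(html).replaceAll("<script$1></script>");
    }

    public static void main(String[] args) {
        String input = "<script src=\"a.js\"/> <div> Some Text </div> <script> scripting stuff </script>";
        System.out.println(expandSelfClosingScripts(input));
    }
}
```

The normalized string can then be handed to the HTML parser as usual. Regular `<script>...</script>` blocks are left untouched because the pattern only matches when `/>` closes the opening tag.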

XmlSlurper/NekoHTML document fragment parsing - No HTML or BODY tags wanted

Dear All, I am trying to parse the following HTML fragment, and I would like to get the same fragment as output (without HTML and BODY tags). Is this possible? If so, how? Thank you Misha p.s. I am reading here: http://nekohtml.sourceforge.net/faq.html#fragments and I believe I have added the correct options below. However, the output ...

Whether nekohtml handles inline elements containing block elements

When parsing HTML source with nekohtml, does it correctly handle an anchor tag that contains block elements like div, h1, etc.? For example: (HTML source) <a href="http://www.abc.com">link<div>example<a href="http://www.ghj.com">ghj link</a></div><h1>link here</h1></a> Expected Result (After parsing) <a href="ht...

serialize a NekoHTML ElementNSImpl object back to HTML/XML

Hi, Does anyone know if there is a straightforward way to serialize a parsed cyberneko ElementNSImpl object? Here is my example in Clojure of serializing the whole DOM (an HTMLDocumentImpl object). This works, but I have not yet figured out how to do this for an element from the dom (ElementNSImpl). (defn dom->xml [dom] (let [sw ...
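A minimal sketch of one way this might be approached, written in Java rather than Clojure: since `ElementNSImpl` implements `org.w3c.dom.Element`, the standard JAXP `Transformer` should accept it wrapped in a `DOMSource`, the same way it accepts a whole document. The class and method names (`NodeSerializer`, `nodeToString`) are hypothetical, and the demo builds its DOM with the JDK's `DocumentBuilderFactory` instead of NekoHTML so that it runs without extra dependencies.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NodeSerializer {
    // Serializes any DOM node (document or single element) to a string.
    public static String nodeToString(Node node) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Skip the <?xml ...?> header, which is unwanted when serializing a fragment.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in DOM; in the question this would come from the NekoHTML parser.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element body = doc.createElement("body");
        Element div = doc.createElement("div");
        div.setTextContent("Some Text");
        body.appendChild(div);
        doc.appendChild(body);

        // Serialize only the inner element, not the whole document.
        System.out.println(nodeToString(div));
    }
}
```

From Clojure the same calls are available through Java interop, so the element could be passed to an equivalent function in place of the document.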