I'm trying to build a module that transforms HTML and XML via XSLT. I'm using the latest stable version of JTidy. I've pasted the code that deals with JTidy below:

    if (in != null) {
        if (convertToXhtml) {
            Document d = null;
            try {
                Tidy htmlSanitizer = new Tidy();
                htmlSanitizer.setXmlTags(false);
                htmlSanitizer.setShowWarnings(false);
                htmlSanitizer.setInputEncoding("UTF-8");
                htmlSanitizer.setOutputEncoding("UTF-8");
                htmlSanitizer.setXHTML(true);
                d = htmlSanitizer.parseDOM(in, null);
            } catch (Exception e) {
                // [...]
            }
            return new DOMSource(d);
        } else {
            return new StreamSource(in);
        }
    } else {
        return null;
    }

The convertToXhtml flag distinguishes the target of the XSL transform from other input that may be received. It is set to true for both XML and HTML, and I can't change that without refactoring a lot of code.

What I want to find out is whether there's a way to make JTidy leave XML alone and only sanitize HTML. I've tried setXmlTags: if I set it to true, the existing HTML transforms break and the XML still throws an error. If I leave it at false, the HTML transforms come out fine, but when XML is fed into Tidy, it reports an error for each tag it doesn't recognize (line 2 column 1 - Error: <article> is not recognized!), and when the resulting DOMSource is passed to the XSLT, it throws:

javax.xml.transform.TransformerException: java.lang.ArrayIndexOutOfBoundsException: -1

So:

  • Can I make JTidy leave XML alone while still handling the HTML transforms correctly?
  • Can I somehow tell whether the input is HTML or XML, and bypass JTidy for XML?
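
To illustrate the second option, here is a rough sketch of the kind of check I have in mind: peek at the first bytes of the stream and look for an XML declaration. MarkupSniffer, startsWithXmlDeclaration and PEEK_LIMIT are hypothetical names, not part of JTidy. It assumes that the XML inputs always start with <?xml and that the HTML inputs never do, which may not hold for XHTML pages or for XML sent without a declaration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of JTidy): guesses whether a stream holds XML. */
public final class MarkupSniffer {
    // Number of bytes to peek at; an XML declaration sits at the very start.
    private static final int PEEK_LIMIT = 64;

    private MarkupSniffer() {}

    /**
     * Returns true if the stream begins with "<?xml" (after an optional
     * UTF-8 BOM and leading whitespace). The stream is reset afterwards,
     * so the caller can still read it from the beginning.
     */
    public static boolean startsWithXmlDeclaration(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(PEEK_LIMIT);
        byte[] buf = new byte[PEEK_LIMIT];
        int total = 0;
        try {
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < PEEK_LIMIT && (n = in.read(buf, total, PEEK_LIMIT - total)) > 0) {
                total += n;
            }
        } finally {
            in.reset();
        }
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail.
        String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        if (head.startsWith("\u00EF\u00BB\u00BF")) {
            head = head.substring(3); // skip the UTF-8 byte order mark
        }
        return head.trim().startsWith("<?xml");
    }
}
```

In the snippet above, I would wrap in with new BufferedInputStream(in) if it doesn't support mark/reset, and return new StreamSource(...) on that wrapped stream whenever the check returns true, only calling parseDOM otherwise. I'm not sure this heuristic is reliable enough, hence the question.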

Below is the Tidy configuration on my system (obtained from htmlSanitizer.getConfiguration().printConfigOptions()):

Name                        Type       Current Value
=========================== =========  ========================================
add-xml-decl                Boolean    no
add-xml-pi                  Boolean    no
add-xml-space               Boolean    no
alt-text                    String     
ascii-chars                 Boolean    yes
assume-xml-procins          Boolean    no
bare                        Boolean    no
break-before-br             Boolean    no
char-encoding               Encoding   UTF8
clean                       Boolean    no
css-prefix                  Name       
doctype                     DocType    auto
drop-empty-paras            Boolean    yes
drop-font-tags              Boolean    no
drop-proprietary-attributes Boolean    no
enclose-block-text          Boolean    no
enclose-text                Boolean    no
error-file                  Name       
escape-cdata                Boolean    yes
fix-backslash               Boolean    yes
fix-bad-comments            Boolean    yes
fix-uri                     Boolean    yes
force-output                Boolean    no
gnu-emacs                   Boolean    no
hide-comments               Boolean    no
hide-endtags                Boolean    no
indent                      Indent     false
indent-attributes           Boolean    no
indent-cdata                Boolean    no
indent-spaces               Integer    2
input-encoding              Encoding   UTF8
input-xml                   Boolean    no
join-classes                Boolean    no
join-styles                 Boolean    yes
keep-time                   Boolean    yes
language                    Name       
literal-attributes          Boolean    no
logical-emphasis            Boolean    no
lower-literals              Boolean    yes
markup                      Boolean    yes
ncr                         Boolean    yes
new-blocklevel-tags         Tag names  
new-empty-tags              Tag names  
new-inline-tags             Tag names  
new-pre-tags                Tag names  
newline                     Enum       crlf
numeric-entities            Boolean    no
only-errors                 Boolean    no
output-encoding             Encoding   UTF8
output-html                 Boolean    no
output-raw                  Boolean    no
output-xhtml                Boolean    yes
output-xml                  Boolean    no
quiet                       Boolean    no
quote-ampersand             Boolean    yes
quote-marks                 Boolean    no
quote-nbsp                  Boolean    yes
repeated-attributes         Enum       keep-last
replace-color               Boolean    no
show-body-only              Boolean    no
show-errors                 Integer    6
show-warnings               Boolean    no
slide-style                 Name       
split                       Boolean    no
tab-size                    Integer    8
tidy-mark                   Boolean    yes
trim-empty-elements         Boolean    yes
uppercase-attributes        Boolean    no
uppercase-tags              Boolean    no
word-2000                   Boolean    no
wrap                        Integer    68
wrap-asp                    Boolean    yes
wrap-attributes             Boolean    no
wrap-jste                   Boolean    yes
wrap-php                    Boolean    yes
wrap-script-literals        Boolean    no
wrap-sections               Boolean    yes
write-back                  Boolean    no