I'm trying to build a module that transforms HTML and XML via XSLT. I'm using the latest stable version of JTidy. I've pasted the code that deals with JTidy below:
    if (in != null) {
        if (convertToXhtml) {
            Document d = null;
            try {
                Tidy htmlSanitizer = new Tidy();
                htmlSanitizer.setXmlTags(false);
                htmlSanitizer.setShowWarnings(false);
                htmlSanitizer.setInputEncoding("UTF-8");
                htmlSanitizer.setOutputEncoding("UTF-8");
                htmlSanitizer.setXHTML(true);
                d = htmlSanitizer.parseDOM(in, null);
            } catch (Exception e) {
                // [...]
            }
            return new DOMSource(d);
        } else {
            return new StreamSource(in);
        }
    } else {
        return null;
    }
The purpose of the convertToXhtml flag is to differentiate between the target of the XSL transform and other input that may be received. It is set to true for both XML and HTML, and I can't change that without refactoring a lot of code.
What I want to find out is whether there's any way to make JTidy leave XML alone and only sanitize HTML. I've tried the setXmlTags setting, but if I set it to true, the existing HTML transforms break and the XML still throws an error. If I leave it at false, the HTML transforms come out fine, but when XML is fed into Tidy, it reports an error for each tag it doesn't recognize (line 2 column 1 - Error: <article> is not recognized!), and when the resulting DOMSource is passed to the XSLT, it throws:
javax.xml.transform.TransformerException: java.lang.ArrayIndexOutOfBoundsException: -1
So:
- Can I make JTidy leave XML alone while still handling the HTML transforms correctly?
- Can I somehow tell whether the input is HTML or XML, and bypass JTidy for XML?
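For the second question, one direction that might work is to try a strict XML parse first and only fall back to JTidy when that fails. The following is a rough sketch under several assumptions, not a tested fix: the class name XmlSniffer and the method parseIfWellFormed are hypothetical, the input has already been buffered into a byte array (an InputStream cannot be read twice), and external DTDs are deliberately replaced with an empty one so that a DOCTYPE does not trigger a network fetch. It uses only the standard javax.xml.parsers API.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of JTidy or of the code in the question.
final class XmlSniffer {
    private XmlSniffer() {}

    /**
     * Returns the parsed Document if the bytes are well-formed XML,
     * or null if a strict XML parser rejects them (typical for tag-soup HTML).
     */
    static Document parseIfWellFormed(byte[] bytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Assumption: resolve every external entity (e.g. a DTD) to an
            // empty document, so no external resource is fetched.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            // Not well-formed XML (or not parseable at all): let the caller
            // hand the same bytes to the HTML sanitizer instead.
            return null;
        }
    }
}
```

In the snippet above, the idea would be to call parseIfWellFormed on the buffered input: if it returns a Document, wrap it in a DOMSource directly; if it returns null, pass a new ByteArrayInputStream over the same bytes to htmlSanitizer.parseDOM. Note that well-formed XHTML would also skip JTidy under this scheme, which may or may not be acceptable.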
Below is the Tidy configuration on my system (obtained from htmlSanitizer.getConfiguration().printConfigOptions()):
Name Type Current Value
=========================== ========= ========================================
add-xml-decl Boolean no
add-xml-pi Boolean no
add-xml-space Boolean no
alt-text String
ascii-chars Boolean yes
assume-xml-procins Boolean no
bare Boolean no
break-before-br Boolean no
char-encoding Encoding UTF8
clean Boolean no
css-prefix Name
doctype DocType auto
drop-empty-paras Boolean yes
drop-font-tags Boolean no
drop-proprietary-attributes Boolean no
enclose-block-text Boolean no
enclose-text Boolean no
error-file Name
escape-cdata Boolean yes
fix-backslash Boolean yes
fix-bad-comments Boolean yes
fix-uri Boolean yes
force-output Boolean no
gnu-emacs Boolean no
hide-comments Boolean no
hide-endtags Boolean no
indent Indent false
indent-attributes Boolean no
indent-cdata Boolean no
indent-spaces Integer 2
input-encoding Encoding UTF8
input-xml Boolean no
join-classes Boolean no
join-styles Boolean yes
keep-time Boolean yes
language Name
literal-attributes Boolean no
logical-emphasis Boolean no
lower-literals Boolean yes
markup Boolean yes
ncr Boolean yes
new-blocklevel-tags Tag names
new-empty-tags Tag names
new-inline-tags Tag names
new-pre-tags Tag names
newline Enum crlf
numeric-entities Boolean no
only-errors Boolean no
output-encoding Encoding UTF8
output-html Boolean no
output-raw Boolean no
output-xhtml Boolean yes
output-xml Boolean no
quiet Boolean no
quote-ampersand Boolean yes
quote-marks Boolean no
quote-nbsp Boolean yes
repeated-attributes Enum keep-last
replace-color Boolean no
show-body-only Boolean no
show-errors Integer 6
show-warnings Boolean no
slide-style Name
split Boolean no
tab-size Integer 8
tidy-mark Boolean yes
trim-empty-elements Boolean yes
uppercase-attributes Boolean no
uppercase-tags Boolean no
word-2000 Boolean no
wrap Integer 68
wrap-asp Boolean yes
wrap-attributes Boolean no
wrap-jste Boolean yes
wrap-php Boolean yes
wrap-script-literals Boolean no
wrap-sections Boolean yes
write-back Boolean no