We are using JTidy to clean up some HTML for SAX processing. We've had a lot of trouble with spacing issues, as shown in this example:

HTML

<i>stack<span
class="bold">overflow</span></i>

which outputs "stackoverflow"

But...

Post JTidy

<i>stack
<span
class="bold">overflow</span></i>

which outputs "stack overflow" (note the new space)
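To make the difference concrete, here is a minimal sketch using only the JDK's built-in SAX parser rather than JTidy itself. The class name `SaxWhitespaceDemo` and the helpers `textOf` and `collapse` are made up for illustration, and collapsing whitespace runs into a single space is an assumption about how the text is handled downstream. It shows that a line break between "stack" and the `<span>` tag becomes part of the character data, while a line break inside the tag does not:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxWhitespaceDemo {

    // Concatenates every chunk of character data the SAX parser reports.
    static String textOf(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    // Assumed downstream behavior: whitespace runs collapse to one space.
    static String collapse(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        // Line break inside the <span ...> tag: not part of the text.
        String original = "<i>stack<span\nclass=\"bold\">overflow</span></i>";
        // Line break before the tag: becomes part of the text.
        String tidied = "<i>stack\n<span\nclass=\"bold\">overflow</span></i>";

        System.out.println(collapse(textOf(original))); // stackoverflow
        System.out.println(collapse(textOf(tidied)));   // stack overflow
    }
}
```

So any wrapping that moves a line break from inside a tag to between text and a tag changes what the SAX handler sees.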

Does anyone have any advice on how to fix or handle this better? I've been through all the Tidy/JTidy settings and don't see anything that accounts for this issue.

A: 

What settings are you using? Executing JTidy from the command line using its default settings on the snippet you posted prints this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<title></title>
</head>
<body>
<i>stack<span class="bold">overflow</span></i>
</body>
</html>
laz
Agreed. This simple example does work. Our actual content has much longer lists of style attributes, which pointed me to the "wrapping" settings of tidy. Sure enough, that fixed the problem.
jfeust
A: 

It turns out the simple example doesn't really show the issue. The actual cause was the default wrapping setting in Tidy/JTidy, which produced the spacing problem above (and various other spacing issues) when attribute values were very long.

Everything was fixed with:

 // A wrap length of 0 turns off line wrapping.
 jtidy.setWraplen(0);
 // Do not wrap long attribute values across lines.
 jtidy.setWrapAttVals(false);
jfeust