Does anyone know of a good Java library (or a single method) that can strip extra whitespace (line breaks, tabs, etc.) from an HTML file, so that the file basically gets turned into one line?

Thanks.

UPDATE: Looks like there is no library that does that so I created my own open source project for solving this task: http://code.google.com/p/htmlcompressor/

+4  A: 

Personally, I just enabled HTTP compression in the server and I leave my HTML readable.

But for what you want, you could just use String.replaceAll() with a regex that matches what you have specified. Off the top of my head, something like:

small = large.replaceAll("\\s{2,}", " ");
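
A minimal sketch of that one-liner as a runnable class, for anyone who wants to try it. The class name WhitespaceCollapser and the sample markup are made up for illustration. Note that in Java source the backslash has to be doubled ("\\s"), otherwise the string literal does not compile.

```java
class WhitespaceCollapser {
    // Replaces every run of two or more whitespace characters
    // (spaces, tabs, line breaks) with a single space.
    static String collapse(String html) {
        return html.replaceAll("\\s{2,}", " ");
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the question.
        String large = "<table  border=1>\n\t<tr>\n\t\t<td>Hello   World</td>\n\t</tr>\n</table>";
        System.out.println(collapse(large));
    }
}
```

One thing to watch: because the quantifier is {2,}, a lone line break or tab (like the one before </table> in the sample) is left as it is. Using "\\s+" instead would turn those into a space as well, at the cost of also rewriting single spaces.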
Software Monkey
The only problem is that if you have a string that contains spaces, those spaces will be erased as well. It will also break a lot of HTML formatting; for example, "<table border=1.." would turn into "<tableborder=1..", and an HTML parser will choke on that. :P
Suroot
@Suroot no, it's fine. It replaces multiple spaces with just one.
sblundy
@sblundy but "Hello     World" will become "Hello World", which isn't what you want if "Hello     World" is what is supposed to be displayed.
TofuBeer
Well, that's basic compression, and it's what I am currently doing. It gets much deeper than that if you want to do it perfectly and remove every character that can be removed (different rules apply inside and outside of tags). I think it is a common task, and I hope that someone has already done it right.
serg
@Suroot Browsers convert multiple spaces to a single space. For example, your two "Hello Worlds" look the same. If you want multiple spaces, you need to use &nbsp;.
sblundy
Of course, if you rely on multiple spaces for formatting inside a <pre> tag, this will be fubared.
Evan
@Evan good point.
sblundy
Which is why I don't like screwing with the HTML, preferring HTTP compression.
Software Monkey
For HTML, compressing any multispaces to one within a tag, including attribute values, should have no impact. The only corner case is <pre> and tag content whose CSS class has pre-format behavior.
Software Monkey
+1  A: 
input.replaceAll("\\s+", " ");

will convert any run of whitespace into a single space

cobbal
but it will also replace any single space with a single space, won't it? Which is wasted cycles.
Software Monkey
Of course, if you rely on multiple spaces for formatting inside a <pre> tag, this will be fubared.
Evan
+1  A: 

Assuming the goal is to make the HTML smaller in order to reduce the bytes sent over the network, why not have the HTTP server do the work? Read here.

Will this work? Not free unfortunately.

TofuBeer
Already using it. I still would like to have a compression though.
serg
Does it have to be Java? Does it have to be free?
TofuBeer
There's no point at all in whitespace collapsing your HTML if you are applying HTTP compression - the end result will be so close as to not matter for the size of data across the wire. WS collapsing just adds another pre-deployment step.
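
One way to check this claim against your own pages is to gzip both the readable and the collapsed markup and compare the sizes. The sketch below uses only java.util.zip from the JDK; the class name GzipSizeCheck and the generated sample table are hypothetical, and real pages may behave differently from this artificial input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipSizeCheck {
    // Returns the number of bytes the text occupies after gzip compression.
    static int gzippedSize(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        // Hypothetical, heavily indented sample page.
        StringBuilder page = new StringBuilder("<table>\n");
        for (int i = 0; i < 200; i++) {
            page.append("    <tr>\n        <td>row ").append(i).append("</td>\n    </tr>\n");
        }
        page.append("</table>\n");

        String readable = page.toString();
        String collapsed = readable.replaceAll("\\s+", " ");

        System.out.println("readable:  raw " + readable.length() + ", gzipped " + gzippedSize(readable));
        System.out.println("collapsed: raw " + collapsed.length() + ", gzipped " + gzippedSize(collapsed));
    }
}
```

If the two gzipped numbers come out close for your pages, the extra collapsing step is probably not worth it for bandwidth alone.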
Software Monkey
could be doing it to make it harder to read the source of the page...
TofuBeer
+2  A: 

Be careful with that. Text inside pre and textarea elements will be damaged. In addition, every statement of inline JavaScript inside script elements will have to end with a semicolon (;), because line breaks can no longer separate the statements. Lastly, if you wrap inline JavaScript in HTML comments (to avoid buggy behavior in some old browsers), joining everything onto one line will end up commenting out the whole inline script.
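
As a rough illustration of how those elements could be protected, here is a sketch that collapses whitespace only outside pre, textarea and script elements. The class name SafeCollapser and the regex are assumptions made for this example. Matching HTML with a regex is fragile (it does not handle nesting, or a ">" inside an attribute value, for instance), so treat this as a starting point rather than a robust solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SafeCollapser {
    // Hypothetical pattern: a whole <pre>, <textarea> or <script> element,
    // matched case-insensitively and across line breaks.
    private static final Pattern PROTECTED = Pattern.compile(
            "<(pre|textarea|script)\\b.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String collapse(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = PROTECTED.matcher(html);
        int last = 0;
        while (m.find()) {
            // Collapse whitespace in the text before the protected element...
            out.append(html.substring(last, m.start()).replaceAll("\\s+", " "));
            // ...and copy the protected element through untouched.
            out.append(m.group());
            last = m.end();
        }
        // Collapse whatever follows the last protected element.
        out.append(html.substring(last).replaceAll("\\s+", " "));
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>a \n b</p><pre>x\n  y</pre>\n\n<p>c</p>";
        System.out.println(collapse(html));
    }
}
```

Leaving script elements untouched sidesteps both the semicolon and the HTML-comment problems, at the cost of not shrinking the scripts at all.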

Why do you want to do that? If you want to decrease the download size of the HTML, then all you need is a GZIP filter.

cherouvim
+1  A: 

Looks like there is no library that does that so I created my own open source project for solving this task, maybe someone will find it helpful: http://code.google.com/p/htmlcompressor/

serg