We deal with a lot of UGC (1m+/mo), and sometimes our users input long strings with no spaces, which causes web browsers to display the content strangely and breaks the UI here and there.

I am trying to find a way to intelligently and quickly process text of up to 50k characters and insert `<wbr>` tags where appropriate.

I have already built this, but the JVM seems to crap out on larger strings (it chokes somewhere around 20k), so I was thinking about using a Perl script to do the modification and calling it from Java. The trouble is that I do not know how to write Perl :(

Are there any libraries out there that do this? Has anyone else run into this issue?

A: 

What do you mean by "chokes"? Takes too long? Throws an exception?

At any rate, 20K is nothing; the problem is most likely in your code. If you get an exception (or the JVM crashes), can you post the relevant stack trace? If it takes too long, did you profile it? Can you post the results? Seeing some source code would help too.

You are using StringBuffer and / or StringBuilder for this rather than manipulating the String directly, right?
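
To illustrate what a single pass with a StringBuilder could look like, here is a minimal sketch. It is not the asker's code: the names `WordBreaker` and `insertBreaks` and the limit passed in are hypothetical, and it assumes the input is plain text (HTML tags or entities already in the input are not treated specially).

```java
class WordBreaker {
    // Appends <wbr/> after every `limit` consecutive non-whitespace characters.
    // Assumption: `text` is plain text, so existing markup is not skipped over.
    static String insertBreaks(String text, int limit) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        int run = 0; // length of the current run of non-whitespace characters
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(c);
            if (Character.isWhitespace(c)) {
                run = 0;                // whitespace gives the browser a break point
            } else if (++run == limit) {
                out.append("<wbr/>");   // offer an optional break point
                run = 0;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertBreaks("aaaaaaaaaa bb", 4));
    }
}
```

This touches each character once and appends to one buffer, so the cost should grow linearly with the input length rather than blowing up at 20K.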

ChssPly76
+1  A: 

TIMTOWTDI with Perl, but I like:

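my $string = $incrediblylongstring;
my $newstring = '';
# Walk the string in 100-character chunks, putting a break in front of each one.
for (my $i = 0; $i < length($string); $i += 100) {
    my $rest = substr($string, $i, 100);
    $newstring .= '<br />' . $rest;
}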

But you could also offer a more intuitive editor that lets the client enter line breaks themselves with JavaScript. In pseudocode: while the editor has focus, capture the Enter key and insert <br />. I pressed Enter right after the word "typed"
See? :D

Elizabeth Buckwalter
`<wbr>` is not `<br>`: http://www.quirksmode.org/oddsandends/wbr.html
Sinan Ünür
Oops. Just goes to show that I don't know everything. The Perl code still applies, though. I knew someone would come up with a one-liner, or almost a one-liner.
Elizabeth Buckwalter
A: 
#!/usr/bin/perl

use strict;
use warnings;

my $long_string = join ' ', map { 'a' x rand 20_000 } 1 .. 100;

# adjust 40 according to taste
$long_string =~ s{(\S{40})}{$1<wbr/>}g;

print $long_string, "\n";

The whole thing takes about 0.25 seconds to do all the substitutions on a string of approximately 1,000,000 characters.
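
Since the question is about Java, the same substitution can probably be done without shelling out to Perl. The sketch below is an assumed Java port using `java.util.regex`; the class name `WbrRegex` and method name `insertWbr` are made up for illustration, and `String.repeat` needs Java 11 or later.

```java
import java.util.regex.Pattern;

class WbrRegex {
    // Same pattern as the Perl substitution: 40 consecutive non-whitespace characters.
    // Adjust 40 according to taste.
    private static final Pattern LONG_RUN = Pattern.compile("(\\S{40})");

    static String insertWbr(String text) {
        // $1 refers to the captured run, just as in the Perl replacement.
        return LONG_RUN.matcher(text).replaceAll("$1<wbr/>");
    }

    public static void main(String[] args) {
        String input = "a".repeat(100) + " short words";
        System.out.println(insertWbr(input));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every piece of content.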

Sinan Ünür