ansaurus

Question

Using JavaScript to truncate text to a certain size (8 KB)

Answer 1

+2 A:

No, it's not safe to assume that 8KB of text is 8192 characters, since in some character encodings, each character takes up multiple bytes.

If you're reading the data from files, can't you just grab the filesize? Or read it in in chunks of 8KB?
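To illustrate the point about multi-byte characters, here is a minimal sketch that compares a string's length with its size in UTF-8 bytes. It assumes an environment that provides the standard TextEncoder (current browsers and Node.js); utf8ByteLength is a hypothetical helper name, not something from the question.

```javascript
// Hypothetical helper: size of a string once encoded as UTF-8.
// Assumes the standard TextEncoder global is available.
function utf8ByteLength(text) {
    return new TextEncoder().encode(text).length;
}

var ascii = "hello";               // 5 characters, 1 byte each in UTF-8
var accented = "h\u00e9llo";       // 5 characters, but e-acute needs 2 bytes
var cjk = "\u65e5\u672c\u8a9e";    // 3 characters, 3 bytes each in UTF-8

console.log(ascii.length, utf8ByteLength(ascii));       // 5 5
console.log(accented.length, utf8ByteLength(accented)); // 5 6
console.log(cjk.length, utf8ByteLength(cjk));           // 3 9
```

So 8192 characters of mostly non-ASCII text can be well over 8 KB once encoded.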

Dominic Rodger 2009-10-04 08:11:01

Thanks, Dominic - I'm gathering this text from a document using JavaScript's .innerText property (rather than a .txt file or something), so I'm not sure that there's a way to specify "give me 8 KB of data" - that's ideally what I'm looking for, though.

Bungle 2009-10-04 08:22:28

Answer 2

A:

As Dominic says, character encoding is the problem. However, if you can guarantee that you'll only deal with 8-bit chars (unlikely but possible), or assume 16-bit chars and limit yourself to half the available space (i.e. 4096 chars), then you could attempt this.

It's a bad idea to rely on JS for this, though, because it can be trivially modified or ignored, and you also have complications such as escape chars and encoding to deal with. Better to use JS as a first-chance filter and enforce the limit with whatever server-side language you have available (which will also open up compression).

annakata 2009-10-04 08:23:52

Thanks, annakata - it looks like bobince's functions will work in my case. Zemanta should actually just cut off any text over the 8 KB limit, so I'm less concerned about what eventually gets to their API (aside from conserving bandwidth, of course), as the maximal performance gains in this instance will come in limiting to at least roughly 8 KB on the client side.

Bungle 2009-10-04 20:16:04

Answer 3

+4 A:

If you are using a single-byte encoding, yes, 8192 characters = 8192 bytes. If you are using UTF-16, 8192 bytes = 4096 characters(*).

(*: Actually 4096 code units, which is a slightly different thing from characters in the face of surrogates, but let's not worry about that because JavaScript doesn't.)
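To make the footnote about surrogates concrete, here is a small sketch using one character from outside the Basic Multilingual Plane. The variable names are only illustrative.

```javascript
// U+1F600 is stored in a JavaScript string as a surrogate pair.
var smiley = "\uD83D\uDE00";

var codeUnits = smiley.length;               // counts UTF-16 code units
var codePoints = Array.from(smiley).length;  // string iteration goes by code point
var utf16Bytes = smiley.length * 2;          // 2 bytes per UTF-16 code unit

console.log(codeUnits, codePoints, utf16Bytes); // 2 1 4
```

Since .length and .substring() both work in code units, the "2 bytes per unit" arithmetic stays consistent even though one visible character may take two units.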

If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:

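// Encode a JS string as a "byte string": one character (code 0-255) per UTF-8 byte.
function toBytesUTF8(chars) {
    return unescape(encodeURIComponent(chars));
}
// Decode such a byte string back to a JS string; throws URIError on invalid UTF-8.
function fromBytesUTF8(bytes) {
    return decodeURIComponent(escape(bytes));
}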

Now you can truncate with:

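function truncateByBytesUTF8(chars, n) {
    // Keep at most n bytes of the UTF-8 encoded text.
    var bytes = toBytesUTF8(chars).substring(0, n);
    while (true) {
        try {
            return fromBytesUTF8(bytes);
        } catch (e) {
            // The cut fell inside a multibyte sequence: drop one byte and retry.
            bytes = bytes.substring(0, bytes.length - 1);
        }
    }
}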

(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain.)
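As an alternative sketch of the same idea, not the method given in this answer: escape and unescape are now deprecated legacy functions, and where the standard TextEncoder and TextDecoder are available, the truncation can work on real byte arrays and find the character boundary directly instead of retrying in a try-catch. The function name truncateUtf8 is hypothetical.

```javascript
// Hypothetical helper: truncate text so its UTF-8 encoding fits in maxBytes,
// without cutting a multibyte character in half.
// Assumes the standard TextEncoder/TextDecoder globals are available.
function truncateUtf8(text, maxBytes) {
    var bytes = new TextEncoder().encode(text);
    if (bytes.length <= maxBytes) {
        return text; // already fits, nothing to cut
    }
    var end = Math.max(0, maxBytes);
    // bytes[end] is the first byte that will be dropped. If it is a UTF-8
    // continuation byte (bit pattern 10xxxxxx), the cut is mid-character,
    // so back up until the cut sits on a character boundary.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) {
        end--;
    }
    return new TextDecoder("utf-8").decode(bytes.subarray(0, end));
}

console.log(truncateUtf8("h\u00e9llo", 2)); // "h" (the 2-byte e-acute does not fit)
console.log(truncateUtf8("h\u00e9llo", 3)); // "h" followed by e-acute
```

For the 8 KB case in the question, the call would be truncateUtf8(text, 8192).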

If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.

bobince 2009-10-04 13:36:30

This is exactly what I was looking for - works like a charm! Thanks, bobince. Just a quick note for posterity - I'm a little dense so it took me a few minutes to realize that the variables "unicodecharacters" and "utf8bytes" in your functions are just for explanation's sake, and should actually match the arguments to work (i.e., both should be replaced with "s" in the two shorter functions). Thanks again!

Bungle 2009-10-04 20:12:48

Whoops! The perils of cut-and-paste, there. Ta for the catch!

bobince 2009-10-04 20:40:19
