views:

868

answers:

1

I am currently working on the HTML5 File API, and I need to get binary file data. The FileReader's readAsText, and readAsDataURL methods work fine, but readAsBinaryString returns the same data as readAsText.

I need binary data, but im getting a text string. Am I missing something?

+3  A: 

readAsBinaryString says that the data must be represented as a binary string, where:

...every byte is represented by an integer in the range [0..255].

JavaScript doesn't have a "binary" type and so they went with a String with the guarantee that no character stored in the String would be outside the range 0..255. (They could have gone with an array of Numbers instead, but they didn't; perhaps large Strings are more memory-efficient than large arrays of Numbers, since Numbers are floating-point.)

If you're reading a file that's mostly text in a western script (mostly English, for instance), then that string is going to look a lot like text. If you read a file with Unicode characters in it, you should notice a difference, since JavaScript strings are UTF-16 and so some characters will have values above 255, whereas a "binary string" according to the File API spec wouldn't have any values above 255 (you'd have two individual "characters" for the two bytes of the Unicode code point).

If you're reading a file that's not text at all (an image, perhaps), you'll probably still get a very similar result between readAsText and readAsBinaryString, but with readAsBinaryString you know that there won't be any attempt to interpret multi-byte sequences as characters. You don't know that if you use readAsText, because readAsText will use an encoding determination to try to figure out what the file's encoding is and then map it to JavaScript's UTF-16 strings.

You can see the effect if you create a file and store it in something other than ASCII or UTF-8. (In Windows you can do this via Notepad; the "Save As" as an encoding drop-down with "Unicode" on it, by which looking at the data they seem to mean UTF-16; I'm sure Mac OS and *nix editors have a similar feature.) Here's a page that dumps the result of reading a file both ways:

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Show File Data</title>
<style type='text/css'>
body {
    font-family: sans-serif;
}
</style>
<script type='text/javascript'>

    function loadFile() {
        var input, file, fr;

        if (typeof window.FileReader !== 'function') {
            bodyAppend("p", "The file API isn't supported on this browser yet.");
            return;
        }

        input = document.getElementById('fileinput');
        if (!input) {
            bodyAppend("p", "Um, couldn't find the fileinput element.");
        }
        else if (!input.files) {
            bodyAppend("p", "This browser doesn't seem to support the `files` property of file inputs.");
        }
        else if (!input.files[0]) {
            bodyAppend("p", "Please select a file before clicking 'Load'");
        }
        else {
            file = input.files[0];
            fr = new FileReader();
            fr.onload = receivedText;
            fr.readAsText(file);
        }

        function receivedText() {
            showResult(fr, "Text");

            fr = new FileReader();
            fr.onload = receivedBinary;
            fr.readAsBinaryString(file);
        }

        function receivedBinary() {
            showResult(fr, "Binary");
        }
    }

    function showResult(fr, label) {
        var markup, result, n, aByte, byteStr;

        markup = [];
        result = fr.result;
        for (n = 0; n < result.length; ++n) {
            aByte = result.charCodeAt(n);
            byteStr = aByte.toString(16);
            if (byteStr.length < 2) {
                byteStr = "0" + byteStr;
            }
            markup.push(byteStr);
        }
        bodyAppend("p", label + " (" + result.length + "):");
        bodyAppend("pre", markup.join(" "));
    }

    function bodyAppend(tagName, innerHTML) {
        var elm;

        elm = document.createElement(tagName);
        elm.innerHTML = innerHTML;
        document.body.appendChild(elm);
    }

</script>
</head>
<body>
<form action='#' onsubmit="return false;">
<input type='file' id='fileinput'>
<input type='button' id='btnLoad' value='Load' onclick='loadFile();'>
</form>
</body>
</html>

If I use that with a "Testing 1 2 3" file stored in UTF-16, here are the results I get:

Text (13):

54 65 73 74 69 6e 67 20 31 20 32 20 33

Binary (28):

ff fe 54 00 65 00 73 00 74 00 69 00 6e 00 67 00 20 00 31 00 20 00 32 00 20 00 33 00

As you can see, readAsText interpreted the characters and so I got 13 (the length of "Testing 1 2 3"), and readAsBinaryString didn't, and so I got 28 (the two-byte BOM plus two bytes for each character).

T.J. Crowder
So it returns text encoded from 0-255? Is there a way to convert those characters to binary or hex data(0-1 or 0-FF)?
digitalFresh
@digitalFresh: The string *is* the binary data. As you were commenting, I posted an example which may help. JavaScript doesn't have a "binary" type and so they went with a String with the guarantee that no character stored in the string would be outside the range 0..255. (They could have gone with an array of numbers instead, but they didn't.) The example shows how to get the raw value of a "character" from the string.
T.J. Crowder