Sometimes when you download a compiled binary file with the wrong MIME type, or for example run the "more" command on a binary file, you get a bunch of "gobbledygook", for lack of a better term.

For example, this is a snippet of what I see when I run "more" from the command line on a very simple C program compiled with gcc on OS X.

<94>^^^@^@ESC^@^@^@^^^A^@^@<A8>^^^@^@.^@^@^@^N^D^@^@^P ^@^@@^@^@^@^O^D^@^@^L ^@^@H^@^@^@^O^D^@^@^H ^@^@P^@^@^@^O
^D^@^@^@ ^@^@\^@^@^@^C^@^P^@^@^P^@^@p^@^@^@^O^A^@^@b^_^@^@y^@^@^@^O^D^@^@^D ^@^@<82>^@^@^@^O^A^@^@<B6>^^^@^@<88>
^@^@^@^O^A^@^@T^_^@^@<8D>^@^@^@^O^A^@^@T^^^@^@<93>^@^@^@^A^@^A^B^@^@^@^@<99>^@^@^@^A^@^A^B^@^@^@^@^L^@^@^@^M^@^@
^@ ^@dyld_stub_binding_helper^@__dyld_func_lookup^@dyld__mach_header^@_NXArgc^@_NXArgv^@___progname^@__mh_execute
_header^@_average^@_environ^@_main^@_sum^@start^@_exit^@_printf^@^@^@^@

Can someone explain in simple terms why this is? What is happening when a text editor or the plain text MIME type is trying to interpret binary data? Does the ^@ mean anything in this context? Why is there some text and some gobbledygook? Is there any standard for the way this binary data is represented in text? Why is it not simply 1s and 0s?

I can conceptually understand ASCII or Unicode as a representation of characters in a number system that can be reduced down to binary 1s and 0s, a number system that the CPU understands. But at a higher level I am trying to get my head around what binary data is. I guess I want to "see the abstraction", if that makes sense.

Is there a way to "see" binary data in any kind of meaningful way in a text editor?

+4  A: 

There really isn't a significant difference between text and binary files, save for the range of values used within the files. Each value is converted to a character (in a basic text editor) based on the code page used (ASCII, ANSI).

You're seeing the character "^@" because the value of the byte in the file at that position is 0 (the nul character). The nul character is not printable, and so the more program is displaying it using caret notation.
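To make the caret notation concrete, here is a minimal Python sketch of how a pager might render raw bytes. The function name caret_notation is made up for illustration, and the <XX> form for bytes above 127 simply imitates the output shown in the question. Real pagers differ in the details (the output in the question shows ESC spelled out rather than ^[, for instance), so treat this as an approximation, not as what more actually does internally.

```python
def caret_notation(data: bytes) -> str:
    """Render bytes roughly the way a pager shows non-printable characters."""
    out = []
    for b in data:
        if b < 32:
            # Control characters: add 64 to get the letter after the caret,
            # so 0 -> ^@, 1 -> ^A, 30 -> ^^ (this also applies to tab and newline here).
            out.append("^" + chr(b + 64))
        elif b == 127:
            out.append("^?")          # DEL
        elif b < 127:
            out.append(chr(b))        # printable ASCII is shown as-is
        else:
            out.append(f"<{b:02X}>")  # bytes outside ASCII shown as hex (assumed style)
    return "".join(out)


print(caret_notation(b"\x94\x1e\x00\x00"))
print(caret_notation(b"_main\x00_sum\x00"))
```

The first call reproduces the opening characters of the dump in the question, and the second shows why readable names such as _main appear separated by ^@.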

You can open the file in a hex editor, which is a text editor that is more sensitive to binary data. I am not very familiar with Mac software, but a free hex editor can be downloaded at http://hexedit.sourceforge.net/.

Basic text editors/viewers assume that anything you open with them is meant to be read as plain text.

EDIT: Incorporated Mike Spross's corrections re: ^@.

David Andres
I'm trying to understand this too, why is it showing the hex values at all? Why doesn't it simply show the 1's and 0's? Additionally: how can I make it show the 1's and 0's?
Nona Urbiz
@Nona: I'm not really aware of programs that show 0s and 1s, but please be aware that hex values (base 16) are shorthand for binary data (base 2). You can always convert a base 16 value to its base 2 equivalent. Just curious, but do you need to see the 0s and 1s for a specific purpose?
David Andres
Actually, the `^@` represents a `'\0'` character (a byte with a value of 0). In the OP's case, more is displaying the non-printable characters in the file using caret notation. See http://en.wikipedia.org/wiki/Caret_notation.
Mike Spross
@Mike Spross: Thanks for the clarification. I've added this detail to the answer.
David Andres
Thanks for the response. That clears things up quite a bit.
Gordon Potter
+1  A: 

I suggest using the od command on a Unix system. It's not a text editor, but it's still good for analyzing the content of the files. If most of the characters are printable, you could use od -c file.

LE: GNU od(1) man page

Cristian Ciupitu
Thanks for the od command tip. Tried it out. Interesting tool.
Gordon Potter
+2  A: 

Is there a way to "see" binary data in any kind of meaningful way in a text editor?

I suggest a hex format! For example, these are the recommendations for editing binary files in VIM...:

USING XXD

A real binary editor shows the text in two ways: as it is and in hex format. You can do this in Vim by first converting the file with the "xxd" program. This comes with Vim. First edit the file in binary mode:

vim -b datafile

Now convert the file to a hex dump with xxd:

:%!xxd

The text will look like this:

0000000: 1f8b 0808 39d7 173b 0203 7474 002b 4e49  ....9..;..tt.+NI   
0000010: 4b2c 8660 eb9c ecac c462 eb94 345e 2e30  K,.`.....b..4^.0      
0000020: 373b 2731 0b22 0ca6 c1a2 d669 1035 39d9  7;'1.".....i.59.

You can now view and edit the text as you like. Vim treats the information as ordinary text. Changing the hex does not cause the printable character to be changed, or the other way around. Finally convert it back with:

:%!xxd -r

Only changes in the hex part are used. Changes in the printable text part on the right are ignored.

See the manual page of xxd for more information.
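If you are curious how such a dump is put together, the layout above can be approximated in a few lines of Python. This is only a sketch: the function name hexdump is invented here, the 7-digit offset follows the sample output shown above, and it assumes an even width. It is not how xxd itself is implemented.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Return an xxd-style dump: offset, hex bytes in pairs, printable text."""
    lines = []
    # Column width of the hex part: one 4-digit group plus a space per 2 bytes.
    hex_width = (width // 2) * 5 - 1
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Group the hex digits two bytes at a time.
        groups = [chunk[i:i + 2].hex() for i in range(0, len(chunk), 2)]
        hex_part = " ".join(groups)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:07x}: {hex_part:<{hex_width}}  {text_part}")
    return "\n".join(lines)


print(hexdump(bytes.fromhex("1f8b080839d7173b02037474002b4e49")))
```

Fed the first 16 bytes of the sample, this prints the same first line as the dump above.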

Alex Martelli
Thanks for the Vim tips and xxd. They will help with my investigations and curiosity.
Gordon Potter
+1  A: 

Is there a way to "see" binary data in any kind of meaningful way in a text editor?

In short, no. Binary data can mean absolutely anything, and there is no way that a dumb text editor can figure it out. (Indeed, even a smart human cannot figure it out with absolute certainty.)

The normal way to deal with this on a Unix / Linux system is to use the "file" command line utility. This looks at the start of the file and applies heuristics to give you a "best guess" at the file type. Based on that, you see if you can find an appropriate tool to view the file contents. If you don't have a viewer / editor / decompiler, etc that understands the format, the "od" utility can show it to you in various forms; e.g. in hexadecimal, octal, as characters, etcetera.

EDIT: to elaborate on "Binary data can mean absolutely anything":

  • A binary bit pattern that is output by (say) a compiler cannot be distinguished from the identical binary bit pattern output by (say) some random user-defined application. It is theoretically impossible to distinguish between the cases without incontrovertible external knowledge of the process, as I stated above.

  • Recognition of binary bit patterns (e.g. as done by the "file" program) is typically based on detecting "magic numbers" in the first few bytes of a file. So for example, the "magic" for an executable script file is "#!" in the first two bytes. If you write an application that generates a binary file that might have "#!" as its first two characters, this may cause "file" to give false matches, and label your binary files as scripts.

Thus, any recognition of binary file types based solely on their content is uncertain from both the theoretical and practical standpoints.
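The magic-number idea, and the way it can misfire, can be sketched in Python. The table below is a tiny hand-picked list of well-known signatures and the names MAGIC_NUMBERS and guess_type are invented for this example. The real "file" utility uses a far larger database and more elaborate rules, so this is only an illustration of the principle.

```python
# A few well-known signatures found at the very start of a file.
MAGIC_NUMBERS = [
    (b"#!", "script"),
    (b"\x7fELF", "ELF binary"),
    (b"\xfe\xed\xfa\xce", "Mach-O binary (32-bit)"),
    (b"\xce\xfa\xed\xfe", "Mach-O binary (32-bit, byte-swapped)"),
    (b"\x1f\x8b", "gzip data"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
]


def guess_type(data: bytes) -> str:
    """Best guess at a file type based only on its leading bytes."""
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return "unknown"


print(guess_type(b"#!/bin/sh\necho hi\n"))  # a genuine script
print(guess_type(b"#!\x00\x01\x02\x03"))    # arbitrary binary data that happens to start with "#!"
```

Both calls report "script", even though the second input is not a script at all. That is the false match described above.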

But even knowing the binary file type with certainty does not solve the problem. The hard part is that some person has to write a converter for each binary file type that will extract and render the meaning of the file. For some file types, these converters / renderers already exist. For example, there are disassemblers / decompilers for many forms of executable code file formats. But no such converter exists for all binary file types, and the converters that do exist are typically stand-alone applications, not plugin modules for your favourite text editor.

Stephen C
Thanks for the response. "Binary data can mean absolutely anything, and there is no way that a dumb text editor can figure it out. (Indeed, even a smart human cannot figure it out with absolute certainty.)" I imagine this is a factor of time and memory. Obviously a computer can analyze much faster. So yeah, this makes sense to me.
Gordon Potter
@Gordon. What I mean is that it is literally unknowable! Binary data is just bits. Without knowing what process produced those bits, it is theoretically impossible to know with certainty what they mean.
Stephen C
What about structure? If you are able to see the whole (a single binary file, for example), can't a pattern be understood? But I think I get your larger point here. Flip a single bit and the meaning could be dramatically different depending on where the bit was in the sequence. So that is where the uncertainty lies. So is this to say that processors are completely naive in their operation? One bit follows the next and the processor just follows the chain, awaiting instructions in the sequence.
Gordon Potter
+2  A: 

Hello Gordon,
Binary files and text files are all the same thing for a computer, after all they are all 0's and 1's. The way that you see the content of the file depends on the program you use to view it.
Text editors (try to) interpret the 0's and 1's into characters, and show you the characters they get, which you can view as a document. They make an assumption that the files you are giving them are text files, containing ASCII characters. However this is not true for computer files in general, as they can contain any kind of binary data, which is not necessarily ASCII characters. When this happens, instead of giving you an error message, some text editors give you an ugly and incorrect representation of the data in the file (as they do not understand the data anyway).
Hex editors are more of a tool for geeks, as they also give you the computer data in hex (a more readable format compared to binary). Some hex editors also give you the ASCII characters they detect, so it's even more convenient.
Alex gave you a very cool command line tool, but if you want a GUI, a quick Google search for "hex editor" will give you plenty of programs to try.

phunehehe
Thanks for the explanation.
Gordon Potter
+1  A: 

The binary representation of data (just ones and zeroes) would require too much screen space.

Hex or ascii equivalents are more concise, and our brains prefer that.
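To see the difference in size, here is a small Python sketch that prints the same bytes both ways. The helper names as_bits and as_hex are made up for this example.

```python
def as_bits(data: bytes) -> str:
    """Each byte written out as eight binary digits."""
    return " ".join(f"{b:08b}" for b in data)


def as_hex(data: bytes) -> str:
    """Each byte written out as two hexadecimal digits."""
    return " ".join(f"{b:02x}" for b in data)


sample = b"_main"
print(as_bits(sample))  # 01011111 01101101 01100001 01101001 01101110
print(as_hex(sample))   # 5f 6d 61 69 6e
```

Every hex digit stands for exactly four bits, so the hex line carries the same information in a quarter of the digits.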

We should treat a combined hex / ascii display (produced by the od command, for example) as an attempt to show what the data WOULD look like if it was meant to be hex data and what it WOULD look like if it was meant to be TEXT.

But, as Stephen C said, no text editor can accurately decide what the bytes are meant to be, so it provides only a hint.

It's up to the user to look at a display and decide whether the data is text or binary or some mixture of the two.

Binary files sometimes contain a few series of text characters. Especially if the binary file is an executable and has to produce output. The output messages will be stored inside the binary file as sequences of text characters. It's very useful to be able to see what the sequences of text inside a binary file are, and where they're stored.
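Pulling those text sequences out is simple enough to sketch in Python. The function name extract_strings and the default minimum length of 4 are choices made for this example. It only looks for runs of printable ASCII, which is a simplification of what dedicated tools do.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of at least min_len printable ASCII characters."""
    # Bytes 0x20 through 0x7e are the printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Bytes modelled on the tail end of the dump in the question.
blob = b"\x00\x00\x0f\x01_average\x00_environ\x00_main\x00_sum\x00start\x00_exit\x00_printf\x00\x00"
print(extract_strings(blob))
```

Run on those bytes, it returns the symbol names (_average, _environ, _main and so on) and skips the zero bytes and control characters between them.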

pavium
Thanks for the response. "We should treat a combined hex / ascii display (produced by the od command, for example) as an attempt to show what the data WOULD look like if it was meant to be hex data and what it WOULD look like if it was meant to be TEXT." I like this explanation a lot with the subjunctive case of "WOULD". That solidifies things in my mind a bit more.
Gordon Potter
+1  A: 

On a computer, all data is stored in binary, including text files. This means everything is stored using binary bits. There are only two possible binary bits: one and zero.

A text file needs to differentiate between more than two different symbols, so it groups a sequence of binary bits into a more complex unit. For example, a sequence of 8 bits (one byte, with values ranging from 0 to 255) can be interpreted as one character. ASCII itself only defines the values 0 to 127.

Since text files are internally just a series of binary bits (ones and zeros), any series of binary bits can be interpreted as a text file. The output in your example is the result of trying to interpret the binary bits of an executable file as a text file. Most of the characters are junk (they don't make sense as a sequence of ASCII characters), but there are some parts that make sense because they were stored as ASCII strings.

Each file format has a contract for what the binary bits represent. In the case of the executable file, it is much more complex than a simple text file, but the executable file format also includes parts that store ASCII strings like a text file does.
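As a small illustration of that "contract", the Python sketch below reads the same four bytes under several different assumptions. The function name interpret and the particular interpretations chosen are just examples. Nothing in the bytes themselves says which reading is the intended one.

```python
import struct


def interpret(raw: bytes) -> dict:
    """Read the same 4 bytes under a few different 'contracts'."""
    return {
        "ascii text": raw.decode("ascii"),
        "32-bit integer, big-endian": int.from_bytes(raw, "big"),
        "32-bit integer, little-endian": int.from_bytes(raw, "little"),
        # "<2H" means two unsigned 16-bit integers, little-endian.
        "two 16-bit integers": struct.unpack("<2H", raw),
    }


for contract, value in interpret(b"ABCD").items():
    print(f"{contract}: {value!r}")
```

The bytes never change. Only the agreed-upon format decides whether they are the text "ABCD", one large number, or a pair of smaller ones.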

If you view a file using a hex editor, you can see both the binary representation of the file and the ASCII text interpretation of the binary side-by-side. Note that the binary representation displays the data in a more compact form: hexadecimal. A sequence of 4 binary bits is represented with one hexadecimal digit that ranges from 0 to F.

Leftium
Thanks for the ASCII explanation. It makes a lot of sense to me.
Gordon Potter