I have a file that contains both ASCII text and binary content. I would like to extract the text without having to parse the binary content, which is 180MB. Can I simply extract the text for further manipulation? What would be the best way of going about it?

The ASCII is at the very beginning of the file.

+1  A: 

I am not aware of any Java classes that will read the ASCII characters and ignore the rest, but the easiest thing I can come up with here is to use the strings utility (assuming you are on a Unix-based system).

SYNOPSIS strings [ - ] [ -a ] [ -o ] [ -t format ] [ -number ] [ -n number ] [--] [file ...]

DESCRIPTION Strings looks for ASCII strings in a binary file or standard input. Strings is useful for identifying random object files and many other things. A string is any sequence of 4 (the default) or more printing characters ending with a newline or a null. Unless the - flag is given, strings looks in all sections of the object files except the (_TEXT,_text) section. If no files are specified standard input is read.

You could then pipe the output to another file and do whatever you want with it.

Edit: with the additional information that all the ASCII comes at the beginning, it would be a little easier to extract the text programmatically; still, this is faster than writing code.

danben
I am writing a Java web app
Ankur
You will be taking multiple such files as inputs?
danben
Yes there will be many files, over a long period of time.
Ankur
+1  A: 

Assuming you can tell where the end of the ASCII content is, just read characters from the file until you find the end of it, and close the file.

Anon.
The issue is figuring out how to tell where the end of the ASCII content is
Ankur
There isn't an easy way. The best you can do is stop when you encounter the first non-printable character (because you know that's not going to be in the ASCII section), but then you're still likely to pick up some garbage from the start of the binary section before that. It would be best if you knew the exact structure of the binary section - say, if it always started with the same character sequence. Then you could look for that to determine where the ASCII section ends.
Anon.
+1  A: 

Supposing that there is some token which divides the file into the ASCII and binary components (say, "#END#" on a line all by itself), you can do something like the following:

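import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// ...

public static void main(String[] args) {
  // try-with-resources closes the reader (and the underlying file) on exit.
  try (BufferedReader b = new BufferedReader(new InputStreamReader(
      new FileInputStream("object.bin"), StandardCharsets.US_ASCII))) {
    String s;
    // Compare strings with equals(), not !=, and also stop at end of file (null).
    while ((s = b.readLine()) != null && !s.equals("#END#")) {
      // ASCII contents parsed here.
      System.out.println(s);
    }
  } catch (IOException e) {
    System.err.println("kablammo! " + e.getMessage());
  }
}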
John Feminella
€ Seems to be the first character very often, perhaps I could use that.
Ankur
+1  A: 

Have a method that checks whether a particular character meets your criteria (here, I've covered characters that are found on the keyboard). Once you hit a character for which the method returns false, you know you've hit the binary. Note that valid ASCII characters may also form part of the binary so you may end up with a few extra characters at the end.

static boolean isAsciiCharacter(char c) {
    return (c >= ' ' && c <= '~') ||
            c == '\n' ||
            c == '\r';
}
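
As a rough illustration of how that check might be used, here is a minimal sketch that reads bytes from a stream and stops at the first one that fails the same test. The class name AsciiPrefixReader and the methods isAsciiByte and readAsciiPrefix are made up for this example, and it assumes the text is plain single-byte ASCII.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of any library.
class AsciiPrefixReader {

    // Same criteria as isAsciiCharacter, applied to an unsigned byte value (0-255).
    static boolean isAsciiByte(int b) {
        return (b >= ' ' && b <= '~') || b == '\n' || b == '\r';
    }

    // Collects bytes until the first one that fails the check, or until end of stream.
    static String readAsciiPrefix(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && isAsciiByte(b)) {
            text.append((char) b);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file: two text bytes, a newline, then binary data.
        byte[] sample = {'H', 'i', '\n', 0x00, (byte) 0xFF, 'x'};
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(sample))) {
            System.out.print(readAsciiPrefix(in));
        }
    }
}
```

For a real file, the ByteArrayInputStream could be swapped for a FileInputStream; since reading stops at the first non-text byte, the rest of the 180MB is never touched. The caveat above still applies: if the binary section happens to begin with printable bytes, they will be included in the result.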
lins314159
Thanks, that will be very useful
Ankur
+2  A: 

There are 4 libraries to read FITS files in Java here:

Java

nom.tam.fits classes

A Java FITS library has been developed which provides efficient -- at least for Java -- I/O for FITS images and binary tables. The Java libraries support all basic FITS formats and gzip compressed files. Support for access to data subsets is included and the HIERARCH convention may be used.

eap.fits

Includes an applet and application for viewing and editing FITS files. Also includes a general purpose package for reading and writing FITS data. It can read PGP encrypted files if the optional PGP jar file is available.

jfits

The jfits library supports FITS images and ASCII and binary tables. In-line modification of keywords and data is supported.

STIL

A pure java general purpose table I/O library which can read and write FITS binary tables amongst other table formats. It is efficient and can provide fast sequential or random read access to FITS tables much larger than physical memory. There is no support for FITS images.

OscarRyz
+1  A: 

The first 2880 bytes of a FITS file are ASCII header data, representing 36 80-column "card images". There are no line terminator characters, just a 36x80 ASCII array, padded out with blanks if necessary. There may be additional 2880-byte ASCII headers preceding the binary data; you'd have to parse the first set of headers to know how much ASCII to expect.
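
If you do want to pull the header out by hand, a minimal sketch might look like the following. It relies on one detail not spelled out above: per the FITS standard, the last card of a header is the keyword END followed by blanks. The class name FitsHeaderReader and the method readPrimaryHeader are invented for this example, and it only handles the first (primary) header, not any extension headers that may follow the data.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any FITS library.
class FitsHeaderReader {
    static final int BLOCK_SIZE = 2880; // header blocks are always this long
    static final int CARD_SIZE = 80;    // 36 cards per block

    // Returns the 80-character cards of the primary header, excluding the END card.
    static List<String> readPrimaryHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> cards = new ArrayList<>();
        byte[] block = new byte[BLOCK_SIZE];
        while (true) {
            data.readFully(block); // throws EOFException if the file ends mid-block
            for (int offset = 0; offset < BLOCK_SIZE; offset += CARD_SIZE) {
                String card = new String(block, offset, CARD_SIZE, StandardCharsets.US_ASCII);
                if (card.trim().equals("END")) {
                    return cards; // assumption: END card marks the end of the header
                }
                cards.add(card);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a one-block header in memory: one keyword card, then END, blank padded.
        byte[] block = new byte[BLOCK_SIZE];
        Arrays.fill(block, (byte) ' ');
        byte[] simple = "SIMPLE  =                    T".getBytes(StandardCharsets.US_ASCII);
        byte[] end = "END".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(simple, 0, block, 0, simple.length);
        System.arraycopy(end, 0, block, CARD_SIZE, end.length);
        for (String card : readPrimaryHeader(new ByteArrayInputStream(block))) {
            System.out.println(card.trim());
        }
    }
}
```

Because the reader consumes whole 2880-byte blocks, the stream is positioned at the start of the binary data when the method returns. This is only meant to show the layout; it does no validation of keywords or values.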

But I heartily endorse Oscar Reyes' advice to use an existing package to decode FITS files! Two of the packages he mentioned are hosted by NASA's Goddard Space Flight Center, who are also responsible for maintaining the FITS format. That's about as definitive a source as you can get.

Jim Lewis