Hi all,

I am writing a utility in Java that reads a stream which may contain both text and binary data. I want to avoid blocking on I/O. To do that, I create a thread that keeps reading the data (and waits for it) and puts it into a buffer, so clients can check availability and stop waiting whenever they want (by closing the input stream, which raises an IOException and ends the wait). This works very well for reading bytes out of it, as far as binary data is concerned.

Now I also want to make it easy for the client to read lines from it, with methods like '.hasNextLine()' and '.readLine()'. Without using a blocking stream such as a buffered stream, (Q1) how can I check whether binary data (byte[]) contains a valid Unicode line (expressed as the length of the first line)? I looked around the String/Charset API but could not find it (or did I miss it?). (NOTE: If possible, I don't want to use a non-built-in library.)

Since I could not find one, I tried to create one. Without making it too complicated, here is my algorithm.

1). I scan from the start of the byte array until I find '\n', or '\r' not followed by '\n'.
2). Then I cut the byte array from the start to that point and use it to create a string (with a Charset if one is specified) using 'new String(byte[])' or 'new String(byte[], Charset)'.
3). If that succeeds without an exception, we have found the first valid line and return it.
4). Otherwise, these bytes may not be a string, so I look further for another '\n' or '\r' w/o '\n', and the process repeats.
5). If the search reaches the end of the available bytes, I stop and return null (no valid line found).
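
The five steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names (`FirstLineFinder`, `firstValidLine`) are hypothetical, the encoding is assumed to be ASCII-compatible (UTF-8, Latin-1 and similar, so that 0x0A and 0x0D really mean LF and CR), and a strict `CharsetDecoder` is swapped in for `new String(...)` so that step 3 can actually fail on malformed bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; names are illustrative only.
class FirstLineFinder {

    /**
     * Returns the first terminated line in buf[0..len) that decodes cleanly
     * in the given charset, or null if no candidate decodes.
     */
    static String firstValidLine(byte[] buf, int len, Charset charset) {
        for (int i = 0; i < len; i++) {
            byte b = buf[i];
            if (b != '\n' && b != '\r') continue;
            // A '\r' at the very end is ambiguous: a '\n' may still arrive.
            if (b == '\r' && i + 1 == len) return null;
            // Leave "\r\n" to be handled when the scan reaches its '\n'.
            if (b == '\r' && buf[i + 1] == '\n') continue;
            // Exclude the '\r' of a "\r\n" pair from the candidate line.
            int end = (b == '\n' && i > 0 && buf[i - 1] == '\r') ? i - 1 : i;
            try {
                // Steps 2 and 3: decode strictly; malformed input throws.
                return charset.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(buf, 0, end))
                        .toString();
            } catch (CharacterCodingException e) {
                // Step 4: not valid text up to here; try the next terminator.
            }
        }
        return null; // Step 5: no valid line in the available bytes.
    }
}
```

Note that when step 4 retries, the candidate still starts at offset 0, so a line accepted on a later attempt contains the earlier terminator inside it.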

My question is (Q2): is the algorithm above adequate?

Just as I was about to implement it, I searched on Google and found that there are many other code points for a new line, for example U+2424, U+0085, U+000C, U+2028 and U+2029.

So my last question is (Q3): do I really need to detect these code points? If I do, will it increase the chance of false positives?

I am well aware that recognizing text within binary data is never absolute. I am just trying to find the best balance.

To sum up, I have an array of bytes and I want to extract the first valid line of text from it, with or without a specific Charset. This must be done in Java without using any non-built-in library.

Thank you all in advance.

+1  A: 

The java.text package is designed for this sort of natural language operation. The BreakIterator.getLineInstance() static method returns an iterator that detects line breaks. You do need to know the locale and encoding for best results, though.

Steve Gilham
Thanks for answering. I took a look at it, and it seems that I need to have my data in either a `String` or a `CharacterIterator`, which is exactly the problem here (I can't get that).
NawaMan
+1  A: 

Q2: The method you use seems reasonable enough to work.

Q1: Can't think of something better than the algorithm that you are using

Q3: I believe it will be enough to test for \r and \n. The others are too exotic for usual text files.

rslite
+4  A: 

I am afraid your problem is not well-defined. You write that you want to extract the "first valid string line" from your data. But whether some byte sequence is a "valid string" depends on the encoding. So you must decide which encoding(s) you want to use in testing.

Sensible choices would be:

  • the platform default encoding (Java property "file.encoding")
  • UTF-8 (as it is most common)
  • a list of encodings you know your clients will use (such as several Russian or Chinese encodings)

What makes sense will depend on the data; there's no general answer.

Once you have your encodings, the problem of line termination should follow, as most encodings have rules on what terminates a line. In ASCII or Latin-1, LF, CR-LF and LF-CR would suffice. In Unicode, you need all the ones you listed above.

But again, there's no general answer, as new line codes are not strictly regulated. Again, it would depend on your data.
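
As a rough illustration of the Unicode case: once the bytes have been decoded to a String, Java 8 and later offer the regex linebreak matcher `\R`, which matches CR LF as a unit as well as LF, VT (U+000B), FF (U+000C), CR, NEL (U+0085), U+2028 and U+2029. It does not match U+2424, which is a visible symbol for a newline rather than a line terminator. The class and method names below are hypothetical, and whether this set is right still depends on your data.

```java
import java.util.Arrays;

// Hypothetical helper; assumes the input has already been decoded to text.
class UnicodeLines {

    /** Splits text on any Unicode linebreak sequence (regex \R, Java 8+). */
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String text = "a\r\nb\u2028c\u0085d";
        // prints [a, b, c, d]
        System.out.println(Arrays.toString(splitLines(text)));
    }
}
```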

sleske
Thanks for your answer. I mentioned in my question that a Charset (which contains the encoding information) might be used. If it is not given (this I didn't put in my question), I will use the default one. About Unicode, do you think I should try to detect all of the above? Thanks again for the answer.
NawaMan
There is no such thing as a "default" charset.
Jonathan Feinberg
@Jonathan: Well, *static Charset Charset.defaultCharset()* comes pretty close, doesn't it? See http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html.
sleske
Yes, that method exists, but is not useful (given that the data could have come from some other machine, where the result of that method is different from the result on your machine).
Jonathan Feinberg
@Jonathan: Yes, of course. That's why I wrote "What makes sense will depend on the data". Maybe he knows the data was locally generated... . But normally I'd stay away from defaultCharset as well.
sleske
+2  A: 

First of all, let me ask you a question: is the data you are trying to process legacy data? In other words, are you responsible for the input stream format that you are trying to consume here?

If you do control the input format, then you probably want to take the binary vs. text decision out of the Q1 algorithm. For me, this algorithm has one troubling part.

    `4). Otherwise, these bytes may not be a string, so I look further to 
another '\n' or '\r' w/o '\n'. and this process repeat.`

Are you discarding the input before the line terminator and taking the bytes that start immediately after it, or re-evaluating the string, which now contains two line terminators? If the former, you may have broken the binary data interface; if the latter, you may still not parse the text correctly.

I think having well-defined markers for binary data and text data in your stream will simplify your algorithm a lot.

A couple of words on the String constructor: new String(byte[], Charset) will not throw any exception if the byte array is not valid in that Charset. Instead, malformed bytes are replaced with the charset's replacement character (typically U+FFFD, often displayed as a question mark), which is probably not what you want. If you want an exception, you should use CharsetDecoder.
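
To make the difference concrete, here is a small sketch contrasting the two behaviours on a byte sequence that is not valid UTF-8. The class and method names (`StrictDecode`, `decodeOrNull`) are made up for the example, and UTF-8 is just an assumed charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class StrictDecode {

    /** Decodes strictly; returns null if the bytes are not valid in the charset. */
    static String decodeOrNull(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // malformed or unmappable input
        }
    }

    public static void main(String[] args) {
        byte[] notUtf8 = { 'a', (byte) 0xFF, 'b' };
        // Lenient: the bad byte is silently replaced with U+FFFD.
        String lenient = new String(notUtf8, StandardCharsets.UTF_8);
        System.out.println(lenient.indexOf('\uFFFD') >= 0);
        // Strict: the same input is rejected.
        System.out.println(decodeOrNull(notUtf8, StandardCharsets.UTF_8) == null);
    }
}
```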

Also note that in Java 6 there are two constructors that take a charset: String(byte[] bytes, String charsetName) and String(byte[] bytes, Charset charset). I did a simple performance test a while ago, and in that test the constructor taking a String charsetName was orders of magnitude faster than the one taking a Charset object (question to Sun: bug or feature?).

Alexander Pogrebnyak
Thanks for answering :D. (1) I am creating a non-blocking input stream, and I plan to use it for shell access and network connections, which I have no control over. (2) The reason I look for another '\n' is that I think it might be possible for 0x0D and 0x0A to be part of a valid Unicode sequence, though I am not sure. So, just in case, binary data that contains them may still be one valid line of text. That is why I did this. (3) After I posted, I actually found that out and have changed to CharsetDecoder :D. Thanks anyway.
NawaMan
+1  A: 

I would try this:

  • make the IO reader put strings/lines into a thread safe collection (for example some implementation of BlockingQueue)
  • the main code has only reference to the synced collection and checks for new data when needed, like queue.peek(). It doesn't need to know about the io thread nor the stream.

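Some Java code (myStreamFromSomewhere and the elided setup are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class IORunner extends Thread {
  private final BufferedReader reader;
  private final BlockingQueue<String> outputQueue;

  IORunner(InputStream in, BlockingQueue<String> outputQueue) {
    this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    this.outputQueue = outputQueue;
  }

  @Override
  public void run() {
    try {
      String line;
      while ((line = reader.readLine()) != null)
        outputQueue.put(line); // blocks while a bounded queue is full
    } catch (IOException e) {
      // stream closed or failed: stop reading
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt and exit
    }
  }
}

class Main {
  public static void main(String[] args) throws InterruptedException {
    ...
    BlockingQueue<String> dataQueue = new LinkedBlockingQueue<>();
    new IORunner(myStreamFromSomewhere, dataQueue).start();

    while (true) {
      if (!dataQueue.isEmpty()) { // can also use .peek() != null
        System.out.println(dataQueue.take());
      }
      Thread.sleep(1000);
    }
  }
}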
  • The collection decouples the input stream from the main code. You can also limit the number of lines stored (and the memory used) by creating the queue with a limited capacity (see the BlockingQueue documentation).
  • The BufferedReader handles the checking for new lines for you :) The InputStreamReader handles the charset (I recommend setting one yourself, since the default one changes depending on the OS etc.).
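
If the client should see something shaped like the '.hasNextLine()' and '.readLine()' methods from the question, a thin non-blocking view over the queue might look like the sketch below. The class name `LineQueueView` is made up for the example, and it assumes a single consumer thread (with several consumers, a line seen by hasNextLine() could be taken by someone else before readLine() runs).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical wrapper; assumes one consumer thread.
class LineQueueView {
    private final BlockingQueue<String> queue;

    LineQueueView(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    /** True if a complete line is already available; never blocks. */
    boolean hasNextLine() {
        return queue.peek() != null;
    }

    /** Returns the next line, or null if none is available yet; never blocks. */
    String readLine() {
        return queue.poll();
    }

    public static void main(String[] args) {
        // Capacity 100 bounds the number of buffered lines.
        BlockingQueue<String> q = new LinkedBlockingQueue<>(100);
        LineQueueView view = new LineQueueView(q);
        q.offer("first line");
        while (view.hasNextLine()) {
            System.out.println(view.readLine());
        }
    }
}
```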
Melv
+1  A: 

I just solved this to get a test stub working for Datagram. I did byte[] varName = String.getBytes(); then final int len = varName.length; then sent the int through a DataOutputStream, followed by the byte array. On the receiving end, just do readInt() and then read that many bytes.

Not a lib, and not hard to do either. Just read up on readUTF and do what they did for the bytes.

The string should construct from the byte array recovered that way; if not, you have other problems. If the string can be reconstructed, it can be buffered ... no?

You may be able to just use readUTF() / writeUTF() in the Data streams; why not?
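
A minimal round trip with writeUTF() / readUTF() could look like this. It is only a sketch: the class name `UtfRoundTrip` is made up, in-memory byte array streams stand in for the real connection, and it only helps when you control both ends. writeUTF() writes a two-byte length followed by the string in Java's modified UTF-8, so the encoded form is limited to 65535 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical example class; in-memory streams stand in for a real connection.
class UtfRoundTrip {

    /** Encodes a string as a 2-byte length prefix plus modified UTF-8 bytes. */
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back one string written by writeUTF(). */
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = encode("one line of text");
        System.out.println(decode(wire));
    }
}
```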

{ edit: per OP's request }

// Sending end (`sink` is a placeholder for whatever OutputStream you write to)

String data = "fdsfjal;sajssaafe8e88e88aa"; // fingers pounding keyboard
byte[] payload = data.getBytes("UTF-8"); // the length sent must be in bytes, not chars
DataOutputStream dataOutputStream = new DataOutputStream(sink);
dataOutputStream.writeInt(payload.length);
dataOutputStream.write(payload);
dataOutputStream.flush();
dataOutputStream.close();

// rcv end (`source` is the matching InputStream)

DataInputStream dataInputStream = new DataInputStream(source);
final int sizeToRead = dataInputStream.readInt();
byte[] datasink = new byte[sizeToRead];
dataInputStream.readFully(datasink); // plain read() may return fewer bytes than requested
dataInputStream.close();

// constructor
// String(byte[] bytes, int offset, int length, String charsetName)

final String result = new String(datasink, 0, sizeToRead, "UTF-8");

// continue coding here

Do me a favor, keep the heat off of me. This was typed very fast right in the posting tool, so the code probably contains substantial errors. It's faster for me to just explain it by writing Java; there will be others who can translate it to other languages, which you can too if you want it in another codebase. You will need exception trapping and so on; just do a compile and start fixing errors. When you get a clean compile, start over from the beginning and look for blunders. (That's what a blunder is called in engineering: a blunder.)

Nicholas Jordan
Thanks for answering, but would you mind showing me some code? I just can't picture it. :-)
NawaMan
Thanks Nicholas. I appreciate your time and energy. But in this case, I do not have control over the construction of the bytes, so I can't send the length of the string. In fact, I don't even know if it is a string (I didn't spot the length thing on my first read). Using readUTF seems interesting, as Java will check whether the bytes are a string for me. I will try that. Thanks again for your help.
NawaMan
Sounds like a regex problem (or approach, whatever). Check out www.regexbuddy.com and any other resources you can find, and be prepared for a swoop in the development effort. I have to go in a matter of minutes, but a brief re-read of your original post suggests that String.split() will do what you want. Manually writing your own can be done, but using established tools will likely get more done. There should be a regex to split on line.separator; you may also be able to use a reader, as well as java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
Nicholas Jordan