views:

410

answers:

8

I have a Java socket connection that receives data intermittently. The number of bytes in each burst varies, and the data may or may not be terminated by a well-known character (such as CR or LF).

I'm attempting to build a string out of each burst of data. What is the fastest way (in terms of speed, not memory) to build a string that will later need to be parsed?

I began by using a byte array to store the incoming bytes, then converting them to a String with each burst, like so:

byte[] message = new byte[1024];
...
message[i] = b; // b is the next byte read from the socket
i++;
...
String messageStr = new String(message);
...
// parse the string here

The obvious disadvantage of this is that some bursts may be longer than 1024. I don't want to arbitrarily create a larger byte array (what if my burst is larger?).

What is the best way of doing this? Should I create a StringBuilder object and append() to it? That way I don't have to convert from StringBuilder to String (since the former has most of the methods I need).

Again, speed of execution is my biggest concern.

TIA.

+2  A: 

StringBuilder is your friend. Add as many characters as needed, then call toString() to obtain the String.

KLE
+8  A: 

Note that as you're transmitting across network layers, your speed of conversion may not be the bottleneck. It would be worth measuring, if you believe this to be important.

Note (also) that you're not specifying a character encoding in your conversion from bytes to String (via characters). I would enforce that somehow, otherwise your client/server communication can become corrupted if/when your client/server run in different environments. You can enforce that via JVM runtime args, but it's not a particularly safe option.

Given the above, you may want to consider StringBuilder(int capacity) to configure it in advance with an appropriate size, such that it doesn't have to reallocate on the fly.
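As a rough illustration of the two points above, the sketch below decodes only the bytes actually received, names the charset explicitly, and pre-sizes the StringBuilder. UTF-8 and the 1024 capacity are assumptions chosen for the example; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Decode only the first `length` bytes, with an explicit charset
    // (UTF-8 is an assumption; use whatever the peer really sends).
    static String decode(byte[] buffer, int length) {
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[1024];
        byte[] burst = "h\u00e9llo\r\n".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(burst, 0, buffer, 0, burst.length);

        // Pre-size the builder (1024 is an assumed typical message size)
        // so that appending a typical burst does not force a reallocation.
        StringBuilder message = new StringBuilder(1024);
        message.append(decode(buffer, burst.length));
        System.out.println(message.toString().trim());
    }
}
```

Passing the offset and length also avoids turning the unused tail of the buffer into characters.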

Brian Agnew
+1 for mentioning character encoding
kdgregory
Read the answer regarding the InputStreamReaderReader wrapped around a BufferedInputStream. Trim the buffer size if you need.
KarlP
I believe you mean the InputStreamReader (to be clear)
Brian Agnew
+2  A: 

I would create a "small" array of characters and append characters to it. When the array is full (or transmission ends), use the StringBuilder.append(char[] str) method to append the content of the array to your string.

Now for the "small" size of the array: you will need to try various sizes and see which one is fastest in your production environment (performance may depend on the JVM, OS, processor type and speed, and so on).

EDIT: Other people mentioned ByteArrayOutputStream, I agree it is another option as well.
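A minimal sketch of the small-array approach, assuming the data is already available as characters through a Reader. The class name, method name, and chunk size are hypothetical. It uses the append(char[], int, int) overload rather than append(char[]) so that a partially filled array does not contribute stale characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ChunkedAppendDemo {
    // Drain a Reader through a small char[] and append each chunk to a StringBuilder.
    // chunkSize must be greater than zero.
    static String readAll(Reader in, int chunkSize) {
        char[] chunk = new char[chunkSize];
        StringBuilder out = new StringBuilder();
        try {
            int n;
            while ((n = in.read(chunk, 0, chunk.length)) != -1) {
                out.append(chunk, 0, n); // append only the chars actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for the socket in this sketch.
        System.out.println(readAll(new StringReader("one burst of data"), 4));
    }
}
```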

Adrian
+2  A: 

You may wish to look at ByteArrayOutputStream if you are dealing with bytes rather than characters.

I generally use a ByteArrayOutputStream to assemble a message, then use toString/toByteArray to retrieve it when the message is finished.

Edit: ByteArrayOutputStream can handle various character set encodings through the toString call.
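A short sketch of that pattern, assuming the message arrives as several byte bursts and is UTF-8 encoded (both are assumptions for the example; the class and method names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class BaosDemo {
    // Collect every burst into one growing byte buffer, then decode once at the end.
    static String assemble(byte[][] bursts) {
        ByteArrayOutputStream message = new ByteArrayOutputStream(1024);
        for (byte[] burst : bursts) {
            message.write(burst, 0, burst.length); // the buffer grows as needed
        }
        try {
            // toString(String charsetName) decodes with the named charset
            return message.toString(StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // every JVM must support UTF-8
        }
    }

    public static void main(String[] args) {
        byte[][] bursts = {
            "HELLO ".getBytes(StandardCharsets.UTF_8),
            "WORLD".getBytes(StandardCharsets.UTF_8)
        };
        System.out.println(assemble(bursts));
    }
}
```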

Scott Markwell
A: 

Personally, independent of language, I would send all characters to an in-memory data stream and, once I need the string, read all characters from this stream into a string. As an alternative, you could use a dynamic array, making it bigger whenever you need to add more characters. Even better, keep track of the actual length and grow the array in blocks instead of by single characters. Thus, you would start with 1 character in an array of 1000 chars. Once you reach 1001 characters, the array needs to be resized to 2000, then 3000, 4000, etc...
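The block-growth idea described above could be sketched as follows. This is only an illustration of the technique; BlockBuffer and its methods are hypothetical names, and the 1000-char block size is taken from the example in the text.

```java
import java.util.Arrays;

class BlockBuffer {
    private static final int BLOCK = 1000; // growth step, as in the text
    private char[] data = new char[BLOCK];
    private int length = 0;                // number of chars actually stored

    void add(char c) {
        if (length == data.length) {
            // Array is full: grow by one whole block rather than by one char
            data = Arrays.copyOf(data, data.length + BLOCK);
        }
        data[length++] = c;
    }

    int capacity() { return data.length; }

    int length() { return length; }

    // Build the String from the used part of the array only
    String toText() { return new String(data, 0, length); }

    public static void main(String[] args) {
        BlockBuffer buffer = new BlockBuffer();
        for (int i = 0; i < 1001; i++) {
            buffer.add('x');
        }
        System.out.println(buffer.length() + " chars stored, capacity " + buffer.capacity());
    }
}
```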

Fortunately, several languages including Java have a special built-in class that specializes in this. These are the string builder classes. Whatever technique they use isn't that important, but they have been created to boost performance, so they should be your fastest option.

Workshop Alex
A: 

Have a look at the Text class. It's faster (for the operations you perform) and more deterministic than StringBuilder.

Note: the project containing the class is aimed at RTSJ VMs. It is perfectly usable in standard SE/EE environments though.

yawn
+4  A: 

First of all, you are making a lot of assumptions about the character encoding that you receive from your client. Is it US-ASCII, ISO-8859-1, UTF-8?

Because a Java String is not a sequence of bytes, you should make explicit decisions about character encoding when building portable String serialization code. For this reason you should NEVER use StringBuilder to convert bytes to a String. If you look at the StringBuilder interface you will notice that it does not even have an append( byte ) method, and that's not because the designers just overlooked it.

In your case you should definitely use a ByteArrayOutputStream. The only drawback of using the stock ByteArrayOutputStream is that its toByteArray() method returns a copy of the array held internally by the object. For this reason you may create your own subclass of ByteArrayOutputStream and provide direct access to the protected buf member.

Note that if you don't use the default implementation, you must specify the byte array bounds in your String constructor, because the internal buffer is usually larger than the data written to it. Your code should look something like this:

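MyByteArrayOutputStream message = new MyByteArrayOutputStream( 1024 );
...
message.write( b ); // b is the next byte read from the socket
...
// buf is protected, so this line must live inside MyByteArrayOutputStream
// or go through an accessor that the subclass exposes
String messageStr = new String(message.buf, 0, message.size(), "ISO-8859-1");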

Replace ISO-8859-1 with the character set that suits your needs.
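One possible shape for such a subclass is sketched below. The decode method is a hypothetical addition, not part of the JDK; it reads the inherited protected fields buf and count from inside the subclass, so no copy of the array is made.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MyByteArrayOutputStream extends ByteArrayOutputStream {
    MyByteArrayOutputStream(int size) {
        super(size);
    }

    // Hypothetical helper: decode straight from the internal buffer.
    // buf may be longer than the data, so the bounds 0..count are required.
    synchronized String decode(Charset charset) {
        return new String(buf, 0, count, charset);
    }
}

class NoCopyDemo {
    public static void main(String[] args) {
        MyByteArrayOutputStream message = new MyByteArrayOutputStream(1024);
        for (byte b : "PING\r\n".getBytes(StandardCharsets.ISO_8859_1)) {
            message.write(b); // stands in for bytes read from the socket
        }
        System.out.print(message.decode(StandardCharsets.ISO_8859_1));
    }
}
```

Keeping the decoding inside the subclass avoids handing the mutable internal array to callers.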

Alexander Pogrebnyak
What if the bytes encoding a UTF-8 character are split across packets?
starblue
+12  A: 

I would probably use an InputStreamReader wrapped around a BufferedInputStream, which in turn wraps the socket. I would then write code that processes a message at a time, potentially blocking for input. If the input is bursty, I might run on a background thread and use a concurrent queue to hold the messages.
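A rough sketch of that arrangement follows. It assumes messages are line-terminated and UTF-8 encoded, neither of which the question guarantees, and it uses a ByteArrayInputStream in place of socket.getInputStream(). The class name, method names, and the EOF sentinel are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueReaderDemo {
    // Sentinel marking end of stream; compared by identity, never by content.
    static final String EOF = new String("<eof>");

    // Start a background thread that turns the byte stream into queued messages.
    static Thread startReader(InputStream socketIn, BlockingQueue<String> queue) {
        Thread t = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new BufferedInputStream(socketIn), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) { // blocks until a full line arrives
                    queue.put(line);
                }
            } catch (IOException e) {
                // Sketch only: a real reader would log or report the failure
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                queue.offer(EOF); // always tell the consumer that no more messages will come
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Consume messages from the queue until the reader signals end of stream.
    static List<String> collect(byte[] wire) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        startReader(new ByteArrayInputStream(wire), queue);
        List<String> messages = new ArrayList<>();
        try {
            for (String m = queue.take(); m != EOF; m = queue.take()) {
                messages.add(m);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return messages;
    }

    public static void main(String[] args) {
        byte[] wire = "first\r\nsecond\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(collect(wire));
    }
}
```

The parsing code then only ever sees complete messages taken from the queue.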

Reading a buffer at a time and trying to convert it to characters is exactly what BufferedInputStream/InputStreamReader does. And it does so while paying attention to encoding, something that (as other people have noted) your solution does not.
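To illustrate the encoding point, the sketch below feeds the reader two "packets" that split a two-byte UTF-8 character between them. A SequenceInputStream stands in for the socket, and the class and method names are hypothetical. The reader holds the incomplete byte sequence until the rest arrives, so the character comes out whole.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SplitCharDemo {
    // Decode two consecutive byte chunks as one UTF-8 character stream.
    static String decodeTwoPackets(byte[] first, byte[] second) {
        InputStream wire = new SequenceInputStream(
                new ByteArrayInputStream(first), new ByteArrayInputStream(second));
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new BufferedInputStream(wire), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E9 is encoded in UTF-8 as the two bytes 0xC3 0xA9
        byte[] first = { 'h', (byte) 0xC3 };
        byte[] second = { (byte) 0xA9, '!' };
        String text = decodeTwoPackets(first, second);
        System.out.println(text.length()); // 3 chars: h, U+00E9, !
    }
}
```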

I don't know why you're focused on speed, but you'll find that the time to process data coming off a socket is far less than the time it takes to transmit over that socket.

kdgregory
+1 Finally a sane answer.
starblue
Also pay attention to the character encoding... If it's the "default" (is it Java or platform?) you must specify it in the constructor of the reader. Unless you have zillions of connections, go for a background thread per socket that injects complete messages into a queue.
KarlP