In my Java application, I am archiving TIBCO RV messages to a file as bytes.

I am writing a small utility app that will play the messages back. This way I can just create a TibrvMsg object from the bytes without having to parse the file and construct the object manually.

The problem I am having is that I am reading a file that was created on a Linux box, and attempting to run my app on a Windows machine. I get an error due to the different charset the file was written in.

So now, what I want to do is log each message in a specific charset (UTF-8), so that I don't care what platform I run my playback app in. The app should just read in the file knowing before-hand the charset the file is written in. I am planning on using java.nio packages for this, to transform the bytes from one charset to another.

Do I need to know what charset the TIBRV message bytes are encoded in to do the transformation? If so, how can I find this out?

+1  A: 

As this (admittedly rather old) mailing list message indicates, little is known about the internal structure of that network protocol. This might make it quite a challenge to do what you're after.

That said, if the messages are just binary blocks of data (as captured from the network), they shouldn't even have a charset. Charsets are for textual data, where it matters since a single character can be encoded in many different ways. Binary data is not composed of characters, so there cannot be an encoding in that sense.

unwind
What if I take the message bytes and write them to a file? Will they be written in the default platform charset?
Jose Chavez
I'd say you have that exactly backwards. Text data does not have a charset. A Charset is what you need to convert text data into a byte stream and vice versa. Since the network data is a byte stream, you need a charset to interpret any text it contains.
Michael Borgwardt
A: 

This is probably related to Java string encoding, not TIBRV. Though there's this in the documentation:

Strings and Character Encodings 

--------------------------------------------------------------------------------

Rendezvous software uses strings in several roles: 

* String data inside message fields
* Field names
* Subject names (and other associated strings that are not
  strictly inside the message)
* Certified delivery correspondent names
* Group names (fault tolerance)

All these strings (both in C and in wire format) use the character
encoding appropriate to the ISO locale of the sender. For example,
the United States is locale en_US, and uses the Latin-1 character
encoding (also called ISO 8859-1); Japan is locale ja_JP, and uses
the Shift-JIS character encoding. 

When two programs exchange messages within the same locale, strings
are always correct. However, when a message sender and receiver use
different character encodings, the receiving program must convert
between encodings as needed. Rendezvous software does not convert
automatically. 

EBCDIC 
For information about string encoding in EBCDIC environments,
see tibrv_SetCodePages().

So you might want to look at the locale of the machines.

Nikolai N Fetissov
Ouch. What a horribly misdesigned protocol...
Michael Borgwardt
Yep, the whole thing looks like a joke.
Nikolai N Fetissov
RV is designed for speed at the cost of some flexibility (if you care, you implement your own fields). Almost all users of it never use different encodings and keep everything at the most basic level everywhere.
ShuggyCoUk
And three extra user/kernel copies per message are helping the speed tremendously :)
Nikolai N Fetissov
Why the 3 transitions? All the data is in userspace when you receive the message callback. If you only want to take what you need from the network buffers, then drop TIBCO and roll your own (well, perhaps LBM, but I'm finding that a bit of a pain).
ShuggyCoUk
OK, my mistake - four, not three - from kernel to rvd and from rvd to kernel on the sender side, then same on the receiver.
Nikolai N Fetissov
If you're using the cloud based stuff, sure: then you're getting multicasting, independent buffers for different listeners, etc. If you just talk to one rvd for sender and receiver, that drops to 2...
ShuggyCoUk
A: 

Do I need to know what charset the TIBRV message bytes are encoded in to do the transformation?

Yes. A charset is a method of transforming text into a byte stream and vice versa. Your network data is a byte stream, so when you interpret parts of it as text, you ARE (implicitly or explicitly) using a charset - the question is whether it is the correct one.
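To make that concrete, here is a minimal Java sketch that decodes the same two bytes with two different charsets. The class and method names are made up for illustration, and nothing here is TIBRV-specific; it only uses the standard `String(byte[], Charset)` constructor.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any TIBCO API.
class DecodeDemo {
    // Interpret the given bytes as text using an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+00E9 (e with acute accent)
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length());   // 1: one character
        System.out.println(asLatin1.length()); // 2: two unrelated characters

        // new String(bytes) with no charset argument uses this default instead.
        System.out.println(Charset.defaultCharset());
    }
}
```

The bytes never change; only the interpretation does. Code that calls `new String(bytes)` without a charset picks up the platform default, which may differ between the machine that wrote the file and the one that reads it.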

Transforming bytes from one charset to another basically means converting them to text using one charset and then back to bytes using another. Note that this can result in the length of the data changing, since many charsets use more than 1 byte for some characters. In the context of network messages, this could be problematic when it invalidates length fields or causes text fields to overflow. It's probably better not to do any transformation and instead teach the reading app to deal with varying charsets.
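A small Java sketch of the length problem, assuming for the sake of the example that the input is known to be ISO-8859-1 text (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: re-encodes text bytes from ISO-8859-1 to UTF-8.
class TranscodeLength {
    // Decode with the source charset, then encode with the target charset.
    static byte[] latin1ToUtf8(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the final character is the single byte 0xE9
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        byte[] utf8 = latin1ToUtf8(latin1);

        // In UTF-8 the accented character takes two bytes (0xC3 0xA9).
        System.out.println(latin1.length + " -> " + utf8.length); // 4 -> 5
    }
}
```

Any length field that described the original 4 bytes would now be wrong, which is why transcoding inside a binary message is risky.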

If so, how can I find this out?

Look at the protocol specification.

Michael Borgwardt
+2  A: 

You are taking opaque data and, it would appear, attempting to write it to a file as textual data without escaping the non-textual portions of it. (Alternatively, you are writing it as raw bytes and then trying to read it as if it were character-based, which is much the same problem.) This is flawed from the very start.

Opaque data should be treated as meaningless and simply stored without modification to give back to an API that does know how to deal with it. If the data must be stored in a textual form then you must losslessly convert the bytes into text. Appropriate encodings are things like base64. Encoding in the sense of character set encoding is NOT lossless if you apply it to raw binary data.
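As a rough illustration in Java (class and method names are hypothetical), compare a round trip through base64 with a round trip through a character set for bytes that are not valid text:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class comparing two ways of turning bytes into text.
class LosslessText {
    // Bytes -> base64 text -> bytes. Every byte value survives.
    static byte[] viaBase64(byte[] raw) {
        String text = Base64.getEncoder().encodeToString(raw);
        return Base64.getDecoder().decode(text);
    }

    // Bytes -> "text" via UTF-8 decoding -> bytes. Invalid sequences are
    // silently replaced with U+FFFD, so the original bytes are lost.
    static byte[] viaUtf8(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, (byte) 0xFF, (byte) 0xFE, 0x41};
        System.out.println(Arrays.equals(raw, viaBase64(raw))); // true
        System.out.println(Arrays.equals(raw, viaUtf8(raw)));   // false
    }
}
```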

Simply storing the bytes in a file as bytes (not characters) along with a fixed length prefix indicating the length of the message and the subject it was sent on is sufficient to replay RV messages through the system.
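Something along these lines would do. This is only a sketch: the record layout (4-byte length, subject as UTF-8, 4-byte length, payload) and all names are my own assumptions, and the payload is whatever byte array your RV API hands you, stored untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical archive format: [int len][subject bytes][int len][payload bytes]
class MessageArchive {
    // Append one record. The payload is written exactly as given.
    static void writeRecord(DataOutputStream out, String subject, byte[] payload)
            throws IOException {
        byte[] subjectBytes = subject.getBytes(StandardCharsets.UTF_8); // assumed encoding
        out.writeInt(subjectBytes.length);
        out.write(subjectBytes);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Read one length-prefixed block (used for both subject and payload).
    static byte[] readBlock(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] block = new byte[length];
        in.readFully(block); // blocks until all bytes are read, or throws EOFException
        return block;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {0x00, (byte) 0xFF, 0x10, (byte) 0x80}; // stand-in for message bytes

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeRecord(out, "demo.subject", payload);
        }

        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            String subject = new String(readBlock(in), StandardCharsets.UTF_8);
            byte[] restored = readBlock(in);
            System.out.println(subject);                          // demo.subject
            System.out.println(Arrays.equals(payload, restored)); // true
        }
    }
}
```

`DataOutputStream.writeInt` always writes big-endian, so the file reads the same on Linux and Windows. In the real app you would wrap a `FileOutputStream` and `FileInputStream` instead of the in-memory buffers.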

As for any text-based fields inside the message: if the encoding matters (I strongly suggest designing the app so that it doesn't), then on replay you have the same problem you would have had at the original receipt time, which is to convert from the source encoding to the desired encoding (hopefully using exactly the same code). So this should be a non-issue in relation to the replaying.

ShuggyCoUk
A: 

Read everything into a byte[] from an InputStream, then write the byte[] to a FileOutputStream.

NO Reader or Writer should be involved; they do character conversion, and that is wrong here.

Stay away from java.nio until you understand java.io.
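A minimal sketch of what I mean, using only java.io (the class name and the command-line handling are just for illustration):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: copies bytes verbatim, with no Reader/Writer involved.
class RawCopy {
    // Copy everything from in to out and return the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only the bytes actually read
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RawCopy <source> <target>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            System.out.println(copy(in, out) + " bytes copied");
        }
    }
}
```

Since no charset is ever applied, the file contents are identical on every platform.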

KarlP