Hi,

I have a REST webservice that listens to POST requests and grabs hold of an XML payload from the client and stores it initially as an InputStream i.e. on the Representation object you can call getStream().

I want to utilise the XML held in the InputStream and I am beginning to think it would be wise to persist it, so I can interrogate the data multiple times, because once you have read through the stream it is exhausted and cannot be read again. So I thought about converting the InputStream to a string. This does not look like a good idea, as DocumentBuilder.parse() from the javax.xml.parsers library will only allow you to pass:

  • InputStreams
  • Files
  • URLs
  • SAX InputSources

not strings.

What should I really be doing here with InputStreams in relation to parsing XML out of it? Bearing in mind I will want to re-interrogate that XML in future processes by the code.

A: 

java.io.StringReader will allow you to use InputSource.
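A minimal sketch of that route, assuming the XML is already held in a String whose characters were decoded correctly. The class and method names (`StringXml`, `parse`) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parses XML that is already in a String.
final class StringXml {
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // InputSource accepts a Reader, so a StringReader bridges String -> parser.
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Because the parser receives characters rather than bytes here, it cannot use the encoding declared in the XML prologue to decode anything; that decoding has already happened by the time the String exists.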

You might want to store the data in a byte[] and then read it with ByteArrayInputStream. If it's particularly large, you might want to consider compression. This can be read out with GZIPInputStream, which should often be wrapped in a BufferedInputStream.
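Roughly, the compressed variant could look like the sketch below. It assumes the payload has already been read into a byte[]; `CompressedPayload` and its methods are hypothetical names, not part of any library.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: keeps a payload gzip-compressed in memory.
final class CompressedPayload {
    // Compress the raw bytes; closing the GZIPOutputStream writes the gzip trailer.
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        }
        return buffer.toByteArray();
    }

    // Open a fresh, buffered stream over the decompressed content.
    static InputStream open(byte[] compressed) throws IOException {
        return new BufferedInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed)));
    }

    // Decompress fully, mainly to show that the round trip is lossless.
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = open(compressed)) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```

The stream returned by `open` can be handed straight to DocumentBuilder.parse(), and a new one can be opened for every later pass over the data.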

Tom Hawtin - tackline
-1 because you NEVER want to read wild XML with a StringReader, unless you're getting the encoding externally (which may be the case in web services).
kdgregory
sorry, make that any Reader
kdgregory
@kdgregory - Is that because the end-of-line might vary between encodings?
Don Branson
I believe it's because (stupidly) the XML header can specify character set. But if you've got a String, you already have that problem and there is no additional damage caused by Reader.
Tom Hawtin - tackline
(RTF gets it even worserer.)
Tom Hawtin - tackline
@Tom lol. so, i think my suggestion of reading into a byte array and handing a BAIS to the parser should avoid this issue.
Don Branson
No, that's actually the least of your worries. Worse is when your XML is encoded as UTF-8 (the standard default), and you wrap it with an InputStreamReader that assumes it's ISO-8859-1 (the platform default for US Linux installations) or Windows-1252 (the platform default for US Windows installs)
kdgregory
@Tom - the problem is that the OP doesn't have a string, he/she has an InputStream (or "getStream()" has a very bad name)
kdgregory
@Tom #2: it's really not stupid for the XML document to specify its own encoding: the world isn't Java, or Unicode, and a file is ultimately just a bunch of bytes. Either you need to externalize knowledge about those bytes (bad for interop), or you specify internally, or you make it part of the spec
kdgregory
(cont) XML has the two latter options: you can specify encoding, or you are required to use UTF-8 (or UTF-16, but that's a whole 'nother mess)
kdgregory
Why would you deliberately use an `InputStreamReader` with the wrong encoding (unless you've been deliberately given the wrong encoding)?
Tom Hawtin - tackline
Xml is a text format that uses encoded text to specify what that encoding is. That is dumb. Either use a binary format or specify encoding out of stream.
Tom Hawtin - tackline
Re using an InputStreamReader with the wrong encoding: you have to specify the encoding. Which means that you either get it from the Content-Type header, or you guess. Better is to use the raw InputStream and let the parser decide.
kdgregory
Re XML as text format: actually, it isn't. The spec says that it can deal with UTF-8 (a binary format) or UTF-16 (another binary format), or whatever encoding you specify in the prologue. "Text format" has no meaning in an I18N world (is it ASCII? EBCDIC? ISO-8859-1?)
kdgregory
A: 

I think you should look into some structures better suited for preserving encodings (i.e. more encoding-agnostic). For low-level structures, consider byte[] (but be careful with memory usage!) or you could try to design a data type that fits your needs.

You could read the InputStream into a ByteArrayOutputStream (using one of the read() methods) and extract the byte[] from there.
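A minimal sketch of that read loop, assuming the whole payload fits comfortably in memory. `StreamBytes.readAll` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: drains an InputStream into a byte[].
final class StreamBytes {
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 at end of stream; it may return fewer bytes
        // than the buffer holds, so only write the n bytes actually read.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

The loop does not need to know the size of the stream up front, which is what makes it safe for network-backed streams.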

Steen
+1  A: 

Generally, when we're talking persistence, we're talking about writing it to disk or other media. There's a performance hit there, and you have to think about disk space concerns. You'll want to weigh that against the value of having that XML around for the long term.

If you're just talking about holding it in memory (which sounds like what you're asking), then you could allocate a byte array, and read the whole thing into the byte array. Then you can use ByteArrayInputStream to read and re-read that stream.

The cost with that is two-fold. First, you're holding a copy in memory, and you need to weigh that against your scalability requirements. Second, parsing XML is somewhat expensive, so it's best to parse it once only, if possible, and save the result in an object.
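As a rough illustration of the in-memory approach, the sketch below keeps the raw bytes and hands out a fresh ByteArrayInputStream for each pass. `ReusablePayload` is a hypothetical class, and it assumes the bytes were already read from the request.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical holder: keeps the raw request bytes so they can be re-read.
final class ReusablePayload {
    private final byte[] bytes;

    ReusablePayload(byte[] bytes) {
        this.bytes = bytes.clone(); // defensive copy
    }

    // Each call returns an independent stream positioned at the start.
    InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    // The parser sees raw bytes, so it can honour the encoding
    // declared in the XML prologue.
    Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(newStream());
    }
}
```

Calling `parse()` twice works, but it repeats the parsing cost each time, which is the second cost mentioned above.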

Edit:

To allocate and read the byte array, you can often (but not always) rely on InputStream's available() method to tell you how much to allocate, and wrap the InputStream with a DataInputStream so that you can call readFully() to suck the whole thing into the byte array with one call.

Edit again:

Read Steen's comment below. He's right that it's a bad idea to use available() in this case.

Don Branson
In a live environment, _never_ use available() as a means of getting the 'size' of the Stream. Heck, you shouldn't even use it in your backyard ;)
Steen
Instead use read() as described in my elaborated post somewhere on this page (I never can get used to Stack Overflow's floating answers)
Steen
It's fine to use against FileInputStream, but it's problematic when used against network-backed streams. I didn't state strongly enough my reluctance to use it in this case.
Don Branson
+1  A: 

I would advise to use the Apache Commons IO library. The IOUtils class contains many convenience methods to convert InputStreams to String and vice versa.

Johan Pelgrim
Good advice. Saves a few lines over reading it yourself, as I described in my answer.
Don Branson
But I don't think I should convert it to a string - so would IOUtils be of any use in this case?
Vidar
You shouldn't convert to a string unless you know the encoding. However, IOUtils also gives you a TeeInputStream, so you could save a copy as bytes.
kdgregory
A: 

If you want to use the XML multiple times, why not parse it once from the InputStream (which is the heavy work), and then hold on to the Document returned?
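Something along these lines, as a sketch only: the stream is consumed exactly once and every later query runs against the retained Document. `ParsedPayload` is a hypothetical class, and using XPath for the queries is just one option among several.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical wrapper: parses the stream once, then answers many queries.
final class ParsedPayload {
    private final Document doc;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    ParsedPayload(InputStream in) throws Exception {
        // The only place the InputStream is read.
        this.doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
    }

    // Evaluate an XPath expression against the held Document
    // and return the result as a string.
    String value(String expression) throws XPathExpressionException {
        return xpath.evaluate(expression, doc);
    }
}
```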

Avi
+2  A: 

If you have an InputStream, and want to use it as an XML document, then why aren't you simply parsing it and passing around the Document object? If you want to persist this object, then use a serializer to write it back out as text.
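One way to do the write-back with the standard library is the identity Transformer from javax.xml.transform, sketched below. `DocumentWriter` is a hypothetical name, and choosing UTF-8 for the output is an assumption made here.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper: serializes a DOM Document back to bytes.
final class DocumentWriter {
    static byte[] toBytes(Document doc) throws TransformerException {
        // A Transformer created without a stylesheet copies input to output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Writing to an OutputStream (not a Writer) lets the serializer
        // do the encoding itself, matching what the prologue declares.
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }
}
```

The resulting bytes can be written to disk or a database and parsed again later from an InputStream.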

As I noted in my comment to Tom Hawtin, encoding is very important when dealing with XML. Rather than write a long posting here that may miss your specific situation, here's an article that I wrote.

Edit: actually, since my article doesn't specifically talk about web services, I should dive into it a little here. There are two places where the content encoding can be specified: in the XML prologue, or in the Content-Type response header. According to the XML spec, the former is the one that you want to use, and it's what the parser will use. In most cases, that doesn't matter: a webservice set up by a person who doesn't know the spec will typically use a text/xml without a character set specification (which is incorrect but probably not going to cause harm). If they do things correctly, they'll specify application/xml, with utf-8 encoding. However, you should verify what you're getting, so that you don't end up with some strange encoding that the parser can't handle.

kdgregory
I think passing the Document object seems the most painless thing to do here - and slightly obvious - sorry I don't think my brain was working at all this morning!
Vidar