This is a sample of the XML that I am working on:

<smsq>
  <sms>
  <id>96</id>
  <to>03333560511</to>
  <msg>  danial says: hahaha <space> nothing.
  </msg>
  </sms>
</smsq>

Please note that the <msg> tag can contain other tags (which should not be parsed), and I had to write a DTD for that. The DTD was something like this:

<!DOCTYPE smsq [
  <!ELEMENT sms (mID,to,msg,type)>
  <!ELEMENT mID (#PCDATA)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT msg (CDATA)>
]>

But the problem is that the XML parser still goes into the <msg> tag and says that the <space> tag should be closed with a </space> tag. I just want to fetch the data as it is from the XML, and I do not want to parse msg any further.

Please help me resolve the problem and tell me if this can be done with DTDs.

Thanks!

+1  A: 

Firstly, the sample XML is not really XML, as the "space" tag is not closed.

Secondly, it looks like the reason for not wanting to parse the "space" tag is because it's not really xml - just text that looks like xml. The text should be either escaped/encoded or enclosed in CDATA tags.
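
As a minimal sketch of the CDATA option, assuming the standard JAXP DOM parser that ships with Java: the class name CdataMsgDemo and the helper readMsg are made up for illustration. The parser hands the CDATA content back as plain text, so the literal "<space>" survives without being treated as a tag.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative, not from the question.
final class CdataMsgDemo {

    // Returns the text content of the first <msg> element in the given XML string.
    static String readMsg(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("msg").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The message body is wrapped in a CDATA section, so <space> is plain text.
        String xml = "<smsq><sms><id>96</id><to>03333560511</to>"
                + "<msg><![CDATA[danial says: hahaha <space> nothing.]]></msg>"
                + "</sms></smsq>";
        System.out.println(readMsg(xml));
    }
}
```

Note that no manual decoding step is needed on the reading side; the text comes out of getTextContent() with the angle brackets intact.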

Lastly, if what you want to parse really is XML and you only want to parse the first-level tags, I wouldn't bother with a real XML parser. I'd create my own ultra-simple parser. All it has to do is parse first-level nodes, which shouldn't be too hard.

Good luck!

Per Hornshøj-Schierbeck
So it can be done with a DTD as long as the data in the nodes are in CDATA sections...
Per Hornshøj-Schierbeck
@Hojou: no, content inside CDATA is not parsed and cannot be defined with a DTD. However, if you want to use a DTD and define non-closed (i.e., open) elements, you *can*, but it is not XML anymore. It is an SGML implementation, which is much harder to work with (like classic HTML).
Abel
@Abel: that is what I meant: if it's in CDATA he won't have to worry about the parser trying to parse it.
Per Hornshøj-Schierbeck
@Hojou: true, but `<msg>bla <![CDATA[your<space>text]]></msg>` is *not equal to* `<msg>bla your<space>text</msg>`. Instead, it is equal to `<msg>bla your&lt;space&gt;text</msg>`, which is technically something totally different (i.e., a plain text node versus an unclosed element node). If the asker wants to change the *semantic meaning* of the document, it is fine, but if applications expect a `<space>` node and should react to it, it is not fine. Instead, the source must be corrected.
Abel
@Abel: yep, I was just trying to figure out what he was trying to do. It looked like the <space> "tag" was user input, in which case it should be escaped or enclosed in CDATA.
Per Hornshøj-Schierbeck
@Hojou Only you solved the real problem. Yes, the <space> tag does not have to be parsed or used by the end application. I just needed all the data inside the msg tags. But when I do it with CDATA it comes out like &lt; and &gt;. Can you tell me if these can be replaced with actual < and > when parsing?
Dee Jay'
@Dee Jay: Check out http://stackoverflow.com/questions/711672/c-htmldecode-without-system-web-possible or http://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c - Since you now encode text for xml/html you need to decode it when you get it out :)
Per Hornshøj-Schierbeck
+3  A: 

DTD can't help you with this problem. DTD is by no means required (though it is quite handy to have it).

The document you posted above is not a valid XML document. Period. That's the way it is, and no reasonable XML parser will parse it for you without raising the error.
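
To see this for yourself, here is a small sketch, assuming the standard JAXP DOM parser in Java. The class WellFormedCheck and the method isWellFormed are names invented for this example. It reports whether a string parses as well-formed XML, and the unclosed <space> makes the check fail.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the names are illustrative only.
final class WellFormedCheck {

    // Returns true if the string parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations are reported as fatal SAX errors.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<msg>hahaha <space> nothing.</msg>"));
        System.out.println(isWellFormed("<msg>hahaha &lt;space&gt; nothing.</msg>"));
    }
}
```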

What you can do, though, is replace the < symbol with the &lt; XML entity.

Ihor Kaharlichenko
no offense, but I have written this in simple English, "I do not want to parse some tags in XML" period.
Dee Jay'
+4  A: 

You can't make a DTD that makes buggy XML magically not buggy. The XML is not well-formed, so it can never be valid, as well-formedness is a prerequisite of validity (validity isn't even important here AFAICT). It's analogous to how the words in an English sentence all have to be English words before it can be a grammatically correct English sentence.

<space> is not closed. It should either have a following </space> inside the <msg>, be replaced with <space/>, or, if by saying you don't want it to be parsed you mean you want the actual text "<space>" in there, be encoded as such (i.e. &lt;space&gt;).
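
For the encoding step, a rough sketch in Java of what the producer would do before writing the text into <msg>. The class XmlTextEscaper and the method escapeText are made-up names for this illustration, and it only covers the characters that matter in element text, not attribute values.

```java
// Hypothetical helper; escapes a string for use as XML element text.
final class XmlTextEscaper {

    // Replaces &, < and > with their predefined XML entities.
    static String escapeText(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break; // optional in most positions, but harmless
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String body = "danial says: hahaha <space> nothing.";
        System.out.println("<msg>" + escapeText(body) + "</msg>");
    }
}
```

On the reading side, an XML parser turns the entities back into < and > by itself when you ask for the element's text, so no separate decode call should be needed there.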

Jon Hanna
can you tell me the encoding function in php and decoding function in java please?
Dee Jay'
+1  A: 

All XML tags have to be closed, either like <tag></tag> or <tag />.

If you want the <space> tag to be parsed as the text value of a tag, and not as a child tag, use &lt; and &gt; instead of < and >:

&lt;space&gt;
Frxstrem
Just a note, `>` does not need to be escaped (though it is quite common to do so).
Abel
A: 

I would isolate the solution to your problem into a method and deal with it simply for now. After all, you may not have control over the correctness of the message content.

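// Returns the raw text between the first "<msg>" and the last "</msg>", without any XML parsing.
private static String getMessage(String msg) {
    int start = msg.indexOf("<msg>") + "<msg>".length();
    return msg.substring(start, msg.lastIndexOf("</msg>"));
}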

You may enhance it later, as more use cases become available.

Edit: If someone puts an "msg" element in the content, then it still works

Kyle Lahnakoski
And then when someone puts an "msg" element in the content? If they don't have the control necessary to fix the buggy XML, they need to start by defining **precisely** how it is likely to be buggy.
Jon Hanna
It is very unlikely that programmers dealing with XML will ever process it as strings. If they do, it either is not XML, or they made a huge mistake, or both. There are only very few use cases to process XML as strings and this is not one of them (to get your example working, you first have to fix the XML, then parse the XML, then go to the element, then transfer that element into text, then use your function (which needs fixing, as Jon says) then parse it back into XML if needed).
Abel
Jon Hanna: The method looks for the first "<msg>" and last "</msg>". Adding "msg" tags to the message content does not break this code. Abel: The changes you propose "to get [my] example working" do not seem to be any more effective.
Kyle Lahnakoski