ansaurus

Question

Answer 1

+5 A:

If you're going to accept XML that isn't well formed in the first place, I think catching the exception is the best way to handle it.

lod3n 2009-09-29 00:54:50

I beat you by 2 seconds, Ha!

Martin 2009-09-29 00:55:49

Ha, ha! You win!

lod3n 2009-09-29 00:58:10

You might as well edit the post and put "First!"

Spence 2009-09-29 00:59:03

I've updated the question with example data. We can assume it is well formed except for missing root element.

Si 2009-09-29 01:03:46

Answer 2

+2 A:

If your goal is reliability, then the best option is to use XmlDocument.LoadXml to determine whether it's valid XML. A full parse of the data may be expensive, but it's the only way to tell reliably. Otherwise, any character in the buffer that you don't examine could make the data illegal XML.

JaredPar 2009-09-29 00:55:26

I don't think `XmlDocument` is a good choice here - he doesn't need a DOM, merely to validate. Looks like `XmlReader` and `try { while (reader.Read()) { } } catch (XmlException ex) { ... }` would be a more lightweight approach.

Pavel Minaev 2009-09-29 00:57:54
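
The XmlReader loop from the comment above could be sketched roughly as follows. This is only an illustration: the class name `XmlReaderCheck` and the method name `IsWellFormedDocument` are made up for this example, and it assumes the input is a non-null string that should parse as a complete document with a single root element.

```csharp
using System.IO;
using System.Xml;

public static class XmlReaderCheck
{
    // Hypothetical helper: returns true if the text parses as a
    // well-formed XML document. XmlReader streams through the input,
    // so no DOM is built in memory.
    public static bool IsWellFormedDocument(string text)
    {
        try
        {
            using (var reader = XmlReader.Create(new StringReader(text)))
            {
                // Read every node; any well-formedness error throws.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

This only answers yes or no. If the XML has to be modified afterwards, a DOM (or a second pass) is still needed.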

@Pavel, but I also have to modify the Xml to redact the data, hence the need for XmlDocument.

Si 2009-09-29 01:08:34

Agreed, but if I combine approaches (as per Samuel's idea), then I should catch 99% of the plain text with the StartsWith and EndsWith code, and leave the other 1% to be caught if LoadXml throws XmlException.

Si 2009-09-29 01:34:32

Answer 3

A:

If the XML contains no root element (i.e. it's an XML fragment, not a full document), then the following would be a perfectly valid sample as well, but it wouldn't match your detector:

foo<bar/>baz

In fact, any text string would be a valid XML fragment (consider the case where the original XML document was just a root element wrapping some text, and you take the root element tags away)!

Pavel Minaev 2009-09-29 00:55:55
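
To see the point about fragments in practice, here is a small sketch that parses with `ConformanceLevel.Fragment`, which lets XmlReader accept content without a single root element. The names `FragmentCheck` and `IsWellFormedFragment` are invented for this example, and the input is assumed to be non-null.

```csharp
using System.IO;
using System.Xml;

public static class FragmentCheck
{
    // Hypothetical helper: returns true if the text parses as a
    // well-formed XML fragment (top-level text and multiple
    // top-level elements are allowed).
    public static bool IsWellFormedFragment(string text)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(text), settings))
            {
                // Read every node; any well-formedness error throws.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this check, `foo<bar/>baz` is accepted, and so is ordinary text with no markup at all, which is why a fragment-level parse cannot separate XML from plain text on its own. Text containing a stray `<` or `&` would still be rejected, so "any text string" holds only for text without those characters.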

Answer 4

+1 A:

Depends on how accurate a test you want. Considering that you already don't have the official <?xml declaration, you're already trying to detect something that isn't strictly XML. Ideally you'd parse the text with a full XML parser (LoadXml, as you suggest); anything it rejects isn't XML. The question is, do you care if you accept a non-XML string? For instance, are you OK with accepting

  <the quick brown fox jumped over the lazy dog's back>

as XML and stripping it? If so, your technique is fine. If not, you have to decide how tight a test you want and code a recognizer with that degree of tightness.

Ira Baxter 2009-09-29 00:57:59

Yes, that's the sort of thing I'm afraid of hitting.

Si 2009-09-29 01:06:51

Answer 5

+1 A:

How is the data coming to you? What is the other type of data surrounding it? Perhaps there is a better way; perhaps you can tokenise the data you control, and then infer that anything that is not within those tokens is XML, but we'd need to know more.

Failing a cute solution like that, I think what you have is fine (for validating that it starts and ends with those characters).

We need to know more about the data format really.

Noon Silk 2009-09-29 00:59:53

Answer 6

+2 A:

One possibility is to mix both solutions. You can use the check from your redact method and try to load the string only inside the if. This way, you'll only try to load what is likely to be well-formed XML, and you'll discard most of the non-XML entries without paying for an exception.

Samuel Carrijo 2009-09-29 01:14:41

Good idea, thanks.

Si 2009-09-29 01:17:05

I've marked this as the most appropriate answer, because I think it solves my problem in the most efficient way. For most cases, StartsWith < and EndsWith > will filter out non-xml data, and for rare situations like Ira Baxter describes, catching the XmlException will solve those.

Si 2009-09-29 04:49:19
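
A minimal sketch of the combined approach, under some assumptions: the names `XmlDetector` and `TryLoadXml` are hypothetical, the cheap test is just "starts with `<` and ends with `>`" after trimming whitespace, and the input is expected to have a single root element (a fragment without one would have to be wrapped in a root element first).

```csharp
using System.Xml;

public static class XmlDetector
{
    // Hypothetical helper: a cheap character test first, then a full
    // parse only for strings that pass it. Returns true and the loaded
    // document when the text is well-formed XML.
    public static bool TryLoadXml(string text, out XmlDocument doc)
    {
        doc = null;
        if (string.IsNullOrEmpty(text))
            return false;

        string trimmed = text.Trim();

        // Prefilter: most plain text is rejected here without an exception.
        if (trimmed.Length < 2 ||
            trimmed[0] != '<' ||
            trimmed[trimmed.Length - 1] != '>')
        {
            return false;
        }

        try
        {
            var candidate = new XmlDocument();
            candidate.LoadXml(trimmed); // throws XmlException if malformed
            doc = candidate;
            return true;
        }
        catch (XmlException)
        {
            // Rare case: looked like XML but did not parse.
            return false;
        }
    }
}
```

A string such as `<the quick brown fox jumped over the lazy dog's back>` passes the prefilter but fails in LoadXml, so the exception cost is only paid for look-alikes of that kind.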

Answer 7

A:

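// Requires: using System.Xml;
try
{
    // LoadXml throws XmlException if myString is not well-formed XML.
    XmlDocument myDoc = new XmlDocument();
    myDoc.LoadXml(myString);
}
catch (XmlException)
{
    // Not well-formed XML: handle myString as plain text here.
}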

Evgeny 2009-09-29 02:09:29

Of course, and this is stated in the question. But catching exceptions is expensive when I know a lot of the data is not xml.

Si 2009-09-29 04:50:35


How to tell if a string is xml?
