tags:

views:

472

answers:

7

We have a string field which can contain XML or plain text. The XML contains no <?xml header, and no root element, i.e. is not well formed.

We need to be able to redact XML data, emptying element and attribute values, leaving just their names, so I need to test if this string is XML before it's redacted.

Currently I'm using this approach:

string redact(string eventDetail)
{
    string detail = eventDetail.Trim();
    if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
    ...

Is there a better way?

Are there any edge cases this approach could miss?

I appreciate I could use XmlDocument.LoadXml and catch XmlException, but this feels like an expensive option, since I already know that a lot of the data will not be in XML.

Here's an example of the XML data, apart from missing a root element (which is omitted to save space, since there will be a lot of data), we can assume it is well formed:

<TableName FirstField="Foo" SecondField="Bar" /> 
<TableName FirstField="Foo" SecondField="Bar" /> 
...

Currently we are only using attribute based values, but we may use elements in the future if the data becomes more complex.

SOLUTION

Based on multiple comments (thanks guys!)

string redact(string eventDetail)
{
    if (string.IsNullOrEmpty(eventDetail)) return eventDetail; //+1 for unit tests :)
    string detail = eventDetail.Trim();
    if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
    XmlDocument xml = new XmlDocument();
    try
    {
        xml.LoadXml(string.Format("<Root>{0}</Root>", detail));
    }
    catch (XmlException e)
    {
        log.WarnFormat("Data NOT redacted. Caught {0} loading eventDetail {1}", e.Message, eventDetail);
        return eventDetail;
    }
    ... // redact
+5  A: 

If you're going to accept not well formed XML in the first place, I think catching the exception is the best way to handle it.

lod3n
I beat you by 2 seconds, Ha!
Martin
Ha, ha! You win!
lod3n
You might as well edit the post and put "First!"
Spence
I've updated the question with example data. We can assume it is well formed except for missing root element.
Si
+2  A: 

If your goal is reliability then the best option is to use XmlDocument.LoadXml to determine if it's valid XML or not. A full parse of the data may be expensive but it's the only way to reliably tell if it's valid XML or not. Otherwise any character you don't examine in the buffer could cause the data to be illegal XML.

JaredPar
I don't think `XmlDocument` is a good choice here - he doesn't need DOM, merely to validate. Looks like `XmlReader` and `try { while (reader.Read(); } catch(XmlException ex) { ... }` would be a more lightweight approach.
Pavel Minaev
@Pavel, but I also have to modify the Xml to redact the data, hence the need for XmlDocument.
Si
Agreed, but if I combine approaches (as per Samuel's idea), then I should catch 99% of the plain text with the StartsWith and EndsWith code, and leave the other 1% to be caught if LoadXml throws XmlException.
Si
A: 

If the XML contains no root element (i.e. it's an XML fragment, not a full document), then the following would be perfectly valid sample, as well - but wouldn't match your detector:

foo<bar/>baz

In fact, any text string would be valid XML fragment (consider if the original XML document was just the root element wrapping some text, and you take the root element tags away)!

Pavel Minaev
+1  A: 

Depends on how accurate a test you want. Considering that you already don't have the official <xml, you're already trying to detect something that isn't XML. Ideally you'd parse the text by a full XML parser (as you suggest LoadXML); anything it rejects isn't XML. The question is, do you care if you accept a non-XML string? For instance, are you OK with accepting

  <the quick brown fox jumped over the lazy dog's back>

as XML and stripping it? If so, your technique is fine. If not, you have to decide how tight a test you want and code a recognizer with that degree of tightness.

Ira Baxter
Yes, that's the sort of thing I'm afraid of hitting.
Si
+1  A: 

How is the data coming to you? What is the other type of data surrounding it? Perhaps there is a better way; perhaps you can tokenise the data you control, and then infer that anything that is not within those tokens is XML, but we'd need to know more.

Failing a cute solution like that, I think what you have is fine (for validating that it starts and ends with those characters).

We need to know more about the data format really.

Noon Silk
+2  A: 

One possibility is to mix both solutions. You can use your redact method and try to load it (inside the if). This way, you'll only try to load what is likely to be a well-formed xml, and discard most of the non-xml entries.

Samuel Carrijo
Good idea, thanks.
Si
I've marked this as the most appropriate answer, because I think it solves my problem in the most efficient way. For most cases, StartsWith < and EndsWith > will filter out non-xml data, and for rare situations like Ira Baxter describes, catching the XmlException will solve those.
Si
A: 
try
{
    XmlDocument myDoc = new XmlDocument();
    myDoc.LoadXml(myString);
}
catch(XmlException ex)
{
    //take care of the exception
}
Evgeny
Of course, and this is stated in the question. But catching exceptions is expensive when I know a lot of the data is not xml.
Si