I get XML from a web service in the format below, and I want to clean it up (remove the extra "\" and "\n" characters) before working with it. I am currently using the regular expression below to match them. However, only the "\n" characters are removed; the "\" characters that sit between the equals signs and the double quotation marks persist.

What do you advise me to do?

private string ValidateXml(string dirtyXml) {
    Regex regex = new Regex(@"[\\\][\n]");
    var cleanXml = regex.Replace(dirtyXml, "");
    return cleanXml;
}

"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n\n<ISBNdb server_time=\"2010-01-28T11:31:08Z\">\n<BookList total_results=\"1\" page_size=\"10\" page_number=\"1\" shown_results=\"1\">\n<BookData book_id=\"quantitative_techniques\" isbn=\"0826458548\" isbn13=\"9780826458544\">\n<Title>Quantitative techniques</Title>\n<TitleLong></TitleLong>\n<AuthorsText>Terry Lucey</AuthorsText>\n<PublisherText publisher_id=\"continuum\">London : Continuum, 2002.</PublisherText>\n</BookData>\n</BookList>\n</ISBNdb>\n"
A: 

You don't really need a regex for this; a couple of calls to String.Replace will do.

This should do the trick:

var cleanXml = dirtyXml.Replace("\\n", "").Replace("\\\"", "\"");
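To see what that call does, here is a minimal sketch. It assumes the string really contains literal backslash characters (a backslash followed by n, and a backslash before each quote) rather than real newlines. The class name and the shortened sample input are made up for illustration.

```csharp
using System;

public static class ReplaceDemo
{
    // Removes the two-character sequence backslash + 'n',
    // then turns backslash + quote into a plain quote.
    public static string Clean(string dirtyXml)
    {
        return dirtyXml.Replace("\\n", "").Replace("\\\"", "\"");
    }

    public static void Main()
    {
        // Verbatim string: every backslash is a literal character, "" is one quote.
        string dirtyXml = @"<Book isbn=\""0826458548\"">\n<Title>Quantitative techniques</Title>\n</Book>\n";

        // prints: <Book isbn="0826458548"><Title>Quantitative techniques</Title></Book>
        Console.WriteLine(Clean(dirtyXml));
    }
}
```

Note that if the string holds real newline and quote characters, and the backslashes only show up in a debugger's escaped display, there is nothing for this call to replace.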
Nick Higgs
Hey Nick, it doesn't quite fix it. The characters are still in there.
simplyme
+1  A: 

Your regex is a bit odd. It is a single character class, so it matches any one of the following characters:

  • \\ single backslash character
  • \] single ] character
  • [ single [ character
  • \n newline character
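The list above can be checked with a short sketch. The class and method names are hypothetical; the pattern is the one from the question, unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // The pattern from the question: one character class, not two.
    private static readonly Regex Original = new Regex(@"[\\\][\n]");

    // Removes every single character that the class matches.
    public static string Strip(string s)
    {
        return Original.Replace(s, "");
    }

    public static void Main()
    {
        // Input holds a backslash, '[', ']' and a real newline; each is removed on its own.
        Console.WriteLine(Strip("a\\b[c]d\ne")); // prints: abcde

        // A literal backslash followed by 'n' loses only the backslash.
        Console.WriteLine(Strip("x\\ny"));       // prints: xny
    }
}
```

The second call shows why this pattern cannot remove a literal \n sequence as a unit: the class only ever consumes one character at a time, and the letter n is not in it.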

The following regex will match what you described:

@"\\n?"

It matches either the literal two-character sequence \n or a lone \. Note that the backslash will match even when it is not followed by a quote. To match only the backslashes that are followed by a quote, you can use this pattern:

@"(\\n)|(\\(?=""))"
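To compare the two patterns above, here is a minimal sketch. The sample input, with a Windows-style path inside an attribute, is invented to show where they differ; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Removes every backslash, plus a following 'n' if there is one.
    public static string StripAll(string s)
    {
        return Regex.Replace(s, @"\\n?", "");
    }

    // Removes backslash + 'n', and only those backslashes directly followed by a quote.
    public static string StripBeforeQuote(string s)
    {
        return Regex.Replace(s, @"(\\n)|(\\(?=""))", "");
    }

    public static void Main()
    {
        // Verbatim string: backslashes are literal characters, "" is one quote.
        string dirty = @"<a path=\""C:\temp\"">\n</a>";

        Console.WriteLine(StripAll(dirty));         // prints: <a path="C:temp"></a>
        Console.WriteLine(StripBeforeQuote(dirty)); // prints: <a path="C:\temp"></a>
    }
}
```

The lookahead `(?="")` checks for the quote without consuming it, so the quote itself stays in the output while the backslash in front of it is dropped.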
Bojan Resnik
Good catch! I originally read `[\\\][\n]` as two character classes, but you're right, it's only one.
Alan Moore
I recommend writing your regexes as string literals to avoid confusion. I don't think it's safe to assume all readers are conversant with C#'s verbatim strings, and anyone who assumes you meant `new Regex("\\n?")` will be very confused indeed. :-)
Alan Moore
Thanks for the suggestion, corrected.
Bojan Resnik
A: 

It looks like you want an | in that pattern, to say "match either \n or \".

Try this:

[\\][n]|[\\]
Robb
+1  A: 

The question still isn't clear: if you write the XML string (before you try to clean it) to the console, do you see exactly what you posted above, with all those \" and \n sequences? Does the displayed string start and end with a quotation mark? If so, you probably want to remove the opening and closing quotation marks and all the backslashes, and if any backslash is followed by an 'n', you want to remove that as well. Here's some code to demonstrate:

static void Main(string[] args)
{
  string dirtyXml = @"""<?xml version=\""1.0\"" encoding=\""UTF-8\""?>\n\n<ISBNdb server_time=\""2010-01-28T11:31:08Z\"">\n<BookList total_results=\""1\"" page_size=\""10\"" page_number=\""1\"" shown_results=\""1\"">\n<BookData book_id=\""quantitative_techniques\"" isbn=\""0826458548\"" isbn13=\""9780826458544\"">\n<Title>Quantitative techniques</Title>\n<TitleLong></TitleLong>\n<AuthorsText>Terry Lucey</AuthorsText>\n<PublisherText publisher_id=\""continuum\"">London : Continuum, 2002.</PublisherText>\n</BookData>\n</BookList>\n</ISBNdb>\n""";
  Console.WriteLine(dirtyXml);
  Console.WriteLine();
  Console.WriteLine(Regex.Replace(dirtyXml, @"^""|""$|\\n?", ""));
}

output:

"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n\n<ISBNdb server_time=\"2010-01-28T11:31:08Z\">\n<BookList total_results=\"1\" page_size=\"10\" page_number=\"1\" shown_results=\"1\">\n<BookData book_id=\"quantitative_techniques\" isbn=\"0826458548\" isbn13=\"9780826458544\">\n<Title>Quantitative techniques</Title>\n<TitleLong></TitleLong>\n<AuthorsText>Terry Lucey</AuthorsText>\n<PublisherText publisher_id=\"continuum\">London : Continuum, 2002.</PublisherText>\n</BookData>\n</BookList>\n</ISBNdb>\n"

<?xml version="1.0" encoding="UTF-8"?><ISBNdb server_time="2010-01-28T11:31:08Z"><BookList total_results="1" page_size="10" page_number="1" shown_results="1"><BookData book_id="quantitative_techniques" isbn="0826458548" isbn13="9780826458544"><Title>Quantitative techniques</Title><TitleLong></TitleLong><AuthorsText>Terry Lucey</AuthorsText><PublisherText publisher_id="continuum">London : Continuum, 2002.</PublisherText></BookData></BookList></ISBNdb>

Does this accurately reflect what you're starting with and what you want to end up with?

Alan Moore
Spot on Alan. Thanks a lot.
simplyme