I want to replace certain characters in an input string with other characters.

The input text has Microsoft left and right smart quotes which I would like to convert to just a single ".

I was planning on using the Replace operation, but am having trouble forming the text string to be searched for.

I would like to replace the input byte sequence E2 80 9C (in hex) with just a single ". Ditto for E2 80 9D.

How do I form the string to use in the Replace operation?

I'm thinking of something like (in a loop):

tempTxt = tempTxt.Replace(charsToRemove[i], charsToSubstitute[i]);

but I'm having trouble creating the charsToRemove array.

Maybe a bigger question is whether the whole input file can be read and converted to plain ASCII using some read/write and string conversions in C#.

Thanks, Mike

+1  A: 

Something like this?

char[] charsToRemove = {
    '\u201C', // These are the Unicode code points (not the UTF-8 byte sequences)
    '\u201D'
};

char[] charsToSubstitute = {
    '"',
    '"'
};
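To illustrate the difference between a code point and its UTF-8 bytes, here is a minimal sketch. The class and method names (`CodePointInfo`, `Utf8Hex`, `Hex`) are made up for this example. It assumes the input really is UTF-8, which the byte sequence E2 80 9C suggests but does not prove.

```csharp
using System;
using System.Text;

static class CodePointInfo
{
    // Returns the UTF-8 bytes of a string as hyphen-separated hex.
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    // Returns the UTF-16 code unit of a char as four hex digits.
    public static string Hex(char c)
    {
        return ((int)c).ToString("X4");
    }
}
```

`CodePointInfo.Utf8Hex("\u201C")` gives `E2-80-9C`, and `CodePointInfo.Hex((char)8220)` gives `201C`. So once the text has been decoded into a .NET string, the character to search for is `'\u201C'` (decimal 8220), not the three raw bytes.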
Sean Bright
The actual hex values in my input are E2809C. These may not correspond to smart quotes, but they correspond to some kind of double quote that is displayed by Word. I have been able to remove the 201C and 201D quotes, but not the E2809C and E2809D. Suggestions? Thanks again, Mike
Mike
Could you show some of the code where you are reading the file?
Sean Bright
Is it possible to email you with some code?
Mike
I'd rather you not. How are you reading the file? `StreamReader`? `BinaryReader`?
Sean Bright
See a longer entry below. -- Mike
Mike
Never mind, if I could tell the difference between hex and decimal, this would have worked! Sorry for the trouble.
Mike
A: 

You may want to give Regex a shot. Here's an example that will replace smart-quoted text with the single ".

string tempTxt = "I am going to “test” this.  “Hope” it works";
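// Requires: using System.Text.RegularExpressions;
// sed-style "s/.../" delimiters would be literal characters in a .NET pattern, so match U+201C or U+201D directly.
string formattedText = Regex.Replace(tempTxt, "[\u201C\u201D]", "\"");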
JHBlues76
A: 

I'm using ReqPro40.dll to read data. The data is stored as text. Hope I didn't lose too much on copy/paste below. The stuff below works to the best of my knowledge. But I want to get rid of longer sequences of bad characters. E2809C should become a quote, but I'm having trouble matching it.

string tempTxt = Req.get_Tag(ReqPro40.enumTagFormat.eTagFormat_ReqNameOrReqText);
tempTxt=tempTxt.Substring(1, tempTxt.Length-1);

// Note: \x and \u escapes take hexadecimal digits. The decimal value of each
// character is shown in parentheses.
char[] charsToRemoveForXMLLegality = new char[]
{ '\x000a', '\x000b', '\x0002', '\x001e', // NL, VT, STX, RS
  '"', '\u201C', '\u201D',                // " (34), left double (8220), right double quote (8221)
  '\u2018', '\u2019',                     // left single quote (8216), right single quote (8217)
  '\u2013', '\u2014',                     // en-dash (8211), em-dash (8212)
  '\u00BC', '\u00B1',                     // 1/4 fraction (188), plus/minus (177)
  '\u2026', '\u00A0'                      // ellipsis (8230), non-breaking space (160)
};
string[] charsToSubstituteForXMLLegality = new string[]
{ " ", " ", "", "-",
  "\"", "\"", "\"",
  "\'", "\'",
  "-", "-",
  "1/4", "+/-",
  "...", " "
};

for (int i = 0; i < charsToRemoveForXMLLegality.Length; i++)
{
    tempTxt = tempTxt.Replace(charsToRemoveForXMLLegality[i].ToString(), charsToSubstituteForXMLLegality[i]);
}
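On the broader question of converting the whole input to plain ASCII, a table-driven pass over the string is one option. This is only a sketch: `AsciiNormalizer`, `ToPlainAscii`, the mapping table and the `'?'` fallback are assumptions for illustration, not part of ReqPro or the .NET API. It also leaves ASCII control characters untouched, so the XML-legality replacements in the loop above would still be needed.

```csharp
using System.Collections.Generic;
using System.Text;

static class AsciiNormalizer
{
    // Hypothetical mapping table; extend it to match the characters in your data.
    private static readonly Dictionary<char, string> Substitutions =
        new Dictionary<char, string>
        {
            { '\u201C', "\"" },  // left double quote
            { '\u201D', "\"" },  // right double quote
            { '\u2018', "'" },   // left single quote
            { '\u2019', "'" },   // right single quote
            { '\u2013', "-" },   // en-dash
            { '\u2014', "-" },   // em-dash
            { '\u2026', "..." }, // ellipsis
            { '\u00A0', " " }    // non-breaking space
        };

    // Replaces mapped characters, keeps ASCII as-is, and substitutes the
    // fallback for every other UTF-16 code unit (a surrogate pair therefore
    // produces two fallback characters).
    public static string ToPlainAscii(string input, char fallback = '?')
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            string replacement;
            if (Substitutions.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else if (c <= '\u007F')
                sb.Append(c);
            else
                sb.Append(fallback);
        }
        return sb.ToString();
    }
}
```

A single pass with a `StringBuilder` avoids allocating a new string for every entry in the table, which the repeated `Replace` calls do. For short requirement texts either approach should be fine.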
Mike
Some text: Progress…”.</requirement-text> should be Progress...".</requirement-text> Hex values are: 50 72 6F 67 72 65 73 73 E2 80 A6 E2 80 9D 2E 3C 2F 72 65 71 75 69 72 65
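For anyone wanting to check a hex dump like the one above, here is a small sketch that turns space-separated hex bytes back into a string. `HexDump` and `DecodeUtf8Hex` are hypothetical names, and the code assumes the bytes are UTF-8.

```csharp
using System;
using System.Linq;
using System.Text;

static class HexDump
{
    // Parses space-separated hex byte values and decodes them as UTF-8.
    public static string DecodeUtf8Hex(string hex)
    {
        byte[] bytes = hex
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(h => Convert.ToByte(h, 16))
            .ToArray();
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Decoding `50 72 6F 67 72 65 73 73 E2 80 A6 E2 80 9D 2E` gives `Progress` followed by U+2026 (ellipsis), U+201D (right double quote) and a period. In other words, the bytes E2 80 A6 and E2 80 9D are the UTF-8 forms of decimal 8230 and 8221.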
Mike
Never mind, if I could tell the difference between hex and decimal, this would have worked! Sorry for the trouble.
Mike