ansaurus

Question

How do I remove these kinds of symbols (junk) from a string?

Answer 1

A:

The integer codes of these characters fall outside the normal alphabetic ranges. Find them and replace them with an empty string. String has a Replace method, I believe.

Gishu 2008-09-16 14:11:52

This is easy, but not the best solution, I think. I need the fastest way possible. :) But thanks for the idea.

Lukas Šalkauskas 2008-09-16 14:16:28

Answer 2

+1 A:

Consider Regex.Replace(your_string, regex, "") - that's what I use.
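As a minimal sketch of this suggestion, assuming "junk" means anything outside printable ASCII (the answer does not say which pattern it uses), the call could look like the following. The class name JunkStripper and the method name StripNonAscii are made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not from the thread.
public static class JunkStripper
{
    // Assumption: "junk" is any character outside printable ASCII
    // (space U+0020 through tilde U+007E).
    private static readonly Regex NonPrintableAscii =
        new Regex(@"[^\u0020-\u007E]", RegexOptions.Compiled);

    // Returns the input with every non-printable-ASCII character removed.
    public static string StripNonAscii(string input)
    {
        return NonPrintableAscii.Replace(input, string.Empty);
    }
}
```

Calling JunkStripper.StripNonAscii on the sample string from the question would leave "I Dont see ya..". Note that this deletes the damaged apostrophe rather than restoring it, and it also deletes legitimate accented letters.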

itsmatt 2008-09-16 14:15:35

Nice idea :) I forgot about regex entirely :)

Lukas Šalkauskas 2008-09-16 14:17:35

Answer 3

+4 A:

"I DonÃ¢â‚¬â„¢t see ya..".Replace( "Ã¢â‚¬â„¢", string.Empty);

How did that junk get in there the first place? That's the real question.

Will 2008-09-16 14:16:20

"very funny" :)

Lukas Šalkauskas 2008-09-16 14:18:06

@HalFas, it looks like an encoding issue.

broady 2008-09-16 14:35:26

Unfortunately, it can be due to bugs in closed-source systems (e.g. one (and only one!) of the attributes in Sparxsystems Enterprise Architect's XML export is regularly encoded wrongly at the company's Shanghai branch, preventing their changes to the UML model from being imported in France or England).

Pete Kirkham 2010-01-30 11:38:00

Answer 4

+1 A:

Test each character in turn to see whether it is a valid alphabetic or numeric character, and if not, remove it from the string. The character test is very simple; just use...

char.IsLetterOrDigit(c)

Note that there are various others, such as...

char.IsSymbol(c)
char.IsControl(c)
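A rough sketch of the per-character test described above, assuming spaces should also be kept (the answer does not say). The names CharFilter and KeepLettersDigitsAndSpaces are hypothetical.

```csharp
using System.Text;

// Hypothetical helper; the names are illustrative and not from the thread.
public static class CharFilter
{
    // Copies only letters, digits, and spaces into the result.
    // Assumption: spaces are wanted; everything else is dropped.
    public static string KeepLettersDigitsAndSpaces(string input)
    {
        var builder = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (char.IsLetterOrDigit(c) || c == ' ')
                builder.Append(c);
        }
        return builder.ToString();
    }
}
```

One caveat: char.IsLetterOrDigit is Unicode-aware, so letters such as Ã and â count as letters and survive this filter. On the sample junk it removes ¢ and the punctuation marks but leaves the accented letters behind, so on its own it may not be enough.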

Phil Wright 2008-09-16 14:16:49

Answer 5

A:

If you would like to remove all non-ASCII characters (this is an assumption), you could try:

// Requires: using System.Text;
StringBuilder builder = new StringBuilder(unicodeString.Length);
for (int ii = 0; ii < unicodeString.Length; ++ii)
{
    // Keep a character only if its code lies within 32..127.
    char c = unicodeString[ii];
    if (c >= 32 && c <= 127)
        builder.Append(c);
}
string asciiOnly = builder.ToString();

Or, if you just want them replaced with something like ?'s, you can use the ASCIIEncoding class (available through the Encoding.ASCII property):

// Characters that ASCII cannot represent are encoded as '?'.
byte[] encodedBytes = Encoding.ASCII.GetBytes(unicodeString);
string asciiString = Encoding.ASCII.GetString(encodedBytes);

sixlettervariables 2008-09-16 14:16:58

Answer 6

A:

Either use a blacklist of stuff you do not want, or preferably a whitelist (set). With a whitelist you iterate over the string and copy only the letters that are in your whitelist to the result string. You said remove, and the way to do that is with two pointers: one you read from (R) and one you write to (W):

I DonÃ¢â‚
     W  R

If the comma is in your whitelist, then in this case you would read the comma, write it where Ã is, and then advance both pointers. UTF-8 is a multi-byte encoding, so advancing a pointer may involve more than adding one to the address.

In C, an easy way to get a whitelist is to use one of the predefined functions (or macros): isalnum, isalpha, isascii, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit. In this case you end up with a whitelist function instead of a set, of course.
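The read/write pointer idea can be sketched in C# as well, using array indexes in place of pointers and a predicate in place of the C classification functions. This is only an illustration under those assumptions; the names InPlaceFilter and Compact are hypothetical.

```csharp
using System;

// Hypothetical helper; the names are illustrative and not from the thread.
public static class InPlaceFilter
{
    // Compacts the allowed characters toward the front of a buffer.
    // 'read' scans every character; 'write' marks the next free slot.
    public static string Compact(string input, Func<char, bool> isAllowed)
    {
        char[] buffer = input.ToCharArray();
        int write = 0;
        for (int read = 0; read < buffer.Length; read++)
        {
            if (isAllowed(buffer[read]))
            {
                buffer[write] = buffer[read];
                write++;
            }
        }
        // Only the first 'write' characters are part of the result.
        return new string(buffer, 0, write);
    }
}
```

For example, InPlaceFilter.Compact(text, c => c < 128) keeps only ASCII characters. A C# char is a UTF-16 code unit, so each index step here moves by one unit; the multi-byte concern mentioned above applies when walking raw UTF-8 bytes, and this sketch makes no attempt to keep surrogate pairs together.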

Usually when I see data like you have I look for memory corruption, or evidence to suggest that the encoding I expect is different than the one the data was entered with.

/Allan

Allan Wind 2008-09-16 14:23:05

Answer 7

+2 A:

This looks disturbingly similar to a character encoding issue in which text in the Windows character set is stored in a database that uses the standard character encoding. I see someone voted Will down, but he has a point. You may be solving the immediate issue, but the possible combinations of characters are limitless if this is the cause.

Abyss Knight 2008-09-16 14:29:58

Answer 8

+2 A:

By removing any non-latin character you'll be intentionally breaking some internationalization support.

Don't forget the poor guy whose name has a "â" in it.

Marc Hughes 2008-09-16 14:33:30

Answer 9

+2 A:

If you really have to do this, regular expressions are probably the best solution.

I would strongly recommend that you think about why you have to do this, though - at least some of the characters you're listing as undesirable are perfectly valid and useful in other languages, and just filtering them out will most likely annoy at least some of your international users. As a Swede, I can't emphasize enough how much I hate systems that can't handle our å, ä and ö characters correctly.

Liedman 2008-09-16 14:34:47

Answer 10

+15 A:

That 'junk' looks a lot like someone interpreted UTF-8 data as ISO 8859-1 or Windows-1252, probably repeatedly.

Encoded as Windows-1252, Ã¢â‚¬â„¢ is the byte sequence C3 A2, E2 82 AC, E2 84 A2. Read as UTF-8, those bytes decode as follows:

UTF-8 C3 A2 = U+00E2 = â
UTF-8 E2 82 AC = U+20AC = €
UTF-8 E2 84 A2 = U+2122 = ™

We then do it again: in Windows-1252 the resulting sequence â€™ is the bytes E2 80 99, which is the UTF-8 encoding of U+2019, RIGHT SINGLE QUOTATION MARK (’). That is what the character should have been.

You could make multiple passes with byte arrays, Encoding.UTF8 and Encoding.GetEncoding(1252) to correctly turn the junk back into what was originally entered. You will need to check your processing to find the two places where UTF-8 data was incorrectly interpreted as Windows-1252.
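The multiple-pass repair described above might look roughly like this. It is a sketch, assuming the text was misread as Windows-1252 a known number of times and that it runs on modern .NET, where code page 1252 has to be registered first (on the classic .NET Framework the registration line is unnecessary and can be dropped). The names MojibakeRepair, UndoOnce, and Undo are hypothetical.

```csharp
using System.Text;

// Hypothetical helper; the names are illustrative and not from the thread.
public static class MojibakeRepair
{
    static MojibakeRepair()
    {
        // Assumption: modern .NET, where code page 1252 is not
        // available until the code pages provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Reverses one round of "UTF-8 bytes misread as Windows-1252":
    // re-encode the garbled text as 1252 to recover the original bytes,
    // then decode those bytes as the UTF-8 they really were.
    public static string UndoOnce(string garbled)
    {
        byte[] originalBytes = Encoding.GetEncoding(1252).GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    // Applies UndoOnce the given number of times.
    public static string Undo(string garbled, int passes)
    {
        string current = garbled;
        for (int i = 0; i < passes; i++)
            current = UndoOnce(current);
        return current;
    }
}
```

With the sample from the question, one pass turns Ã¢â‚¬â„¢ into â€™ and a second pass yields ’ (U+2019), matching the byte walk-through above. Running more passes than the data was damaged will corrupt it, so the pass count has to come from knowing where the misinterpretation happened.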

Mike Dimmick 2008-09-16 15:01:26

Answer 11

+1 A:

Regex.Replace("The string", "[^a-zA-Z ]","");

That's how you'd do it in C#, although that regular expression ([^a-zA-Z ]) should work in most languages.

[Edited: forgot the space in the regex]

2008-09-16 15:29:18
