I need to find and delete all the non-standard ASCII chars that are in a string (usually delivered there by MS Word). I'm not entirely sure what these characters are... like the fancy apostrophe and the directional (curly) quotation marks and all that. Is that Unicode? I know how to do it ham-handedly with [a-z etc. etc.], but I was hoping there was a more elegant way to just exclude anything that isn't on the keyboard.

A: 

What you are probably looking at are Unicode characters in UTF-8 format. If so, just escape them in your regular expression language.
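
As a rough sketch of what that escaping might look like, the Python snippet below lists a few of the characters Word commonly substitutes by their Unicode code points and deletes them. The function name `strip_word_punctuation` and the particular set of code points are assumptions for illustration, not an exhaustive list.

```python
import re

# Code points Word commonly auto-inserts (an illustrative, incomplete list):
# U+2018/U+2019 single curly quotes, U+201C/U+201D double curly quotes,
# U+2013/U+2014 en and em dashes, U+2026 horizontal ellipsis.
WORD_CHARS = re.compile("[\u2018\u2019\u201c\u201d\u2013\u2014\u2026]")


def strip_word_punctuation(text):
    """Delete the listed characters; everything else is left untouched."""
    return WORD_CHARS.sub("", text)


print(strip_word_punctuation("\u201cIt\u2019s fine\u201d"))  # Its fine
```

This only works if the text has already been decoded to Unicode; on raw bytes the pattern would not match.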

Foredecker
A: 

My solution to this problem is to write a Perl script that gives me all of the characters that are outside of the ASCII range (0 - 127):

#!/usr/bin/perl

use strict;
use warnings;

my %seen;
while (<>) {
    for my $character (grep { ord($_) > 127 } split //) {
        $seen{$character}++;
    }
}

print "saw $_ $seen{$_} times, its ord is ", ord($_), "\n" for keys %seen;

I then create a mapping of those characters to what I want them to be and replace them in the file:

#!/usr/bin/perl

use strict;
use warnings;

my %map = (
    chr(128) => "foo",
    #etc.
);

while (<>) {
    # /g replaces every match on the line; unmapped characters are kept as-is
    s/([\x{80}-\x{FF}])/exists $map{$1} ? $map{$1} : $1/ge;
    print;
}
Chas. Owens
I'm not even looking to replace them with anything... just to remove them. Given the criteria, I could just do a regex on everything above 127, but I don't want to do that. I want to know what those chars are and target them specifically.
Dr.Dredel
So run the first program, find the ones you want deleted, and set their %map values to an empty string; set the %map values of the characters you want to keep to themselves. Or just create a character class that has the values you want deleted: s/[\x{86}\x{89}]//g; That will remove all characters that have ordinal values of 134 or 137 (the first script will tell you the ordinal values).
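
For anyone not using Perl, the same two steps (take an inventory, then delete only the characters you pick) might look roughly like this in Python. The helper names `non_ascii_inventory` and `delete_chars` and the sample string are made up for illustration.

```python
from collections import Counter


def non_ascii_inventory(text):
    """Count every character whose code point is above 127."""
    return Counter(ch for ch in text if ord(ch) > 127)


def delete_chars(text, unwanted):
    """Remove only the characters listed in `unwanted`."""
    # str.translate drops any code point that is mapped to None.
    return text.translate({ord(ch): None for ch in unwanted})


sample = "caf\u00e9 \u201cquoted\u201d"
for ch, count in non_ascii_inventory(sample).items():
    print("saw U+%04X %d time(s)" % (ord(ch), count))

# Keep the accented letter and drop only the two curly quotes.
# ascii() keeps the output readable on consoles that cannot show non-ASCII.
print(ascii(delete_chars(sample, "\u201c\u201d")))
```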
Chas. Owens
A: 

Microsoft apps are notorious for using fancy characters like curly quotes, em-dashes, etc., that require special handling without adding any real value. In some cases, all you have to do is make sure you're using one of their extended character sets to read the text (e.g., windows-1252 instead of ISO-8859-1). But there are several tools out there that replace those fancy characters with their plain-but-universally-supported equivalents. Google for "demoronizer" or "AsciiDammit".
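
A minimal sketch of both ideas in Python, assuming the input really is windows-1252 bytes: decode with the right character set first, then swap the fancy characters for plain ones. The `PLAIN` table and `to_plain` are illustrative names, and the table is far smaller than what the tools mentioned above handle.

```python
# Windows-1252 bytes: "Hi" in curly double quotes, then an em dash.
raw = b"\x93Hi\x94 \x97 done"

# Decoding as cp1252 yields real curly quotes and a real em dash.
text = raw.decode("cp1252")

# Small replacement table (illustrative, not complete).
PLAIN = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
}


def to_plain(text):
    """Replace each character in PLAIN with its ASCII stand-in."""
    return text.translate(str.maketrans(PLAIN))


print(to_plain(text))  # "Hi" - done
```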

Alan Moore
+1  A: 

Probably the best way to handle this is to work with character sets, yes, but for what it's worth, I've had some success with this quick-and-dirty approach: the character class

[\x80-\x9F]

This works because, for me, the problem "Word chars" are the ones in this range: Windows-1252 puts its curly quotes and dashes there, while in Unicode (and ISO-8859-1) those code points are control characters rather than printable text. I've also got no way of sanitising user input.
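
Whether that class matches anything depends on how the bytes were decoded. The Python sketch below (sample bytes invented for illustration) shows it stripping Word's quotes when Windows-1252 bytes are read as Latin-1, and matching nothing when the same bytes are decoded as Windows-1252, because the quotes then sit at U+2018 to U+201D instead.

```python
import re

C1_RANGE = re.compile("[\x80-\x9f]")

# Windows-1252 bytes: curly double quotes around a word, curly apostrophe.
raw = b"\x93quoted\x94 and it\x92s"

# Read as Latin-1, the Word characters land on U+0080..U+009F.
as_latin1 = raw.decode("latin-1")
print(C1_RANGE.sub("", as_latin1))  # quoted and its

# Read as cp1252, they become U+201C, U+201D and U+2019: no match.
as_cp1252 = raw.decode("cp1252")
print(C1_RANGE.sub("", as_cp1252) == as_cp1252)  # True
```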

AmbroseChapel
Those values don't seem to match the Word-specific ’, “ and ”. But I like this approach the most because offhand I don't think there are any other chars that Word has that are outside of the norm... or am I missing something?
Dr.Dredel
A: 

What I would do is use AutoHotkey, Python SendKeys, or some sort of Visual Basic script to send all possible keys (both with and without Shift) to a Word document.

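In SendKeys it would be a script of the form:

import SendKeys  # third-party Windows module (Python 2 era)

# range() excludes its end point, so add 1 to include 'z' and '9'
chars = ''.join([chr(i) for i in range(ord('a'), ord('z') + 1)])
nums = ''.join([chr(i) for i in range(ord('0'), ord('9') + 1)])
# joined into a string so it can be concatenated; the backslash is escaped
specials = ''.join(['-', '=', '\\', '/', ',', '.', '`'])
all_keys = chars + nums + specials  # avoids shadowing the built-in all()
SendKeys.SendKeys("""
    {LWIN}
    {PAUSE .25}
    r
    winword.exe{ENTER}
    {PAUSE 1}
    %(all)s
    +(%(all)s)
    "testQuotationAndDashAutoreplace"{SPACE}-{SPACE}a{SPACE}{BS 3}{LEFT}{BS}
    {Alt}{PAUSE .25}{SHIFT}
    changeLanguage
    %(all)s
    +(%(all)s)
""" % {'all': all_keys})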

Then I would save the document as text and use it as a database of all the displayable keys in your keyboard layout (you might want to switch the default input language more than once to capture absolutely all displayable characters).

If the char is in the resulting text document, it is displayable; otherwise it is not. No need for a regexp. You can of course embed the character range in a script or program afterward.
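
The lookup itself could be sketched in Python along these lines. `load_allowed` and `keep_allowed` are hypothetical helper names, the saved file is assumed to be UTF-8 text, and the demo uses a small hand-written set instead of a real saved document.

```python
import io


def load_allowed(path):
    """Read the saved text file and return the set of characters in it."""
    with io.open(path, encoding="utf-8") as handle:
        return set(handle.read())


def keep_allowed(text, allowed):
    """Drop every character that is not in the allowed set."""
    return "".join(ch for ch in text if ch in allowed)


# Stand-in for load_allowed("keys.txt"); the file name is hypothetical.
allowed = set("abcdefghijklmnopqrstuvwxyz '\"-")
print(keep_allowed("it\u2019s a \u201ctest\u201d", allowed))  # its a test
```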

Elazar Leibovich
A: 

I usually use a jEdit macro that replaces the most common of them with a more ASCII-friendly version, e.g.:

  • hyphens and dashes to minus sign;
  • suspension dots (single char) to multiple dots;
  • list item dot to asterisk;
  • etc.

It is easily adapted to Word/OpenOffice/whatever and can of course be modified to suit your needs. I wrote an article on this topic: http://www.megadix.it/node/138
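
This is not the macro itself, just a rough Python sketch of the same kind of replacement table. The code points chosen and the name `asciify` are assumptions for illustration.

```python
# Replacement table modelled on the list above (illustrative only).
REPLACEMENTS = {
    "\u2010": "-",    # hyphen
    "\u2011": "-",    # non-breaking hyphen
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # single-character ellipsis
    "\u2022": "*",    # bullet (list item dot)
}
TABLE = str.maketrans(REPLACEMENTS)


def asciify(text):
    """Swap each listed character for its ASCII-friendly version."""
    return text.translate(TABLE)


print(asciify("\u2022 wait\u2026 2\u20133 days"))  # * wait... 2-3 days
```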

Cheers

Megadix