I have a string containing illegal characters that I want to remove, but I don't know in advance which characters may be present.

I built a list of the characters I want to keep, and adapted this script from one I found on the web:

on clean_string(TheString)
    --Store the current TIDs. To be polite to other scripts.
    set previousDelimiter to AppleScript's text item delimiters
    set potentialName to TheString
    set legalName to {}
    set legalCharacters to {"a", "b", "c", "d", "e", "f", 
"g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r",
"s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E",
 "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R",
 "S", "T", "U", "V", "W", "X", "Y", "Z", "1", "2", "3", "4", "5",
 "6", "7", "8", "9", "0", "?", "+", "-", "Ç", "ç", "á", "Á", "é",
 "É", "í", "Í", "ó", "Ó", "ú", "Ú", "â", "Â", "ã", "Ã", "ñ", "Ñ",
 "õ", "Õ", "à", "À", "è", "È", "ü", "Ü", "ö", "Ö", "!", "$", "%",
 "/", "(", ")", "&", "€", "#", "@", "=", "*", "+", "-", ",", ".",
 "–", "_", " ", ":", ";", ASCII character 10, ASCII character 13}

    --Whatever you want to eliminate.
    --Now iterate through the characters checking them.
    repeat with thisCharacter in the characters of potentialName
        set thisCharacter to thisCharacter as text
        if thisCharacter is in legalCharacters then
            set the end of legalName to thisCharacter
            log (legalName as string)

        end if
    end repeat
    --Make sure that you set the TIDs before making the
    --list of characters into a string.
    set AppleScript's text item delimiters to ""
    --Check the name's length.
    if length of legalName is greater than 32 then
        set legalName to items 1 thru 32 of legalName as text
    else
        set legalName to legalName as text
    end if
    --Restore the current TIDs. To be polite to other scripts.
    set AppleScript's text item delimiters to previousDelimiter
    return legalName
end clean_string

The problem is that this script is slow as hell and gives me a timeout.

What I am doing is checking character by character and comparing against the legalCharacters list. If the character is there, it is fine. If not, ignore.

Is there a fast way to do that?

something like

"look at every char of TheString and remove those that are not on legalCharacters"

?

thanks for any help.

+2  A: 

Iterating in AppleScript is always slow, and there really isn't a faster way around these problems. Logging inside a loop is an absolutely guaranteed way to slow things down, so use the log command judiciously.

In your specific case, however, you have a length limit, and moving the length check into the repeat loop can cut the processing time considerably, because the loop stops once 32 legal characters have been collected (just under a second to run in Script Debugger, regardless of the length of the text):

    on clean_string(TheString)
        set potentialName to TheString
        set legalName to {}
        set legalCharacters to {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "1", "2", "3", "4", "5", "6", "7", "8", "9", "0", "?", "+", "-", "Ç", "ç", "á", "Á", "é", "É", "í", "Í", "ó", "Ó", "ú", "Ú", "â", "Â", "ã", "Ã", "ñ", "Ñ", "õ", "Õ", "à", "À", "è", "È", "ü", "Ü", "ö", "Ö", "!", "$", "%", "/", "(", ")", "&", "€", "#", "@", "=", "*", "+", "-", ",", ".", "–", "_", " ", ":", ";", ASCII character 10, ASCII character 13}
        with timeout of 86400 seconds --86400 seconds = 24 hours
            repeat with thisCharacter in the characters of potentialName
                set thisCharacter to thisCharacter as text
                if thisCharacter is in legalCharacters then
                    set the end of legalName to thisCharacter
                    --Stop as soon as the 32-character limit is reached.
                    if length of legalName is greater than or equal to 32 then exit repeat
                end if
            end repeat
        end timeout
        --Join the list with empty TIDs, then restore the previous ones.
        set previousDelimiter to AppleScript's text item delimiters
        set AppleScript's text item delimiters to ""
        set legalName to legalName as text
        set AppleScript's text item delimiters to previousDelimiter
        return legalName
    end clean_string
Philip Regan
Thanks, but this loop gives me this error: error "AppleEvent timed out." number -1712. I suppose the text is too long and AppleScript is not willing to wait for it to finish.
Digital Robot
I've added a timeout block to the code, but you shouldn't be getting that error here (I believe the default timeout is two minutes). I ran the code on the complete text of this page without any problems. I'm thinking you may have to wrap the timeout block around the call to the subroutine, or somewhere else higher up in the stack.
Philip Regan
+2  A: 

What non-ascii characters are you running into? What is your file encoding?

It's much, much more efficient to use a shell script and tr, sed or perl to process text. All three are installed by default on OS X.

You can use a shell script with tr (as in the example below) to strip returns, and you can also use sed to strip spaces (not shown in the example below):

set clean_text to do shell script "echo " & quoted form of the_string & "| tr -d '\\r\\n' "

Technical Note TN2065: do shell script in AppleScript
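The same tr approach can also express the whitelist idea from the question directly. The following is only a minimal sketch: `keep_legal` is a hypothetical helper name, and the character set is an assumed ASCII-only subset of the question's list. Because it forces the C locale, accented letters such as "Ç" would be dropped as well, so the set would need adjusting for real use.

```shell
# Hypothetical helper: keep only whitelisted characters, delete everything else.
# -c complements the set and -d deletes, so every character NOT listed is removed.
# The set below is an assumed ASCII-only whitelist; the trailing "-" is a literal hyphen.
keep_legal() {
    printf '%s' "$1" | LC_ALL=C tr -cd 'A-Za-z0-9 ?+,.:;_-'
}

keep_legal 'report #42 <draft>.txt'
# prints: report 42 draft.txt
```

From AppleScript, the pipeline inside the function could be passed to do shell script with quoted form of the_string, in the same way as the tr line above.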

Or, with perl, this will strip everything except letters, digits and whitespace:

set x to quoted form of "Sample text. smdm#$%%&"
set y to do shell script "echo " & x & " | perl -pe 's/[^[:alnum:][:space:]]//g'"

Search around SO for other examples of using tr, sed and perl to process text with AppleScript. Or search MacScripter / AppleScript | Forums

songdogtech
+1  A: 

Another shell script method might be:

set clean_text to do shell script "echo " & quoted form of the_string & "|sed \"s/[^[:alnum:][:space:]]//g\""

That uses sed to delete everything that isn't an alphanumeric character or whitespace. More regex reference here
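If the 32-character limit from the question matters, the truncation can probably be done in the same pipeline. This is a rough sketch, assuming single-line input; `clean_name` is a hypothetical helper name, and what counts as alphanumeric depends on the shell's locale.

```shell
# Hypothetical helper: sed filter followed by the question's 32-character limit.
# Assumes single-line input; cut would truncate each line separately otherwise.
clean_name() {
    printf '%s\n' "$1" | sed 's/[^[:alnum:][:space:]]//g' | cut -c1-32
}

clean_name 'a!b@c'
# prints: abc
```

Doing the filtering and the truncation in one shell call avoids the character-by-character loop in AppleScript entirely.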

stib
This is a good string, too, for processing text.
songdogtech