[^\x20-\x7E]
I saw this pattern used for a regular expression in which the goal was to remove non-ascii characters from a string. What does it mean?
It means "anything that isn't a character code in the hexadecimal range 0x20 to 0x7E, i.e. 32 to 126".
It says, roughly: match any character that is not (^) in the range \x20-\x7E (hex 0x20-0x7E). According to http://www.asciitable.com/, those are the characters from space to ~.
The caret (^) at the start of the brackets [] means "not", and \x20-\x7E denotes the range of printable ASCII characters, where \x20 (space) is the beginning of the range and \x7E (~) is the end.
I can't say for certain that this is an accurate method for the stated purpose, but that is how the expression is read.
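As a rough illustration of how the pattern behaves when used for removal, here is a minimal Python sketch. The question does not say which language the regex was used in, so Python and the helper name strip_non_printable_ascii are assumptions made for the example.

```python
import re

# Same character class as in the question: anything outside 0x20-0x7E.
NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7E]")

def strip_non_printable_ascii(text: str) -> str:
    """Remove every character that is not printable ASCII (space through ~)."""
    return NON_PRINTABLE_ASCII.sub("", text)

# The accented e, the tab and the newline are all outside the range.
print(strip_non_printable_ascii("caf\u00e9\tau lait\n"))  # prints "cafau lait"
```

Note that the tab and newline are removed along with the é, so the pattern does more than strip non-ASCII characters.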
It means: match any character that is not a printable ASCII character.
Printing characters include a to z, A to Z, 0 to 9 and symbols such as ",;$#% etc.
^ not
\x20 hex code for space character
- to
\x7e hex code for ~ (tilde) character
All the ascii printing characters fall between these two.
This pattern matches non-ASCII characters as well as ASCII control (non-printing) characters such as bell, tab, null and others.
Look at
man ascii
on a unix system to see which characters it matches.
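If no man page is at hand, the same information can be checked programmatically. The following is a small sketch in Python (chosen only for illustration) that tests each of the 128 ASCII code points against the pattern; the variable names are made up for the example.

```python
import re

pattern = re.compile(r"[^\x20-\x7E]")

# ASCII code points (0-127) that the pattern matches, i.e. would remove.
matched = [cp for cp in range(128) if pattern.fullmatch(chr(cp))]

# ASCII code points the pattern leaves alone.
kept = [cp for cp in range(128) if not pattern.fullmatch(chr(cp))]

print(matched)                      # 0 through 31, plus 127 (DEL)
print("".join(map(chr, kept)))      # space through ~
```

Within ASCII, the matched set is exactly the control characters (0x00-0x1F and 0x7F); every character above 0x7F matches as well.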
In perl, you could also write this as
[^ -~]
or
[[:^cntrl:]]
Note that this last one is not equivalent: it matches any non-control character, including extended ASCII (e.g. accented characters) and Unicode, so it selects the characters to keep, whereas [[:cntrl:]] selects the control characters to remove.
You may not want to restrict yourself to just ASCII, since locales outside the US often use valid printing characters outside this small range, e.g. øüéåç...
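For comparison, here is a sketch of the less restrictive approach: drop only control characters and keep accented letters. Python's re module has no POSIX classes such as [[:cntrl:]], so this example substitutes the Unicode general category Cc from unicodedata, which is an assumption and may not match Perl's class in every edge case. The function name is hypothetical.

```python
import unicodedata

def strip_control_chars(text: str) -> str:
    """Drop Unicode control characters (category Cc) and keep everything else."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")

# The bell (\x07) and the newline are removed; the accented letters survive.
print(strip_control_chars("sm\u00f8rg\u00e5s\x07bord\n"))  # prints "smørgåsbord"
```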