What does this regular expression mean? It is in an XML schema that I am using:

([!-~]|[ ])*[!-~]([!-~]|[ ])*

-Dave

+1  A: 

[!-~] Matches any of the characters between "!" and "~" (the represented characters theoretically depend on the encoding in use)

[ ] Matches a space character

(x|y) Matches one of x or y

(x)* Matches any number of consecutive occurrences of x (including none).

Romain
It seems like it would be better written as `([!-~ ])*[!-~]([!-~ ])*`
Anon.
There are always other ways to write regular expressions, unless they are trivial :)
Romain
+1  A: 

Any characters in the range of ! to ~ or spaces, followed by one character of the range ! to ~, followed by any number of that same range or spaces again. So it would appear to be the same as:

([!-~ ])*[!-~]([!-~ ])*
Stephen Cross
Or also equivalent to `([!-~]|[ ]?)+`. Note that [!-~] is a character range (everything between ! and ~), not the set of the three literal characters !, ~ and -.
Romain
@Romain: No, your example matches (among other incorrect things), the empty string.
Anon.
Exactly. Never mind the example; the secondary comment is still valid, though :)
Romain
Thanks, I corrected it.
Stephen Cross
+3  A: 

Take it in parts. Here's the first part:

([!-~]|[ ])*

This means any number (*) of the characters between ! and ~ (including ! and ~; this turns out to be all of the visible printable ASCII characters, if you look up ! and ~ in an ASCII table) or a space.

Here's the second part:

[!-~]

This means one character between ! and ~

Here's the last part:

([!-~]|[ ])*

This means the same thing as the first part.

So this regular expression will match any string of printable ASCII characters, including spaces, provided there is at least one non-space printable ASCII character in the string.
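To try this out, here is a minimal Python sketch. It assumes Python's `re` module treats this particular pattern the same way an XML Schema processor would, which should hold for simple character ranges like these. XML Schema patterns have to match the whole value, so the sketch uses `re.fullmatch` to mimic that. The name `is_valid` and the sample strings are made up for illustration.

```python
import re

# The pattern from the question. XML Schema patterns are implicitly anchored
# at both ends, so fullmatch() is used below to reproduce that behaviour.
PATTERN = re.compile(r"([!-~]|[ ])*[!-~]([!-~]|[ ])*")


def is_valid(value: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return PATTERN.fullmatch(value) is not None


# Hypothetical sample values: the first two should be accepted; the rest
# should be rejected (only spaces, empty, contains a tab, contains non-ASCII).
samples = ["hello world", "  x  ", "   ", "", "tab\there", "caf\u00e9"]
for s in samples:
    print(repr(s), is_valid(s))
```

A string of only spaces is rejected because the middle `[!-~]` needs one non-space character to consume.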

Dominic Cooney
I don't get it, what about the "|[ ]" part in first and last part of the regex? Does it mean nothing?
BeowulfOF
It's an alternative: either one non-space character (the part before the |) or a space character (the part after the |).
Romain
Why not just '[ -~]*[!-~][ -~]*' ?
dtmilano
@dtmilano: you're right, `[ -~]*[!-~][ -~]*` works just fine (at least it does in RegexBuddy when I specify "XML Schema" mode).
Alan Moore
In fact, `[ ]*[!-~][ -~]*` works too, and I think it's more readable as well as more efficient (not that efficiency is likely to be an issue).
Alan Moore
+2  A: 

The answers you've gotten seem to have missed one of the fundamentals of REs: a '-' inside square brackets isn't taken to mean a literal '-' unless it's the first or last character. Instead, the '-' defines a range. The '!' is (in ASCII, ISO 8859, etc.) character code 33 -- the first "visible" printable character. Likewise, in ASCII, the '~' is code 126, the last printable character.

Therefore, the "[!-~]" matches a single printable (ASCII) character.
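A quick way to check the code points for yourself is the short Python sketch below. It only uses the standard library; the variable names are mine.

```python
import string

# Code points of the two endpoints of the range [!-~].
first, last = ord("!"), ord("~")
print(first, last)

# Every character the range covers, in code point order.
in_range = "".join(chr(c) for c in range(first, last + 1))
print(in_range)

# The range is exactly the ASCII letters, digits and punctuation;
# the space (code 32) sits just below it and is not included.
visible = set(string.ascii_letters + string.digits + string.punctuation)
print(set(in_range) == visible, " " in in_range)
```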

For the rest, the other answers seem reasonable.

Edit: it looks like as I was writing this, some more accurate answers were posted -- my apologies if I offended anybody by implying otherwise. As I started writing this, the answers that had been posted were wrong on this point.

Jerry Coffin
That's what was tripping me up. I'm glad you and others explained that.
Dave
doh, I missed that small fact myself :(
bramp
+1  A: 

The regular expression consists of:

  • ([!-~]|[ ])* start with zero or more characters of the range from ! (0x21) to ~ (0x7E) or the space character (0x20), so basically all printable characters from 0x21 to 0x7E plus the space character
  • [!-~] followed by a single printable character
  • ([!-~]|[ ])* followed by zero or more printable characters or the space character

So it basically says that the string must only contain printable characters or the space character and there must be at least one printable character.
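As a sanity check on that summary, the following Python sketch compares the original pattern with the two shorter forms suggested in the comments, `[ -~]*[!-~][ -~]*` and `[ ]*[!-~][ -~]*`, and with a plain non-regex check. It is a brute-force comparison over short strings from a small test alphabet that I picked arbitrarily, so it is evidence rather than proof, and it assumes Python's `re` handles these patterns the same way an XML Schema processor does.

```python
import re
from itertools import product

ORIGINAL = re.compile(r"([!-~]|[ ])*[!-~]([!-~]|[ ])*")
SHORTER = [
    re.compile(r"[ -~]*[!-~][ -~]*"),
    re.compile(r"[ ]*[!-~][ -~]*"),
]


def plain_check(value: str) -> bool:
    """Only characters 0x20..0x7E, and at least one in 0x21..0x7E."""
    only_allowed = all(0x20 <= ord(c) <= 0x7E for c in value)
    has_visible = any(0x21 <= ord(c) <= 0x7E for c in value)
    return only_allowed and has_visible


# Hypothetical test alphabet: space, both range endpoints, a letter,
# and two characters just outside the allowed set (tab and DEL).
ALPHABET = [" ", "!", "a", "~", "\t", "\x7f"]


def find_disagreements(max_len: int = 4) -> list:
    """Return every test string on which any variant differs from ORIGINAL."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            s = "".join(chars)
            expected = ORIGINAL.fullmatch(s) is not None
            results = [p.fullmatch(s) is not None for p in SHORTER]
            results.append(plain_check(s))
            if any(r != expected for r in results):
                bad.append(s)
    return bad


print(find_disagreements())
```

An empty list means all four checks agreed on every string tried.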

Gumbo