views: 918
answers: 10

I have had the need to use regular expressions only a few times in the work that I have done; however, in those few times I discovered a very powerful form of expression that would enable me to do some extremely useful things.

The problem is that the language used for regular expressions is wrong - full stop.

It is wrong from a psychological point of view: disembodied symbols serve as a useful reference only to those with an eidetic (photographic) memory. While the syntactic rules are clearly laid out, in my experience and that of others, evolving a regular expression that works correctly can be difficult in all but the most trivial situations. This is understandable, since it is a symbolic analogue of set theory, which is a fairly complicated thing.

One of the things that can prove difficult is dissolving the expression you are working on into its discrete parts. Due to the nature of the language, it is possible to read one regular expression in multiple ways if you don't understand its primary goal, so interpreting other people's regexes is complicated. In the study of natural language, I believe this is called pragmatics.

The question I'd like to ask, then, is this: is there such a thing as a regular expression compiler? Or can one even be built?

It could be possible to consider regexes, metaphorically, as assembly language - there are some similarities. Could a compiler be designed that turns a more natural, higher-level language into regular expressions? Then, in my code, I could define my regexes using the higher-level language in a header file and reference them where necessary using a symbolic reference. I and others could refer from my code to the header file and more easily appreciate what I am trying to achieve with my regexes.
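The kind of thing I mean might look like this (a rough sketch in Python; the builder names `literal`, `digits`, and `compile_pattern` are invented for illustration, not an existing library):

```python
import re

# Hypothetical "higher-level language" compiled down to a regex.
# Each helper returns a regex fragment; compile_pattern joins them.
def literal(s):
    return re.escape(s)          # escape metacharacters in a literal

def digits(n):
    return rf"\d{{{n}}}"         # exactly n decimal digits

def compile_pattern(*parts):
    return re.compile("".join(parts))

# "two digits, a colon, two digits" - i.e. \d{2}:\d{2}
TIME = compile_pattern(digits(2), literal(":"), digits(2))
```

The symbolic name `TIME` could then live in one place and be referenced from the rest of the code.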

I know it can be done from a logical point of view - otherwise computers wouldn't be possible - but if you have read this far (which is unlikely :) then would you consider investing the time in realising it?

+4  A: 

I have never stumbled across anything like that, and I don't think it would be useful.

That higher-level language would be very verbose and my guess is that you'd need pretty long statements to come up with a regular expression of average complexity.

Maybe you just haven't been using regular expressions often enough. Believe me, my memory is far from being eidetic (or even good), but I rarely have problems crafting regular expressions or understanding those of my coworkers.

innaM
+1  A: 

One way you can bypass this problem is by using a program like QuickREx, which shows how a regex works on multiple test data (with highlights). You could save the test data in a file near your regex, and later, when you want to change, understand, or fix it, that would be much easier.

01
A: 

Have you considered using a parser generator (aka compiler compiler) such as ANTLR?

ANTLR also has some kind of IDE (ANTLR Works) where you can visualize/debug parsers.

On the other hand, a parser generator is not something you can throw into your app in a few seconds like a regex - and it would be total overkill for something like checking the format of an email address. For simple situations like that, a better way is just to write comments for your regex explaining what it does.

Fionn
+4  A: 

What about writing them with RegexBuddy and pasting the description it generates as a comment in your code?

Andrea Ambu
+1: regex is extremely hard to read, but this is a tooling issue, not a language issue
Michael Haren
+4  A: 

1) Perl permits the /x switch on regular expressions to enable comments and whitespace to be included inside the regex itself. This makes it possible to spread a complex regex over several lines, using indentation to indicate block structure.
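(Python offers the same facility via the `re.VERBOSE` flag; a rough equivalent of such a commented regex, matching a quoted string:)

```python
import re

# re.VERBOSE is Python's analogue of Perl's /x: whitespace is ignored
# and # starts a comment, so the pattern can be laid out like code.
quoted_string = re.compile(r"""
    "            # opening quote
    (?:
        [^\\"]   # any character except backslash or quote
      | \\.      # ...or a backslash-escaped character
    )*
    "            # closing quote
""", re.VERBOSE)
```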

2) If you don't like the line-noise-resembling symbols themselves, it's not too hard to write your own functions that build regular expressions. E.g. in Perl:

sub at_start { '^'; }
sub at_end { '$'; }
sub any { "."; }
sub zero_or_more { "(?:$_[0])*"; }
sub one_or_more { "(?:$_[0])+"; }
sub optional { "(?:$_[0])?"; }
sub remember { "($_[0])"; }
sub one_of { "(?:" . join("|", @_) . ")"; }
sub in_charset { "[$_[0]]"; }        # I know it's broken for ']'...
sub not_in_charset { "[^$_[0]]"; }   # I know it's broken for ']'...

Then e.g. a regex to match a quoted string (/^"(?:[^\\"]|\\.)*"/) becomes:

at_start .
'"' .
zero_or_more(
    one_of(
        not_in_charset('\\\\"'),    # Yuck, 2 levels of escaping required
        '\\\\' . any
    )
) .
'"'

Using this "string-building functions" strategy lends itself to expressing useful building blocks as functions (e.g. the above regex could be stored in a function called quoted_string(), you might have other functions for reliably matching any numeric value, an email address, etc.).
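The same strategy ports directly to other languages. A rough Python equivalent (the helper names mirror the Perl subs above and are illustrative, not a library API):

```python
import re

# String-building helpers: each returns a regex fragment as a string.
def zero_or_more(expr):   return f"(?:{expr})*"
def one_of(*exprs):       return "(?:" + "|".join(exprs) + ")"
def not_in_charset(cs):   return f"[^{cs}]"
def any_char():           return "."

def quoted_string():
    # Builds "(?:[^\\"]|\\.)*" wrapped in quotes, as in the Perl example.
    return ('"'
            + zero_or_more(one_of(not_in_charset(r'\\"'),
                                  r'\\' + any_char()))
            + '"')

# Building blocks compose: a simple  key = "value"  assignment.
assignment = re.compile(r"\w+\s*=\s*" + quoted_string())
```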

j_random_hacker
+3  A: 

There are ways to make REs in their usual form more readable (such as Perl's /x syntax), and there are several much wordier languages for expressing them.

I note, however, that a lot of old hands don't seem to like them.

There is no fundamental reason you couldn't write a compiler for a wordy RE language targeting a compact one, but I don't see any great advantage in it. If you like the wordy form, just use it.

dmckee
+4  A: 

Regular expressions (well, "real" regular expressions, none of that modern stuff ;)) are finite state machines. Therefore, you can create a syntax that describes a regular expression in terms of states, edges, and input (and possibly output) labels. The fsmtools of AT&T support something like that, but they are far from a tool ready for everyday use.
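To illustrate the point: a hand-written state table (states and transitions chosen here for illustration) that accepts the same language as the regex ab*c:

```python
# A classical regex is exactly a finite state machine. This DFA accepts
# the language of ab*c: an 'a', any number of 'b's, then a 'c'.
TRANSITIONS = {
    (0, "a"): 1,   # start state: consume the leading 'a'
    (1, "b"): 1,   # loop on 'b'
    (1, "c"): 2,   # 'c' moves to the accepting state
}
ACCEPTING = {2}

def accepts(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:       # no edge for this input: reject
            return False
    return state in ACCEPTING
```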

The language in XFST, the Xerox finite state toolkit, is also more verbose.

Apart from that, I'd say that if your regular expression becomes too complex, you should move on to something with more expressive power.

Torsten Marek
+1  A: 

XML Schema's "content model" is an example of what you want.

c(a|d)+r

can be expressed as a content model in XML Schema as:

<sequence>
 <element name="c" type="xs:string"/>
 <choice minOccurs="1" maxOccurs="unbounded">
  <element name="a" type="xs:string"/>
  <element name="d" type="xs:string"/>
 </choice>
 <element name="r" type="xs:string"/>
</sequence>

Relax NG has another way to express the same idea. It doesn't have to be an XML format itself (Relax NG also has an equivalent non-XML syntax).

The readability of regex is lowered by all the escaping necessary, and a format like the above reduces the need for that. Regex readability is also lowered when the regex becomes complex, because there is no systematic way to compose larger regular expressions from smaller ones (though you can concatenate strings). Modularity usually helps. But for me, the shorter syntax is tremendously easier to read (I often convert XML Schema content models into regex to help me work with them).
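(Composing the regex from named pieces, as the content model does, makes the correspondence visible; a minimal Python sketch:)

```python
import re

# The content model above describes the same language as c(a|d)+r.
# Naming the pieces gives the modularity the XML form has.
C = "c"
A_OR_D = "(?:a|d)"
R = "r"

PATTERN = re.compile(C + A_OR_D + "+" + R)
```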

13ren
A: 

I agree that the line-noise syntax of regexps is a big problem, and frankly I don't understand why so many people accept or defend it; it's simply not human-readable.

Something you don't mention in your post, but which is almost as bad, is that nearly every language, editor, or tool has its own variation on regexp syntax. Some support POSIX syntax as it was defined so many years ago; some support Perl syntax as it is today. But many have their own independent ways of expressing things, or their own notions of which characters are "special" and which are not, what is escaped and what isn't, and so on. Not only is it difficult to read a regexp written for another language or tool, but even if you totally memorize the syntax rules of your favourite variation, they can trip you up in a different language, where {2,3} no longer means what you expect. It's truly a mess.

Furthermore, I think there are many non-programmers who (if they knew it existed) would appreciate having a pattern-matching language they could use in everyday tools like Google or Microsoft Word. But there would need to be an easier syntax for it.

So, to answer your question, I have often thought of making some kind of cross-platform, cross-language, cross-everything library that would allow you to "translate" from any regexp syntax (be it Perl, or POSIX, or Emacs, etc) into any other regexp syntax. So that you wouldn't have to worry if Python regexps could do negative look-behind, or if character-class brackets should be escaped in an Emacs regexp. You could just memorize one syntax, then make a function call to get out the equivalent syntax for whatever you happened to be using.

From there, it could be extended with a new pattern-matching language, that would be a bit more verbose or at least more mnemonic. Something for people who don't want to spend half-an-hour studying a regexp to figure out what it does. (And people who think regexps are fine as they are have obviously never had to maintain anything they didn't write themselves, or they would understand the need for other people to be able to parse what they've written.)

Will I ever attempt such a beast? I don't know, it's been on my to-do list for a long time, and there are a lot of easier and more entertaining projects on there as well. But if you are contemplating something similar, let me know.

A: 

regular expression compiler:

ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/83/972/CS-TR-83-972.pdf

Ben