views: 918
answers: 10

I have had the need to use regular expressions only a few times in the work that I have done; however, in those few times I discovered a very powerful form of expression that would enable me to do some extremely useful things.

The problem is that the language used for regular expressions is wrong - full stop.

It is wrong from a psychological point of view: disembodied symbols serve as a useful reference only to those with an eidetic (photographic) memory. While the syntactic rules are clearly laid out, in my experience and that of others, evolving a regular expression that works correctly can be difficult in all but the most trivial situations. This is understandable, since it is a symbolic analogue of set theory, which is a fairly complicated thing.

One of the things that can prove difficult is dissolving the expression you are working on into its discrete parts. Due to the nature of the language, it is possible to read one regular expression in multiple ways if you don't understand its primary goal, so interpreting other people's regexes is complicated. In the study of natural language, I believe this is called pragmatics.

The question I'd like to ask, then, is this: is there such a thing as a regular expression compiler? Or can one even be built?

It could be possible to consider regexes, metaphorically, as assembly language - there are some similarities. Could a compiler be designed that turns a more natural, higher-level language into regular expressions? Then, in my code, I could define my regexes using the higher-level language in a header file and reference them where necessary using a symbolic reference. I and others could refer from my code to the header file and more easily appreciate what I am trying to achieve with my regexes.
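The kind of thing I mean might look like this (a rough sketch in Python; the builder names `literal`, `digits`, and `compile_pattern` are invented for illustration, not an existing library):

```python
import re

# Hypothetical "higher-level language" compiled down to a regex.
# Each helper returns a regex fragment; compile_pattern joins them.
def literal(s):
    return re.escape(s)          # escape metacharacters in a literal

def digits(n):
    return rf"\d{{{n}}}"         # exactly n decimal digits

def compile_pattern(*parts):
    return re.compile("".join(parts))

# "two digits, a colon, two digits" - i.e. \d{2}:\d{2}
TIME = compile_pattern(digits(2), literal(":"), digits(2))
```

The symbolic name `TIME` could then live in one place and be referenced from the rest of the code.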

I know it can be done from a logical point of view - otherwise computers wouldn't be possible - but if you have read this far (which is unlikely :) then would you consider investing the time in realising it?

+4  A: 

I have never stumbled across anything like that, and I don't think it would be useful.

That higher-level language would be very verbose and my guess is that you'd need pretty long statements to come up with a regular expression of average complexity.

Maybe you just haven't been using regular expressions often enough. Believe me, my memory is far from being eidetic (or even good), but I rarely have problems crafting regular expressions or understanding those of my coworkers.

innaM
+1  A: 

One way you can bypass this problem is by using a program like QuickREx, which shows how a regex works on multiple test data (with highlights). You could save the test data in a file near your regex, and later, when you want to change, understand, or fix it, that would be much easier.

01
A: 

Have you considered using a parser generator (aka compiler compiler) such as ANTLR?

ANTLR also has some kind of IDE (ANTLR Works) where you can visualize/debug parsers.

On the other hand, a parser generator is not something you can throw into your app in a few seconds like a regex - and it would be total overkill for something like checking the format of an email address. For simple situations like that, a better way is just to write comments for your regex explaining what it does.

Fionn
+4  A: 

What about writing them with RegexBuddy and pasting the description it generates as a comment in your code?

Andrea Ambu
+1: regex is extremely hard to read, but this is a tooling issue, not a language issue
Michael Haren
+4  A: 

1) Perl permits the /x switch on regular expressions to enable comments and whitespace to be included inside the regex itself. This makes it possible to spread a complex regex over several lines, using indentation to indicate block structure.
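(Python offers the same facility via the `re.VERBOSE` flag; a rough equivalent of such a commented regex, matching a quoted string:)

```python
import re

# re.VERBOSE is Python's analogue of Perl's /x: whitespace is ignored
# and # starts a comment, so the pattern can be laid out like code.
quoted_string = re.compile(r"""
    "            # opening quote
    (?:
        [^\\"]   # any character except backslash or quote
      | \\.      # ...or a backslash-escaped character
    )*
    "            # closing quote
""", re.VERBOSE)
```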

2) If you don't like the line-noise-resembling symbols themselves, it's not too hard to write your own functions that build regular expressions. E.g. in Perl:

sub at_start { '^'; }
sub at_end { '$'; }
sub any { "."; }
sub zero_or_more { "(?:$_[0])*"; }
sub one_or_more { "(?:$_[0])+"; }
sub optional { "(?:$_[0])?"; }
sub remember { "($_[0])"; }
sub one_of { "(?:" . join("|", @_) . ")"; }
sub in_charset { "[$_[0]]"; }        # I know it's broken for ']'...
sub not_in_charset { "[^$_[0]]"; }   # I know it's broken for ']'...

Then e.g. a regex to match a quoted string (/^"(?:[^\\"]|\\.)*"/) becomes:

at_start .
'"' .
zero_or_more(
    one_of(
        not_in_charset('\\\\"'),    # Yuck, 2 levels of escaping required
        '\\\\' . any
    )
) .
'"'

Using this "string-building functions" strategy lends itself to expressing useful building blocks as functions (e.g. the above regex could be stored in a function called quoted_string(), you might have other functions for reliably matching any numeric value, an email address, etc.).
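The same strategy ports directly to other languages. A rough Python equivalent (the helper names mirror the Perl subs above and are illustrative, not a library API):

```python
import re

# String-building helpers: each returns a regex fragment as a string.
def zero_or_more(expr):   return f"(?:{expr})*"
def one_of(*exprs):       return "(?:" + "|".join(exprs) + ")"
def not_in_charset(cs):   return f"[^{cs}]"
def any_char():           return "."

def quoted_string():
    # Builds "(?:[^\\"]|\\.)*" wrapped in quotes, as in the Perl example.
    return ('"'
            + zero_or_more(one_of(not_in_charset(r'\\"'),
                                  r'\\' + any_char()))
            + '"')

# Building blocks compose: a simple  key = "value"  assignment.
assignment = re.compile(r"\w+\s*=\s*" + quoted_string())
```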

j_random_hacker
+3  A: 

There are ways to make REs in their usual form more readable (such as Perl's /x syntax), and there are several much wordier languages for expressing them.

I note, however, that a lot of old hands don't seem to like them.

There is no fundamental reason you couldn't write a compiler for a wordy RE language targeting a compact one, but I don't see any great advantage in it. If you like the wordy form, just use it.

dmckee
+4  A: 

Regular expressions (well, "real" regular expressions, none of that modern stuff ;)) are finite state machines. Therefore, you can create a syntax that describes a regular expression in terms of states, edges, and input (and possibly output) labels. The fsmtools of AT&T support something like that, but they are far from a tool ready for everyday use.
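To illustrate the point: a hand-written state table (states and transitions chosen here for illustration) that accepts the same language as the regex ab*c:

```python
# A classical regex is exactly a finite state machine. This DFA accepts
# the language of ab*c: an 'a', any number of 'b's, then a 'c'.
TRANSITIONS = {
    (0, "a"): 1,   # start state: consume the leading 'a'
    (1, "b"): 1,   # loop on 'b'
    (1, "c"): 2,   # 'c' moves to the accepting state
}
ACCEPTING = {2}

def accepts(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:       # no edge for this input: reject
            return False
    return state in ACCEPTING
```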

The language in XFST, the Xerox finite state toolkit, is also more verbose.

Apart from that, I'd say that if your regular expression becomes too complex, you should move on to something with more expressive power.

Torsten Marek
+1  A: 

XML Schema's "content model" is an example of what you want.

c(a|d)+r

can be expressed as a content model in XML Schema as:

<sequence>
 <element name="c" type="xs:string"/>
 <choice minOccurs="1" maxOccurs="unbounded">
  <element name="a" type="xs:string"/>
  <element name="d" type="xs:string"/>
 </choice>
 <element name="r" type="xs:string"/>
</sequence>

Relax NG has another way to express the same idea. It doesn't have to be an XML format itself (Relax NG also has an equivalent non-XML syntax).

The readability of regex is lowered by all the escaping necessary, and a format like the above reduces the need for that. Regex readability is also lowered when the regex becomes complex, because there is no systematic way to compose larger regular expressions from smaller ones (though you can concatenate strings). Modularity usually helps. But for me, the shorter syntax is tremendously easier to read (I often convert XML Schema content models into regex to help me work with them).
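(Composing the regex from named pieces, as the content model does, makes the correspondence visible; a minimal Python sketch:)

```python
import re

# The content model above describes the same language as c(a|d)+r.
# Naming the pieces gives the modularity the XML form has.
C = "c"
A_OR_D = "(?:a|d)"
R = "r"

PATTERN = re.compile(C + A_OR_D + "+" + R)
```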

13ren
A: 

I agree that the line-noise syntax of regexps is a big problem, and frankly I don't understand why so many people accept or defend it; it's simply not human-readable.

Something you don't mention in your post, but which is almost as bad, is that nearly every language, editor, or tool has its own variation on regexp syntax. Some support POSIX syntax as it was defined so many years ago; some support Perl syntax as it is today. But many have their own independent ways of expressing things, or their own notions of which characters are "special" and which are not, what is escaped and what isn't, and so on. Not only is it difficult to read a regexp written for another language or tool, but even if you totally memorize the syntax rules of your favourite variation, they can trip you up in a different language, where {2,3} no longer means what you expect. It's truly a mess.

Furthermore, I think there are many non-programmers who (if they knew it existed) would appreciate having a pattern-matching language they could use in everyday tools like Google or Microsoft Word. But there would need to be an easier syntax for it.

So, to answer your question, I have often thought of making some kind of cross-platform, cross-language, cross-everything library that would allow you to "translate" from any regexp syntax (be it Perl, or POSIX, or Emacs, etc) into any other regexp syntax. So that you wouldn't have to worry if Python regexps could do negative look-behind, or if character-class brackets should be escaped in an Emacs regexp. You could just memorize one syntax, then make a function call to get out the equivalent syntax for whatever you happened to be using.

From there, it could be extended with a new pattern-matching language, that would be a bit more verbose or at least more mnemonic. Something for people who don't want to spend half-an-hour studying a regexp to figure out what it does. (And people who think regexps are fine as they are have obviously never had to maintain anything they didn't write themselves, or they would understand the need for other people to be able to parse what they've written.)

Will I ever attempt such a beast? I don't know, it's been on my to-do list for a long time, and there are a lot of easier and more entertaining projects on there as well. But if you are contemplating something similar, let me know.

A: 

regular expression compiler:

ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/83/972/CS-TR-83-972.pdf

Ben