tags:

views:

349

answers:

4

I know this question seems stupid, but it isn't. I mean what is it exactly. I have a fair understanding of the parsing problem. I know BNF/EBNF, I've written grammar to parse simple context-free languages in one of my college courses. I just never met regular expressions before! The only thing that I remember about it is that context-free grammar can do all what regular expression can do.

Also, is it useful for a usual coding to parse strings? A simple example would be helpful.

+14  A: 

A regular expression is a specialized language for pattern matching. They're used in many text editors and programming languages for string matching.

You can do many more complicated things with regular expressions as well. There's a great O'Reilly book on the subject, and numerous examples on the web.

The thing that you can't do with regular expressions is proper parsing, because regular expressions aren't a sufficient language to encode a grammar. They're specialized for pattern matching, and if you try to use them for parsing something like XML, you'll likely have problems down the road. More specifically, you can't parse arbitrarily nested recursive structures using regular expressions. A simple example of a problem that a regular expression can't solve well is a set of nested braces like you would find in C:

int main() {    
    void func() {
    }   
}

You can make regular expressions solve this up to a certain point, but the memory requirements for this grow arbitrarily large as the number of braces grows. If you're interested in more detail, read this other StackOverflow question about why such a construct is difficult to parse with regular expressions:

Can regular expressions be used to match nested patterns?

Different languages implement regular expressions in different ways, but the Perl implementation is very popular. The family of regular expressions that are compatible with Perl are called PCRE, or Perl-Compatible Regular Expressions. Here's an example in Perl of a regular expression that can match integers:

#!/usr/bin/perl

use strict;
use warnings;

match_string( "one-two" );
match_string( "1-2" );

sub match_string {
   my $string = shift;
   if ( $string =~ /(\d+)/ ) {
      print "$string matches!\n";
      print "matched: ", $1, "\n";
   } else {
      print "$string doesn't match!\n";
   }
}  

$ perl test.pl 
one-two doesn't match!
1-2 matches!
matched: 1

In this example, the regular expression matches one or more examples of a digit. Here's the line:

   if ( $string =~ /(\d+)/ ) {

The way to read this is:

  • inside the conditional, the string is being matched against the regular expression between /'s.
  • the \d character translates to a digit, 0-9.
  • the + means "one or more times."
  • the parens () mean capture this match, and put it into a special variable. Because this is the first match, it's put into $1.

In some languages (such as Perl), you can also use regular expressions for doing substitutions, like this:

substitute_string( "one-two" );
substitute_string( "1-2" );

sub substitute_string {
   my $string = shift;
   print "before: ",  $string, "\n";
   $string =~ s/1/one/g;
   $string =~ s/2/two/g;
   print "after: ",  $string, "\n";
}

$ perl test.pl 
before: one-two
after: one-two
before: 1-2
after: one-two

Hopefully that's enough to get you started!

James Thompson
Well explained. Please if you perfect the answer more and give a glimpse about the what regular expressions can not do but context-free grammar can.
AraK
Perl6 Grammar Engine can actually parse any arbitrarily nested recursive structures. You can also do that with the new features of `Perl5.10`, it's just not as elegant.
Brad Gilbert
A: 

a regular expression is a string that a "regular expression matcher" can read and learn how to identify particular patterns in large texts.

Q1: You have a large HTML page. you want a matcher to match all unordered lists (<ul>s). How do you do it?

Answer: Regular expressions.

Q2: You have a long list of names, and you want to find people whose first name is Chang (but NOT the last name). How do you do it?

Answer: Regular expressions.

Google about regular expressions to learn more.

Here Be Wolves
-1 because the first answer is blatantly wrong. HTML is not a regular language.
Svante
-1 This answer is totally missing out on what a regular expression is, and it's roots in automata and language theory.
Yuval A
@svante He never said HTML is a regular language. If you've converted the HTML to a string and there are no nested lists, one can use regular expressions to find the ul's. In Explorer this is often faster than using the DOM. He was answering the question.
SamGoody
+1  A: 

Other people have covered what a regular expression is, and what it can be used for, so I won't rehash previous answers. However, if you're interested in learning about regular expression syntax (ie. how to construct a regular expression), check out the Tutorial section at regular-expression.info; it's probably the most indepth regular expression syntax resource on the internet.

Daniel Vandersluis
+28  A: 

Regular expressions first came around in mathematics and automata theory. A regular expression is simply something which defines a regular language. Without going too much into what "regular" means, think of a language as this way:

  1. A language is made up of strings. English is a language, for example, and its made of strings.
  2. Those strings are made of symbols - called an alphabet. So a string is just a concatenation of symbols from the alphabet.

So you could have a string (which is, remember, just a concatenation of symbols) which is not part of a given language. Or it could be in the language.

So lets say you have an alphabet made of 2 symbols: "0" and "1". And lets say you want to create a language using the symbols in that alphabet. You could create the following rule: "In order for a string to be in my language, it must have only 0's and 1's in it."

So these strings are in your language:

  • 0
  • 1
  • 01
  • 11001101
  • ...etc

These would not be in your language:

  • 2
  • peaches
  • 00101105

That's a pretty simple language. How about this: "In my language, each string [analogous to a valid 'word' in English] must being with a 0, and then can be followed by any number of 0's or 1's"

These are in the language:

  • 0111111
  • 0000000
  • 0101010110001

These are not:

  • 1
  • 10000
  • 1010
  • 2000000

Well rather than defining the language using words - and these languages might get very complex ("1 followed by 2 0's followed by any combination of 1's and 0's ending with a 1"), we came up with this syntax called "regular expressions" to define the language.

The first language would have been:

(0|1)*

(0 or 1, repeated infinitely)

The next: 0(0|1)*

(0, followed by any number of 0's and 1's).

So lets think of programming now. When you create a regex, you are saying "Look at this text. Return to me strings which match this pattern." Which is really saying "I have defined a language. Return to me all strings within this document which are in my language."

So when you create a "regex", you are actually defining a regular language, which is a mathematical concept. (In actuality, perl-like regex define "nonregular" languages, but that is a separate issue.)

By learning the syntax of regex, you are learning the ins and outs of how to create a language, so that later you can see if a given string is "in" the language. Thus, commonly, people say that regex are for pattern matching - which is basically what you are doing when you look at a pattern, and see if it "matches" the rules for your language.

(this was long. does it answer your question at all?)

rascher
Good answer -- this gets to the heart of the matter. Intuitively, regular expressions have no memory. A pure regular expression can't match parenthesis to an arbitary depth but a context free language can do that easily. However, most regular expression implementations happily let you do more than the technical definition.
George Phillips
Beautiful explanation. George filled the last tiny gap :)
AraK
BTW, is the answer to the 'third language is "100(0|1)*1" ? :)
AraK
Yep, that looks right to me. Also worth noting: perl regex syntax is different from math regex syntax. But some of it is similar, and the concepts are very similar too.
rascher
This is a really good answer on the underlying theory behind regexes. :)
James Thompson
Please keep in mind though, that what most programming languages actually implement under the name "regular expression" has nothing to do with the *actual* definition of "regular expression". This is the reason why, for example, Larry Wall, the creator of Perl, has renamed regular expressions to "regex" in Perl6, to avoid any confusion: http://dev.perl.org/perl6/doc/design/apo/A05.html
Jörg W Mittag