What's the best way to use regular expressions with options (flags) in Haskell?

I use

Text.Regex.PCRE

The documentation lists a few interesting options such as compCaseless and compUTF8, but I don't know how to use them with (=~).

+6  A: 

I believe you cannot use (=~) if you wish to use a compOpt other than defaultCompOpt.

Something like this works:

match (makeRegexOpts compCaseless defaultExecOpt  "(Foo)" :: Regex) "foo" :: Bool
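
A minimal, self-contained sketch of the same idea, assuming the regex-pcre package is installed. The name caselessFoo is made up for illustration; everything else comes from Text.Regex.PCRE.

```haskell
import Text.Regex.PCRE

-- Hypothetical name: the pattern compiled once with the caseless flag
-- instead of the default compile options.
caselessFoo :: Regex
caselessFoo = makeRegexOpts compCaseless defaultExecOpt "(Foo)"

main :: IO ()
main = do
  print (match caselessFoo "foo" :: Bool)  -- True: case is ignored
  print (match caselessFoo "bar" :: Bool)  -- False: nothing to match
  print ("foo" =~ "(Foo)" :: Bool)         -- False: (=~) compiles with the default options
```

Compiling the regex once and reusing it also avoids recompiling the pattern for every match.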

The following two articles should assist you:

Real World Haskell, Chapter 8. Efficient file processing, regular expressions, and file name matching

A Haskell regular expression tutorial

Dave Tapley
+11  A: 

All the Text.Regex.* modules make heavy use of typeclasses, which are there for extensibility and "overloading"-like behavior, but make usage less obvious from just seeing types.

You have probably started with the basic =~ matcher.

(=~) ::
  ( RegexMaker Regex CompOption ExecOption source
  , RegexContext Regex source1 target )
  => source1 -> source -> target
(=~~) ::
  ( RegexMaker Regex CompOption ExecOption source
  , RegexContext Regex source1 target, Monad m )
  => source1 -> source -> m target

To use =~, there must exist an instance of RegexMaker ... for the pattern on the RHS, and an instance of RegexContext ... for the text on the LHS and the result.

class RegexOptions regex compOpt execOpt
      | regex -> compOpt execOpt
      , compOpt -> regex execOpt
      , execOpt -> regex compOpt
class RegexOptions regex compOpt execOpt
      => RegexMaker regex compOpt execOpt source
         | regex -> compOpt execOpt
         , compOpt -> regex execOpt
         , execOpt -> regex compOpt
  where
    makeRegex :: source -> regex
    makeRegexOpts :: compOpt -> execOpt -> source -> regex

A valid instance of all these classes (for example, regex=Regex, compOpt=CompOption, execOpt=ExecOption, and source=String) means it's possible to compile a regex with compOpt,execOpt options from some form of source. (Also, given some regex type, there is exactly one compOpt,execOpt pair that goes along with it. Lots of different source types are okay, though.)
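
As a rough illustration of "lots of different source types", the sketch below compiles the same pattern from a String and from a strict ByteString. It assumes the regex-pcre and bytestring packages; the names regexFromString and regexFromBytes are invented for this example.

```haskell
import qualified Data.ByteString.Char8 as B
import Text.Regex.PCRE

-- Hypothetical names: both values have the same Regex type;
-- only the type of the pattern source differs.
regexFromString :: Regex
regexFromString = makeRegexOpts compCaseless defaultExecOpt "haskell"

regexFromBytes :: Regex
regexFromBytes = makeRegexOpts compCaseless defaultExecOpt (B.pack "haskell")

main :: IO ()
main = do
  print (match regexFromString "Learn HASKELL" :: Bool)
  print (match regexFromBytes (B.pack "Learn HASKELL") :: Bool)
```

In both cases the option types are CompOption and ExecOption, which is what the functional dependencies on regex express.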

class Extract source
class Extract source
      => RegexLike regex source
class RegexLike regex source
      => RegexContext regex source target
  where
    match :: regex -> source -> target
    matchM :: Monad m => regex -> source -> m target

A valid instance of all these classes (for example, regex=Regex, source=String, target=Bool) means it's possible to match a source and a regex to yield a target. (Other valid targets given these specific regex and source are Int, MatchResult String, MatchArray, etc.)
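
To make the role of target concrete, here is a small sketch that matches one regex against one string and only changes the requested result type. It assumes regex-pcre is installed; digits and sample are hypothetical names, and the String and triple targets are additional instances from regex-base that are not listed above.

```haskell
import Text.Regex.PCRE

-- Hypothetical names used only in this example.
digits :: Regex
digits = makeRegex "[0-9]+"

sample :: String
sample = "a1 bb22 ccc333"

main :: IO ()
main = do
  print (match digits sample :: Bool)                      -- is there any match?
  print (match digits sample :: Int)                       -- how many matches?
  print (match digits sample :: String)                    -- text of the first match
  print (match digits sample :: (String, String, String))  -- (before, match, after)
```

The type annotation is what selects the RegexContext instance, so the same call can answer several different questions.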

Put these together and it's pretty obvious that =~ and =~~ are simply convenience functions

source1 =~ source
  = match (makeRegex source) source1
source1 =~~ source
  = matchM (makeRegex source) source1

and also that =~ and =~~ leave no room to pass various options to makeRegexOpts.

You could make your own

(=~+) ::
   ( RegexMaker regex compOpt execOpt source
   , RegexContext regex source1 target )
   => source1 -> (source, compOpt, execOpt) -> target
source1 =~+ (source, compOpt, execOpt)
  = match (makeRegexOpts compOpt execOpt source) source1
(=~~+) ::
   ( RegexMaker regex compOpt execOpt source
   , RegexContext regex source1 target, Monad m )
   => source1 -> (source, compOpt, execOpt) -> m target
source1 =~~+ (source, compOpt, execOpt)
  = matchM (makeRegexOpts compOpt execOpt source) source1

which could be used like

"string" =~+ ("regex", compCaseless + compUTF8, execBlank) :: Bool

or replace =~ and =~~ with versions that can accept options

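-- Untested sketch: the two instances below overlap, so GHC will likely need
-- extensions such as FlexibleInstances, UndecidableInstances, and overlapping instances.
import Text.Regex.PCRE hiding ((=~), (=~~))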

class RegexSourceLike regex source
  where
    makeRegexWith :: source -> regex
instance RegexMaker regex compOpt execOpt source
         => RegexSourceLike regex source
  where
    makeRegexWith = makeRegex
instance RegexMaker regex compOpt execOpt source
         => RegexSourceLike regex (source, compOpt, execOpt)
  where
    makeRegexWith (source, compOpt, execOpt)
      = makeRegexOpts compOpt execOpt source

source1 =~ source
  = match (makeRegexWith source) source1
source1 =~~ source
  = matchM (makeRegexWith source) source1

or you could just use match, makeRegexOpts, etc. directly where needed.
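
For reference, the (=~+) operator from above can be put into a complete module roughly as follows. This is a sketch that assumes regex-pcre is installed; the FlexibleContexts pragma is included only as a precaution, and the test strings are invented for illustration.

```haskell
{-# LANGUAGE FlexibleContexts #-}
import Text.Regex.PCRE

-- Same definition as above: like (=~), but the right-hand side
-- carries the compile and exec options along with the pattern.
(=~+) :: ( RegexMaker regex compOpt execOpt source
         , RegexContext regex source1 target )
      => source1 -> (source, compOpt, execOpt) -> target
source1 =~+ (source, compOpt, execOpt)
  = match (makeRegexOpts compOpt execOpt source) source1

main :: IO ()
main = do
  -- Flags are combined with (+), as in the usage example above.
  print ("Hello\nWORLD" =~+ ("^world$", compCaseless + compMultiline, execBlank) :: Bool)
  -- Without compMultiline, ^ only matches at the very start of the text.
  print ("Hello\nWORLD" =~+ ("^world$", compCaseless, execBlank) :: Bool)
```

Because compOpt determines regex through the functional dependencies, passing a CompOption is enough for the compiler to pick the PCRE Regex type.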

ephemient
Ah, it seems that I've been beaten to the solution. That's what I get for writing all sorts of unnecessary stuff :-/
ephemient
Ah, I feel a little guilty now, yours certainly offers a much more comprehensive overview! I like your suggestion for (=~+) by the way.
Dave Tapley
It's a very complete and comprehensive answer indeed. I'd like to reward the effort, but I don't know if it's common practice to switch the "accepted answer". Anyway, I'm new to Haskell, and this answer really helped me understand some clever principles of the language. (Also, a little typo at the beginning: you wrote =~ instead of =~~.)
Gaetan Dubar
It's perfectly acceptable to switch the accepted answer when a new and better answer is given. Though I must say that (?i) is a whole lot less typing.
Jan Goyvaerts
Right, I don't have Text.Regex.PCRE installed, and I wasn't thinking of that. This solution (or something very similar) should work for anything from Text.Regex.*.
ephemient
+2  A: 

I don't know anything about Haskell, but if you're using a regex library based on PCRE, then you can use mode modifiers inside the regular expression. To match "caseless" in a case insensitive fashion, you can use this regex in PCRE:

(?i)caseless

The mode modifier (?i) overrides any case sensitivity or case insensitivity option that was set outside the regular expression. It also works with operators that don't allow you to set any options.

Similarly, (?s) turns on "single line mode" which makes the dot match line breaks, (?m) turns on "multi line mode" which makes ^ and $ match at line breaks, and (?x) turns on free-spacing mode (unescaped spaces and line breaks outside character classes are insignificant). You can combine the letters. (?ismx) turns on everything. A hyphen turns off options. (?-i) makes the regex case sensitive. (?x-i) starts a free-spacing case sensitive regex.
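
In Haskell this would presumably look like the sketch below, which assumes Text.Regex.PCRE from the regex-pcre package and uses the plain (=~) operator with no extra options. The test strings are made up for illustration.

```haskell
import Text.Regex.PCRE

main :: IO ()
main = do
  -- (?i) makes the match case-insensitive from inside the pattern.
  print ("CASELESS" =~ "(?i)caseless" :: Bool)
  -- The same pattern without the modifier stays case-sensitive.
  print ("CASELESS" =~ "caseless" :: Bool)
  -- (?s) lets the dot match the line break between a and b.
  print ("a\nb" =~ "(?s)a.b" :: Bool)
  print ("a\nb" =~ "a.b" :: Bool)
```

The first and third lines should report a match and the other two should not, since only the inline modifiers change how the pattern is interpreted.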

Jan Goyvaerts
It works too! It's much simpler, but also less generic than the accepted solution.
Gaetan Dubar