I can't seem to find decent documentation on Haskell's POSIX regex implementation, specifically the module Text.Regex.Posix.

Can anyone point me in the right direction for multiline matching on a string?

A snippet for the curious:

> extractToken body = body =~ "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>" :: String

I'm trying to extract the source of Wikipedia pages, but this method clearly falls over when the textarea contents span more than one line.

+2  A: 

You may need to use the PCRE backend instead if you want to do anything more flexible, or with better performance, than Posix regexes.

pcre-light and regex-pcre are both fine.
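
As a rough sketch of the PCRE route, assuming the regex-pcre package is installed: the inline (?s) flag makes '.' match newlines, and (.*?) is a non-greedy group. The name extractTokenPcre is made up for this illustration.

```haskell
import Text.Regex.PCRE ((=~))

-- Hypothetical helper: returns the first capture group of the first match,
-- or Nothing if the pattern does not match at all.
extractTokenPcre :: String -> Maybe String
extractTokenPcre body =
  case body =~ pat :: [[String]] of
    -- Each inner list is the whole match followed by its capture groups.
    ((_ : firstGroup : _) : _) -> Just firstGroup
    _                          -> Nothing
  where
    -- (?s) is PCRE's "dot matches newline" flag; (.*?) stops at the first
    -- closing tag instead of the last one.
    pat = "(?s)<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*?)</textarea>"
```

Applying extractTokenPcre to the page body should then give the textarea contents wrapped in Just, including any embedded newlines.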

Don Stewart
That would definitely be the preferred choice; however, our research group has to run this on our university server, whose administrators may or may not approve the addition of new modules.
Ian Elliott
A: 

I solved it in this case by matching

((.*)|\n*)*

Although this may not always work depending on your expression. The above solution is probably the best way to go if you're able to.

Ian Elliott
+4  A: 

You may need to import Text.Regex.Base.RegexLike for access to makeRegexOpts and friends.

extractToken body = match regex body where
    regex = makeRegexOpts (defaultCompOpt - compNewline) defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

Since Text.Regex.Posix's defaultCompOpt = compExtended + compNewline, this is equivalent to

extractToken body = match regex body where
    regex = makeRegexOpts compExtended defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

To pull out just the first group, ask match for one of the other result types (the instances of RegexContext). One possibility is

extractToken body = head groups where
    (preMatch, inMatch, postMatch, groups) =
        match regex body :: (String, String, String, [String])
    regex = makeRegexOpts compExtended defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"
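
For completeness, here is the same idea as a self-contained sketch with imports and type signatures, assuming only the regex-posix package. Returning Maybe instead of calling head is my own variation, and the name textareaRegex is introduced just for this example.

```haskell
import Text.Regex.Posix

-- Compiled with compExtended only, i.e. without compNewline, so that '.'
-- is allowed to match '\n'.
textareaRegex :: Regex
textareaRegex =
  makeRegexOpts compExtended defaultExecOpt
    "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

-- Returns the first capture group, or Nothing when the regex does not match
-- (the 4-tuple result then carries an empty group list).
extractToken :: String -> Maybe String
extractToken body =
  case match textareaRegex body :: (String, String, String, [String]) of
    (_, _, _, firstGroup : _) -> Just firstGroup
    _                         -> Nothing
```

Note that POSIX has no non-greedy quantifiers, so (.*) runs to the last </textarea> in the input; that should be fine as long as the page has only one such element.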
ephemient
Works great, thanks. Also, is there any way to return just the match (.*), or is that only in PCRE?
Ian Elliott