ansaurus

Question

Perl RegEx: Limiting the pattern to only the first occurrence of a character

Answer 1

+2 A:

Judging from your example, you could probably try:

if ($fileContent =~ m/<DATE[^>]*>([^<]+)/) {
  # use $1 here
  # you may need to strip new lines
}
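As a minimal sketch of how this pattern behaves, assuming input shaped like the samples quoted elsewhere in this thread (the helper name `first_date` is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: return the text that follows the first <DATE ...> tag,
# up to (but not including) the next '<', or undef if there is no match.
sub first_date {
    my ($content) = @_;
    if ($content =~ m/<DATE[^>]*>([^<]+)/) {
        my $date = $1;
        $date =~ s/^\s+|\s+$//g;   # strip surrounding whitespace, including newlines
        return $date;
    }
    return;
}

# [^>]* stops at the first '>', so attributes are skipped without overshooting.
print first_date('<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>'), "\n";
```

Because `[^<]+` also matches newlines, the same helper should cope with a date that sits on its own line after the opening tag.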

KennyTM 2010-07-27 13:07:53

Hi Ken. Thanks for the regex, it certainly worked. The reason I did not use an XML parser is that there are about 20,000 SGML files I need to check, each about 50K in size. Parsing them all seems like overkill and would be slow. I might be able to use a SAX-based parser, but I am not a Perl expert, so I am just trying to get this task done ASAP and move on.

Gilbeg 2010-07-28 00:36:26

Answer 2

A:

There is no way to apply a regex across multiple lines, but you can use a small trick. If the files aren't too big, as you mentioned, you can first replace every '\n' character with a placeholder (NEW_LINE or something like that), or delete them, and then apply your pattern.

Klark 2010-07-27 13:12:01

There is. He's doing `local $/ = undef;` which does just that (well, it reads the whole file at once). Read up on Perl regexes in `perldoc perlre`.
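A rough sketch of what that looks like in practice. The sample data and the helper name `slurp` are made up, and an in-memory filehandle stands in for a real file so the snippet is self-contained:

```perl
use strict;
use warnings;

# Sample data standing in for one SGML file; the content is made up.
my $sample = "<DOC>\n<DATE blaAttrib=\"x\">4th July 1936\n<EM>note</EM></DATE>\n</DOC>\n";

# Hypothetical helper: read everything from a filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/ = undef;      # no input record separator: <$fh> returns the whole file
    return scalar <$fh>;
}

# A real script would open a path on disk instead of a scalar reference.
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $fileContent = slurp($fh);
close $fh;

# A negated character class such as [^<] matches newlines too, so no
# modifier is needed here; a plain . would need /s to cross line breaks.
if ($fileContent =~ m/<DATE[^>]*>([^<]+)/) {
    (my $date = $1) =~ s/\s+$//;   # drop the trailing newline
    print "$date\n";
}
```

Using `local` inside the helper keeps the change to `$/` from leaking into the rest of the program.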

MvanGeest 2010-07-27 13:13:20

Answer 3

+3 A:

Use an HTML parser.

Please, use an HTML parser.

But for a regex, I'd try

<DATE(.*?)>(.*)<\/DATE>

which should be faster than KennyTM's alternative... By the way, why are you capturing that second group?
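One thing that may be worth checking with this pattern: the second group is greedy, so if two DATE elements ever share a line it runs to the last closing tag. A small sketch with made-up input (the first capture group is dropped here to keep the comparison short):

```perl
use strict;
use warnings;

# Made-up input with two DATE elements on one line.
my $line = '<DATE>4th July 1936</DATE> and <DATE>5th May 1940</DATE>';

# Greedy (.*) runs to the LAST </DATE> on the line.
my ($greedy) = $line =~ m/<DATE.*?>(.*)<\/DATE>/;

# Non-greedy (.*?) stops at the FIRST </DATE>.
my ($lazy) = $line =~ m/<DATE.*?>(.*?)<\/DATE>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

If each file holds only one DATE element per line, the two forms should give the same result.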

MvanGeest 2010-07-27 13:12:22

Downvote because the question states this isn't XML.

daxim 2010-07-27 14:03:42

Ah, I hadn't noticed that. Still, there are some very resilient parsers around that can handle a huge mess.

MvanGeest 2010-07-27 14:05:40

There are HTML parsers that would do this job nicely.

Ether 2010-07-27 14:56:58

Thanks Ether. I'm still getting used to the idea that users can edit my answers, though. (I knew that when I signed up, but I always wondered how often it would happen, and why. Well, here's a legitimate reason.)

MvanGeest 2010-07-27 17:15:06

I have about 20K SGML files and I just want to check their dates. Parsing them with, say, SGML::Parser would be overkill and slow, unless I used a SAX-based parser. BTW, your regex indeed worked. Thanks!

Gilbeg 2010-07-28 00:40:07

Answer 4

+3 A:

If the date format is fixed, you might want to use something like this:

m/<DATE(.*)>([0-9]+(st|nd|rd|th)\s(January|February|March|April|May|June|July|August|September|October|November|December)\s[0-9]+)(.*)<\/DATE>/
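A pattern this long can be easier to maintain with the `/x` modifier. The following is only a sketch of a variation on the regex above, assuming dates of the form "4th July 1936"; the helper name `fixed_date` is made up, and it uses `[^>]*` for the attributes rather than `(.*)`:

```perl
use strict;
use warnings;

my $month = qr/January|February|March|April|May|June|July
              |August|September|October|November|December/x;

# Hypothetical helper: return the date if it has the form "4th July 1936",
# otherwise undef.
sub fixed_date {
    my ($content) = @_;
    return $content =~ m{
        <DATE [^>]* >                 # opening tag with optional attributes
        \s*
        ( \d+ (?:st|nd|rd|th)         # day with ordinal suffix
          \s+ (?:$month)              # month name
          \s+ \d+ )                   # year
    }x ? $1 : undef;
}

print fixed_date('<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>'), "\n";
```

Anything inside the element that does not look like a date in this format is rejected, which can double as a cheap validity check.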

dark_charlie 2010-07-27 13:12:28

Answer 5

+3 A:

Instead of matching .*, you should match "everything that is not an angle bracket",

i.e.:

if ($string =~ /^<DATE[^>]*>([^<]+)</) {
    # here, $1 is your date
}

benzebuth 2010-07-27 13:29:07

Many thanks... you are right, as Kenny also suggested. Thanks!

Gilbeg 2010-07-28 00:42:01

Answer 6

+3 A:

You should use non-greedy matching and the /s modifier so that . also matches newlines:

my @l = (
'<DATE>4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>'
);

foreach(@l) {
  # \s* keeps the trailing newline out of $1; "\n" puts each date on its own line
  /^<DATE.*?>(.*?)\s*</s && print "$1\n";
}

output:

4th July 1936
4th July 1936
4th July 1936
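
To see what the /s modifier changes on its own, here is a small sketch with made-up input, run once without the modifier and once with it:

```perl
use strict;
use warnings;

my $text = "<DATE>4th July 1936\n<EM>note</EM></DATE>";

# Without /s, . does not match "\n", so the pattern cannot reach </DATE>.
my ($without) = $text =~ m/<DATE>(.*?)<\/DATE>/;

# With /s, . matches any character, including "\n".
my ($with) = $text =~ m/<DATE>(.*?)<\/DATE>/s;

print defined $without ? "matched without /s\n" : "no match without /s\n";
print "with /s: $with\n";
```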

M42 2010-07-27 13:57:14

Answer 7

A:

Even your "hairy" example can be reduced to a similar type. If you are always going to have 1) the actual date on the same line as the start tag--and 2) that's all you want--it doesn't matter where the end tag is.

$fileContent =~ m/<DATE([^>]*)>\s*(\d+\p{Alpha}+\s+\p{Alpha}+\s+\d{4})/

should work for input of that shape. (Since '>' will not appear inside the tag, it is a good idea to use [^>]* rather than .* and avoid the backtracking: a greedy .* eats up the entire line, the expression fails, and the engine then has to give back and check, give back and check, ...)
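
A minimal sketch of that pattern applied to a multi-line element, assuming the date always directly follows the opening tag; the helper name `date_parts` is made up:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern above. In list context it returns
# the attribute text and the date, or an empty list if nothing matches.
sub date_parts {
    my ($fileContent) = @_;
    return $fileContent =~ m/<DATE([^>]*)>\s*(\d+\p{Alpha}+\s+\p{Alpha}+\s+\d{4})/;
}

my $hairy = "<DATE blaAttrib=\"89787adjd98d9\">4th July 1936\n"
          . "<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>";

my ($attrs, $date) = date_parts($hairy);
print "$date\n";    # the end tag on a later line plays no part in the match
```

Note that the helper has to be called in list context to get the captures back; in scalar context a match only reports success or failure.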

Axeman 2010-07-27 18:45:13
