ansaurus

Question

How can I match a repeating pattern with Java regular expressions?

Answer 1

+6 A:

You can just factor it out:

(\d+\.)(\d+\.)*html
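A minimal sketch of how this pattern might be tried from Java, assuming the whole file name is tested with matches(). The class and method names here are made up for illustration. Note that group 1 includes the trailing dot, and that a repeated group only keeps its last repetition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FactoredPattern {
    // The factored pattern from the answer, written as a Java string literal.
    static final Pattern FACTORED = Pattern.compile("(\\d+\\.)(\\d+\\.)*html");

    // Returns group 1 (the first digits plus their dot), or null if the name does not match.
    static String firstGroup(String name) {
        Matcher m = FACTORED.matcher(name);
        return m.matches() ? m.group(1) : null;
    }

    // Returns group 2, which holds only the last repetition of the starred group
    // (null if the name does not match or the group never took part in the match).
    static String lastRepeatedGroup(String name) {
        Matcher m = FACTORED.matcher(name);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstGroup("1.2.3.html"));
        System.out.println(lastRepeatedGroup("1.2.3.html"));
        System.out.println(lastRepeatedGroup("1.html"));
    }
}
```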

jpalecek 2009-04-02 09:21:21

Answer 2

+3 A:

"^(\\d+)\\.(\\d+\\.)*html$"

izb 2009-04-02 09:54:15

+1. I'd take that one step more and make the 2nd group non-capturing -- ^(\d+)\.(?:\d+\.)*html$
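A rough sketch comparing the two variants in Java. The class and method names are hypothetical. Because of the ^ and $ anchors, find() only succeeds when the whole name has the expected form, and the non-capturing version leaves just one group to read.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchoredPattern {
    // Pattern from the answer: the repeated group is capturing, so there are two groups.
    static final Pattern CAPTURING = Pattern.compile("^(\\d+)\\.(\\d+\\.)*html$");
    // Variant from the comment: (?:...) groups without capturing, leaving only group 1.
    static final Pattern NON_CAPTURING = Pattern.compile("^(\\d+)\\.(?:\\d+\\.)*html$");

    // Returns the left-most number without its dot, or null when the name does not fit.
    static String firstNumber(String name) {
        Matcher m = NON_CAPTURING.matcher(name);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("1.2.3.html"));
        System.out.println(firstNumber("x1.2.html"));
        // groupCount() reports the number of capturing groups in each pattern.
        System.out.println(CAPTURING.matcher("").groupCount());
        System.out.println(NON_CAPTURING.matcher("").groupCount());
    }
}
```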

ojrac 2009-04-02 17:01:05

Answer 3

A:

jpalecek's solution fails; it captures the rightmost number. The original poster was a lot closer, but he got the right-most number. To get the left-most number, ignore anything after the first dot:

[^\d]*(\d+)\..*html

[^\d]* ignores everything before the left-most number (so X1.html captures the number 1). (\d+)\. captures the first digits, if they are followed by a dot. .* ignores everything between the dot and the final html.
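For illustration, a small sketch of this pattern in Java, assuming the whole name is tested with matches(). The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeftMostNumber {
    // Skip leading non-digits, capture the first digits before a dot, allow anything up to "html".
    static final Pattern LEFT_MOST = Pattern.compile("[^\\d]*(\\d+)\\..*html");

    // Returns the left-most number, or null if the name does not match.
    static String leftMost(String name) {
        Matcher m = LEFT_MOST.matcher(name);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(leftMost("X1.html"));
        System.out.println(leftMost("1.2.3.html"));
    }
}
```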

MSalters 2009-04-02 09:56:39

Did you mean it fails to capture the *left-most* number? But you're assuming there can be other characters before the first bunch of digits. I don't see anything to support that assumption.

Alan Moore 2009-04-02 17:51:21

There are only two conditions, numbered 1 and 2. Condition 1 says there are numbers (digit strings) but nothing about characters before, between or after them. Condition 2 only says something about the last 4 characters. So, no assumption on my side. Fixed the first sentence though.

MSalters 2009-04-03 08:33:32

Answer 4

A:

Java style: "(\\d+)\\..*?\\.html$"

This will 1) grab the first group of consecutive digits, 2) require a dot afterwards, 3) jump over everything up to 4) the literal string '.html'.

If you mean "one or more [groups] of numbers followed by a period" then this is more along the lines of your requirements.

"(\\d+)(?:\\.\\d+)*\\.html$"

This way you get a number and not the dot. And none of the other patterns need to be captured, so they are not.
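A quick sketch of how the two patterns above differ in practice, using find() since neither is anchored at the start. The class, field, and method names are hypothetical, and the sample file names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LooseVsStrict {
    // First pattern: digits, a dot, then anything (lazily) up to a final ".html".
    static final Pattern LOOSE = Pattern.compile("(\\d+)\\..*?\\.html$");
    // Second pattern: digits, then only further ".digits" groups before ".html".
    static final Pattern STRICT = Pattern.compile("(\\d+)(?:\\.\\d+)*\\.html$");

    // Returns group 1 of the first match found, or null if there is none.
    static String firstNumber(Pattern p, String name) {
        Matcher m = p.matcher(name);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"1.2.3.html", "1.junk.html", "1.html"}) {
            System.out.println(name + " loose=" + firstNumber(LOOSE, name)
                    + " strict=" + firstNumber(STRICT, name));
        }
    }
}
```

The first pattern accepts arbitrary text between the first dot and '.html' but needs at least two dots, while the second only accepts digit groups and also matches a name with a single number.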

Axeman 2009-04-02 16:57:05

Using .* when you know the expected string's pattern is one of my top regex turnoffs.

ojrac 2009-04-02 17:02:45

It fits the requirements as listed.

Axeman 2009-04-02 17:14:39

The wording of the problem is vague, but the sample data and code indicate that, aside from the "html" extension, the file name is expected to consist entirely of digits and dots. There's no need to resort to .* in this case.

Alan Moore 2009-04-02 17:34:46

+1 for the updated version. Grouping the dot with the *following* digits instead of the preceding digits seems more correct.

Alan Moore 2009-04-02 18:49:25

Actually, in most RE parsers, .*? followed by a literal is pretty darn fast. I don't code to samples, but I didn't keep the title in mind while reading the "spec" either.

Axeman 2009-04-02 18:53:50

Speed is not the point. In addition to extracting the first group of digits, the OP is validating the overall form of the string: one or more groups of digits separated by dots.

Alan Moore 2009-04-03 15:43:38

Taking the sample to be the pattern, you're right. But the sample complies with the spec of having "one or more numbers followed by a period" ending in '.html'. Nothing specified about in between. I can write that as two patterns, or I can specify it in one. As I said, I work to specs, not samples.

Axeman 2009-04-03 18:21:31
