Given the following input string 3481.7.1071.html

I want to confirm that

  1. The string has 1 or more numbers followed by a period.
  2. The string ends in html.

Finally, I want to extract the left-most number (i.e. 3481).

My current regex is nearly there but I can't capture the correct group:

final Pattern p = Pattern.compile("(\\d++\\.)+html");   
final Matcher m = p.matcher("3481.7.1071.html");
if (m.matches()) {
    final String corrected = m.group(1)+"html"; // WRONG! Gives 1071.html
}

How do I capture the first match?

+6  A: 

You can just factor it out:

(\d+\.)(\d+\.)*html
jpalecek
+3  A: 
"^(\\d+)\\.(\\d+\\.)*html$"
izb
+1. I'd take that one step further and make the 2nd group non-capturing -- ^(\d+)\.(?:\d+\.)*html$
ojrac
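
For reference, a minimal Java sketch of the non-capturing variant suggested above. The class and method names (LeftmostNumber, leftmost) are made up for illustration and appear nowhere in the thread. The pattern in the question returned 1071. because a capturing group inside a repetition keeps only the text from its last iteration; giving the first number its own group avoids that.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeftmostNumber {
    // Group 1 holds the first run of digits. The remaining "digits." runs are
    // matched by a non-capturing group, so group 1 is never overwritten.
    private static final Pattern P = Pattern.compile("^(\\d+)\\.(?:\\d+\\.)*html$");

    // Returns the left-most number, or null if the input does not have the expected form.
    static String leftmost(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(leftmost("3481.7.1071.html")); // 3481
        System.out.println(leftmost("3481.html"));        // 3481
        System.out.println(leftmost("x.7.html"));         // null
    }
}
```

Since matches() already requires the whole input to match, the ^ and $ anchors are redundant here; they only matter if the pattern is used with find().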
A: 

jpalecek's solution fails; it captures the rightmost number. The original poster was a lot closer, but he got the right-most number. To get the left-most number, ignore anything after the first dot:

[^\d]*(\d+)\..*html

[^\d]* skips everything before the left-most number (so X1.html captures the number 1).
(\d+)\. captures the first run of digits, provided it is followed by a dot.
.* skips everything between that dot and the final html.
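
A rough Java sketch of how this pattern behaves, assuming it is applied with matches() so the whole string has to match. The class and method names (LenientLeftmost, leftmost) are hypothetical, chosen only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LenientLeftmost {
    // Optional non-digits, the first run of digits (captured), a dot,
    // then anything at all up to the trailing "html".
    private static final Pattern P = Pattern.compile("[^\\d]*(\\d+)\\..*html");

    // Returns the captured digits, or null when the input does not match.
    static String leftmost(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(leftmost("3481.7.1071.html")); // 3481
        System.out.println(leftmost("X1.html"));          // 1
        System.out.println(leftmost("3481.abc.html"));    // 3481
        System.out.println(leftmost("3481html"));         // null (no dot after the digits)
    }
}
```

Note that this reading accepts non-digit content both before the first number and after the first dot, so it extracts the number but does not check that the rest of the name is made up of digits and dots.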

MSalters
Did you mean it fails to capture the *left-most* number? But you're assuming there can be other characters before the first bunch of digits. I don't see anything to support that assumption.
Alan Moore
There are only two conditions, numbered 1 and 2. Condition 1 says there are numbers (digit strings) but nothing about characters before, between or after them. Condition 2 only says something about the last 4 characters. So, no assumption on my side. Fixed the first sentence though.
MSalters
A: 

Java style: "(\\d+)\\..*?\\.html$"

This will 1) grab the first group of consecutive digits, 2) require a dot afterwards, 3) skip over everything up to 4) the literal string '.html'.

If you mean "one or more [groups] of numbers followed by a period" then this is more along the lines of your requirements.

"(\\d+)(?:\\.\\d+)*\\.html$"

This way you get a number and not the dot. And none of the other patterns need to be captured, so they are not.
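
To compare the two patterns side by side, here is a small Java sketch. It assumes the whole input must match, so it uses matches() rather than find(); the class, constant and method names (LooseVersusStrict, LOOSE, STRICT, first) are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LooseVersusStrict {
    // First pattern: digits, a dot, anything (lazily), then ".html".
    static final Pattern LOOSE = Pattern.compile("(\\d+)\\..*?\\.html$");
    // Second pattern: digits, zero or more ".digits" groups, then ".html".
    static final Pattern STRICT = Pattern.compile("(\\d+)(?:\\.\\d+)*\\.html$");

    // Returns group 1 if the entire input matches the pattern, otherwise null.
    static String first(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"3481.7.1071.html", "3481.foo.html", "3481.html"};
        for (String s : inputs) {
            System.out.println(s + " loose=" + first(LOOSE, s) + " strict=" + first(STRICT, s));
        }
    }
}
```

Under these assumptions both patterns return 3481 for 3481.7.1071.html. Only the first accepts 3481.foo.html, and only the second accepts 3481.html, because the first needs a dot after the digits in addition to the one in '.html'.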

Axeman
Using .* when you know the expected string's pattern is one of my top regex turnoffs.
ojrac
It fits the requirements as listed.
Axeman
The wording of the problem is vague, but the sample data and code indicate that, aside from the "html" extension, the file name is expected to consist entirely of digits and dots. There's no need to resort to .* in this case.
Alan Moore
+1 for the updated version. Grouping the dot with the *following* digits instead of the preceding digits seems more correct.
Alan Moore
Actually, in most RE parsers .*? when followed by a literal is pretty darn fast. I don't code to samples, but I also didn't keep the title in mind while reading the "spec" either.
Axeman
Speed is not the point. In addition to extracting the first group of digits, the OP is validating the overall form of the string: one or more groups of digits separated by dots.
Alan Moore
Taking the sample to be the pattern, you're right. But the sample complies with the spec of having "one or more numbers followed by a period" ending in '.html'. Nothing specified about in between. I can write that as two patterns, or I can specify it in one. As I said, I work to specs, not samples.
Axeman