+1  Q: 

Matching rounds...

Hello.

I have some text with the following structure:

Round 1

some multiline text ...

Round 2

some multiline text ...

...

Round N

some multiline text ...

I'd like to match rounds with their multiline text.

Neither of these expressions produces the correct result:

(Round\s\d+)((?!Round).*?)

(Round\s\d+)(.*?)

Could someone help me?

Thank you in advance.

+1  A: 

The dot (.) character matches all characters except newlines by default. In many languages you can use the s modifier to make the dot match all characters, including newlines. It should look something like this:

/(Round\s\d+)(.*?)(Round\s\d+|$)/s

(Not 100% sure if this regex will work, I'm just showing you how to use the s modifier.)

Edit: Tested on regexpal.com and it appears to work.
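For illustration, here is the same idea in Python, where the s modifier is spelled re.DOTALL. This is only a minimal sketch; the sample string is made up.

```python
import re

sample = "Round 1\nfirst line\nsecond line"

# Without DOTALL the dot stops at the first newline, so the group is empty.
plain = re.search(r"Round\s\d+(.*)", sample)

# With DOTALL (the "s" modifier) the dot also matches newlines.
dotall = re.search(r"Round\s\d+(.*)", sample, re.DOTALL)

print(repr(plain.group(1)))
print(repr(dotall.group(1)))
```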

yjerem
This regex will find round 1, round 3, round 5, etc, but not round 2, round 4, round 6, etc. because the headers of the even rounds are consumed in the regex matches of the odd rounds.
Jan Goyvaerts
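The effect described in this comment can be reproduced with a short Python sketch. The four-round sample text is invented for the demonstration, and re.DOTALL stands in for the s modifier.

```python
import re

text = "Round 1\naaa\nRound 2\nbbb\nRound 3\nccc\nRound 4\nddd\n"

pattern = re.compile(r"(Round\s\d+)(.*?)(Round\s\d+|$)", re.DOTALL)

# Each match swallows the following header in group 3, so the next
# search starts after it and that round never shows up in group 1.
headers = [m.group(1) for m in pattern.finditer(text)]
print(headers)  # ['Round 1', 'Round 3']
```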
+1  A: 

Using a regular expression directly on multiple lines may not be easy (in terms of readability and maintainability).

I would process the text line by line and use a data structure to hold whatever has been seen so far. You can compare this to email processing, where you have headers, a body, etc.
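A rough sketch of that line-by-line approach in Python might look like the following. It assumes each "Round N" header sits on a line of its own; the function name parse_rounds and the sample data are made up for illustration.

```python
import re

def parse_rounds(text):
    """Collect the non-blank lines that follow each 'Round N' header."""
    rounds = {}      # header -> list of body lines
    current = None   # header of the round currently being read
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"Round\s+\d+", stripped):
            current = stripped
            rounds[current] = []
        elif current is not None and stripped:
            rounds[current].append(stripped)
    return rounds

sample = "Round 1\n\nA 1-0 B\nC 2-2 D\n\nRound 2\n\nB 0-3 C\n"
rounds = parse_rounds(sample)
print(rounds)
```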

PolyThinker
+1  A: 

Is this a C# question?

(Round\s\d+)(.*?)

Use RegexOptions.Singleline

Singleline Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).

And you should probably use Matches instead of Match.

OIS
The .*? in your regex will match absolutely nothing, because there's nothing after the lazy star to force it to repeat more than zero times.
Jan Goyvaerts
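A quick Python check of this comment (re.DOTALL plays the role of Singleline; the sample text is invented):

```python
import re

text = "Round 1\naaa\nRound 2\nbbb\n"

# Nothing follows the lazy (.*?), so it is satisfied with zero characters.
matches = re.findall(r"(Round\s\d+)(.*?)", text, re.DOTALL)
print(matches)  # [('Round 1', ''), ('Round 2', '')]
```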
A: 

Thank you for the answers.

Yes, I did use the Singleline option; I just forgot to mention it. I tested the regexes in Expresso.

A: 

It's rarely if ever correct to use a reluctant quantifier as the last thing in a regex. In this regex:

/(Round\s+\d+)(.*?)/s

...the first thing the (.*?) part does is try to match zero characters. That's a perfectly legal match, and because the quantifier is reluctant, it stops right there. If you're going to do it this way, there has to be something after the (.*?), like this:

/(Round\s+\d+)(.*?)(Round\s+\d+)/s

This way, the (.*?) can't stop at zero characters; it has to keep consuming characters until it reaches a spot where the next part of the regex - (Round\s+\d+) - can take over. But you don't want to use that regex because it consumes part of what's supposed to be the next match. Sticking to this format, you can use a lookahead as the ending condition:

/(Round\s+\d+)(.*?)(?=Round\s+\d+|$)/s

Now it's forced to match a whole entry, but the match position is left at the beginning of the next entry so the next match attempt will start there. (EDIT: added |$ to the lookahead to match the last entry.)
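As a sanity check, the lookahead version can be tried in Python with re.DOTALL in place of the s modifier. The sample text is invented. Note that in Python, $ also matches just before a trailing newline, so the bodies are stripped before printing.

```python
import re

text = "Round 1\naaa\nRound 2\nbbb\nRound 3\nccc\n"

pattern = re.compile(r"(Round\s+\d+)(.*?)(?=Round\s+\d+|$)", re.DOTALL)

# The lookahead checks for the next header (or the end) without consuming
# it, so the next match attempt starts right at that header.
entries = [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]
print(entries)  # [('Round 1', 'aaa'), ('Round 2', 'bbb'), ('Round 3', 'ccc')]
```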

EDIT: I meant to comment on your other regex, too:

/(Round\s+\d+)((?!Round).*?)/s

Here, instead of using a positive lookahead as the ending condition, it looks like you're trying to use a preemptive negative lookahead. For that to work, the lookahead has to be performed at each position before the dot is allowed to consume a character. That means the dot has to be enclosed in parentheses with the lookahead, with the quantifier outside them:

/(Round\s+\d+)((?:(?!Round).)*)/s

You can't use a reluctant quantifier in this regex either, for the same reason as the other one.
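Here is a small Python sketch of the negative-lookahead form, again with re.DOTALL standing in for the s modifier and an invented sample. It also shows why the inner group should be non-capturing: in Python's re module, a capturing group that is repeated keeps only its last iteration.

```python
import re

text = "Round 1\naaa\nRound 2\nbbb\n"

# Non-capturing group around lookahead + dot; one outer group for the body.
good = re.compile(r"(Round\s+\d+)((?:(?!Round).)*)", re.DOTALL)
bodies = [m.group(2).strip() for m in good.finditer(text)]
print(bodies)  # ['aaa', 'bbb']

# If the capturing group wraps a single dot and is then repeated,
# only the last character it matched is kept in the group.
bad = re.compile(r"(Round\s+\d+)((?!Round).)*", re.DOTALL)
last_chars = [m.group(2) for m in bad.finditer(text)]
print(last_chars)  # ['\n', '\n']
```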

There's probably a better way to do this, but I would need to know more about the data and your requirements before I could suggest anything.

(Note that I used Perl-like syntax, with the slash delimiters and trailing 's' modifier for single-line mode, because regexes tend to confuse the site's syntax highlighter without them.)

Alan Moore
A: 

This will do the trick with RegexOptions.Singleline set:

Round\s+\d+(.*?)(?=Round\s\d|$)
Jan Goyvaerts
A: 

Alan, great tips on regular expressions. I haven't had enough practice with lookaheads.

/(Round\s+\d+)(.*?)(?=Round\s+\d+|$)/s does exactly what I need.

/(Round\s+\d+)((?!Round).)*/s works as well, but causes every letter to be a separate capture.

Thank you very much.

To describe my data more exactly you can look here for example: http://www.rsssf.com/tablesi/ital09.html

What I actually need is to import all the information about rounds, matches, results, and their dates into my database.

I have another problem to solve: how to correlate the teams I have already stored with those that appear in match results. For example, I have a team 'Inter' in my db, but a match result can look like

Internazionale 1-1 Juventus or FC Inter 1-1 Juventus

In the future I'd like to run regex queries such as 'get all match results for Inter' so that I don't have to look through the whole content.

So my idea was to store with each team their possible names (tags) and then combine them via |.

For example /(Inter|Internazionale|FC Inter)\s+\d+-\d+\s+(\w+)/s

I also have doubts about using (\w+) to match any team. I'm afraid I will have to concatenate all the team name tags with | and use that instead. For 30 teams with 2-3 tags each it will be a huge regex.
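One way the tag idea could be sketched in Python is to build the alternation from a table instead of writing it by hand. The ALIASES table, the function names, and the sample lines are all hypothetical, and the sketch assumes each result sits on its own line in the form "Home 1-1 Away".

```python
import re

# Hypothetical alias table: canonical name -> spellings seen in results.
ALIASES = {
    "Inter": ["Inter", "Internazionale", "FC Inter"],
    "Juventus": ["Juventus", "Juve"],
}

def team_pattern(canonical):
    """Alternation of all known spellings, longest first, safely escaped."""
    names = sorted(ALIASES[canonical], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(n) for n in names) + ")"

def home_results(canonical, text):
    """Return (home_goals, away_goals, opponent) where the team is at home."""
    team = team_pattern(canonical)
    regex = re.compile(rf"^{team}\s+(\d+)-(\d+)\s+(.+?)\s*$", re.MULTILINE)
    return [(int(m.group(1)), int(m.group(2)), m.group(3))
            for m in regex.finditer(text)]

sample = "Internazionale 1-1 Juventus\nFC Inter 2-0 Roma\nMilan 0-0 Inter\n"
found = home_results("Inter", sample)
print(found)  # [(1, 1, 'Juventus'), (2, 0, 'Roma')]
```

Because the pattern is generated, its length matters less for readability; the opponent is captured with (.+?) rather than (\w+) so that names containing spaces are not cut short.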

I appreciate your help.

That second regex should have been "/(Round\s+\d+)((?:(?!Round).)*)/s". I was trying to make the minimum changes necessary to get the regex to match, but I should have been thinking about the captures, too.
Alan Moore
As for the rest of your question, it sounds like you're trying to do too much with regexes. I would scan the whole page once, parse it, and store the info in a searchable data structure. If you want help with that, you should start a new thread.
Alan Moore
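A rough Python sketch of that "scan once, then search a data structure" idea follows. Everything in it is an assumption made for illustration: the CANONICAL alias table, the index_results name, and the "Home 1-1 Away" line format (the real page may need a different pattern).

```python
import re
from collections import defaultdict

# Hypothetical alias table: every known spelling -> canonical team name.
CANONICAL = {
    "Inter": "Inter",
    "Internazionale": "Inter",
    "FC Inter": "Inter",
    "Juventus": "Juventus",
}

RESULT = re.compile(r"^(.+?)\s+(\d+)-(\d+)\s+(.+?)\s*$")

def index_results(text):
    """Parse every 'Home 1-1 Away' line once and index it by team."""
    by_team = defaultdict(list)
    for line in text.splitlines():
        m = RESULT.match(line.strip())
        if not m:
            continue  # headers, dates, blank lines
        home = CANONICAL.get(m.group(1), m.group(1))
        away = CANONICAL.get(m.group(4), m.group(4))
        result = (home, int(m.group(2)), int(m.group(3)), away)
        by_team[home].append(result)
        by_team[away].append(result)
    return by_team

sample = "Round 1\nInternazionale 1-1 Juventus\nRoma 2-0 FC Inter\n"
index = index_results(sample)
print(index["Inter"])
```

With an index like this, 'get all match results for Inter' becomes a dictionary lookup, and unknown spellings simply fall through under their own name until they are added to the table.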