I'm parsing a relatively complex expression in Java using regexps + some manual parsing. What I'm doing right now is removing what I've already parsed from the string, so I have the next thing to parse right at the beginning of the string.

I would like to change this so that I keep an int pos variable and don't modify the string. However, neither the Pattern nor the Matcher class seems to have a way to specify the index of the first character to match. Is there any way to do it?

(I know that I can just pass str.substring(pos) to the Matcher, but I guess it's much more expensive, and complicates my code a bit, since I'm using the start() and end() methods frequently).

A: 

Is your application performance-critical enough that a str.substring(pos) would matter? A regex is going to be multiple orders of magnitude slower than a substring, so rather than making your regex more complicated just break it up. That would be my approach.

Andrew
+5  A: 

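Matcher.find(int start) would be useful for you. It resets the matcher and searches from the given index, and start() and end() still report positions in the original string.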
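A minimal sketch of how this might look with a pos variable. The class name, the nextMatch helper, and the sample input are illustrative assumptions, not part of the question. Note that find(int) resets the matcher first, so any region set earlier is discarded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromPos {
    // Hypothetical helper: returns {start, end} of the first match at or
    // after pos, in coordinates of the original string, or null if none.
    static int[] nextMatch(Pattern p, String input, int pos) {
        Matcher m = p.matcher(input);
        if (m.find(pos)) {
            return new int[] { m.start(), m.end() };
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "012 456 890 234";
        Pattern ddd = Pattern.compile("\\d{3}");
        int pos = 3; // everything before index 3 is already parsed
        int[] span = nextMatch(ddd, text, pos);
        // The indices refer to text itself, so no offset arithmetic is needed.
        System.out.println(text.substring(span[0], span[1])
            + " [" + span[0] + "," + span[1] + ")");
    }
}
```

The search may skip characters before it finds a match (here the space at index 3), so this suits "find the next occurrence" rather than "match exactly at pos".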

Marimuthu Madasamy
Thanks, it looks like it's exactly what I need. I guess I should have looked better before asking. The naikus/polygenelubricants answer is also suitable, but I don't need that much control. Thanks!
dispose
@dispose, Welcome!
Marimuthu Madasamy
A comment if anyone has the same question: Using `find(int)`, the `^` character matches the beginning of the whole string, not the beginning of the search. To make sure you match at the position where the search starts, use `region().lookingAt()` (explained below).
dispose
+1  A: 

How about using Matcher.region(int start, int end)?

The javadoc says:

Sets the limits of this matcher's region. The region is the part of the input sequence that will be searched to find a match. Invoking this method resets the matcher, and then sets the region to start at the index specified by the start parameter and end at the index specified by the end parameter.

naikus
+1 great find; I've elaborated in my answer with examples.
polygenelubricants
+4  A: 

A java.util.regex.Matcher tries to find matches on a region, which defaults to the entire input, but may be explicitly set to a specific subrange.

From the documentation:

A matcher finds matches in a subset of its input called the region. By default, the region contains all of the matcher's input. The region can be modified via the region(int start, int end) method and queried via the regionStart and regionEnd methods. The way that the region boundaries interact with some pattern constructs can be changed. See useAnchoringBounds and useTransparentBounds for more details.

Remember that like many methods in Java library classes, the start index is inclusive but the end index is exclusive.


Snippet

Here's an example usage:

    String text = "012 456 890 234";
    Pattern ddd = Pattern.compile("\\d{3}");
    Matcher m = ddd.matcher(text).region(3, 12);
    while (m.find()) {
        System.out.printf("[%s] [%d,%d)%n",
            m.group(),
            m.start(),
            m.end()
        );
    }

The above prints (as seen on ideone.com):

[456] [4,7)
[890] [8,11)

On anchoring bounds and transparent bounds

As previously mentioned, when you specify a region, you can change the behavior of some pattern constructs depending on what you need.

An anchoring bound makes the boundary of the region match various boundary matchers (^, $, etc).

An opaque bound essentially cuts off the rest of the input from lookaheads, lookbehinds, and certain boundary matching constructs. On the other hand, in transparent mode, they are allowed to see characters outside of the region as necessary.

By default, a Matcher uses both anchoring and opaque bounds. This is applicable to most subregion matching scenarios, but you can set your own combination depending on your need.
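As a rough illustration of the two settings, the sketch below runs the same patterns over the region [3, 6) of "foobar" with different bound combinations. The class name, the findsInRegion helper, and the sample strings are assumptions made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionBounds {
    // Hypothetical helper: does regex match somewhere inside input[start, end)
    // under the given anchoring/transparency settings?
    static boolean findsInRegion(String regex, String input, int start, int end,
                                 boolean anchoring, boolean transparent) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(start, end);
        m.useAnchoringBounds(anchoring);
        m.useTransparentBounds(transparent);
        return m.find();
    }

    public static void main(String[] args) {
        String text = "foobar"; // region [3, 6) covers "bar"

        // Anchoring bounds (the default): ^ matches at the region start.
        System.out.println(findsInRegion("^bar", text, 3, 6, true, false));
        // Non-anchoring bounds: ^ only matches at the real start of the input.
        System.out.println(findsInRegion("^bar", text, 3, 6, false, false));

        // Opaque bounds (the default): the lookbehind cannot see "foo".
        System.out.println(findsInRegion("(?<=foo)bar", text, 3, 6, true, false));
        // Transparent bounds: the lookbehind may look outside the region.
        System.out.println(findsInRegion("(?<=foo)bar", text, 3, 6, true, true));
    }
}
```

The first and last calls find a match; the second and third do not.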

polygenelubricants
A: 

With the String implementation quoted below (JDK 6 and earlier), String.substring is a constant time operation; the character data is not copied, but shared with the original string. From the JDK source code:

    // Package private constructor which shares value array for speed.
    String(int offset, int count, char value[]) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    public String substring(int beginIndex, int endIndex) {
        // error checking omitted
        return ((beginIndex == 0) && (endIndex == count)) ? this :
            new String(offset + beginIndex, endIndex - beginIndex, value);
    }

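So with this implementation there is nothing to worry about in terms of substring performance. (Since Java 7 update 6, substring copies the characters instead of sharing the array, so it takes time proportional to the length of the substring.)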

meriton
This is true about `substring` being `O(1)` in time and space, but the indices then need to be translated back to indices in the original string, which is a hassle and error-prone.
polygenelubricants
+1  A: 

The region() method is what you're looking for. Each time you match something, you move the region's start position up to wherever that match ended. As far as the Matcher is concerned, that's now the beginning of the input.

If you set the useAnchoringBounds() option, you can treat the start of the region as if it were the beginning of the text (i.e., ^ or \A will match there), and if you set useTransparentBounds(), lookbehinds and word boundaries will still be able to "see" the preceding text. And you can use both options at once.

If you always want the next match to start precisely at the beginning of the region, you can even use lookingAt() instead of find()--the only good use I've ever found for that method. ;)
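The region-plus-lookingAt() approach could be sketched as a small tokenizer loop like the one below. The token patterns, the class name, and the at helper are hypothetical and only stand in for whatever the real expression grammar needs; the point is that pos advances through the original string without ever cutting it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionTokenizer {
    // Hypothetical token patterns, chosen only for illustration.
    private static final Pattern NUMBER = Pattern.compile("\\d+");
    private static final Pattern OPERATOR = Pattern.compile("[-+*/()]");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Switches the matcher to pattern p and tries to match it exactly at pos.
    private static boolean at(Matcher m, Pattern p, int pos, int end) {
        return m.usePattern(p).region(pos, end).lookingAt();
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SPACES.matcher(input);
        int end = input.length();
        int pos = 0;
        while (pos < end) {
            if (at(m, SPACES, pos, end)) {
                // Whitespace is skipped, not recorded.
            } else if (at(m, NUMBER, pos, end) || at(m, OPERATOR, pos, end)) {
                tokens.add(m.group());
            } else {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            // end() is an index into the original input, so it is the next pos.
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12 + (3*45)"));
    }
}
```

Every pattern here consumes at least one character, so pos always advances; a pattern that can match the empty string would need an extra guard against looping forever.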

Alan Moore