ansaurus

Question

Start regexp matching in the middle of a string in Java.

Answer 1

A:

Is your application performance-critical enough that a str.substring(pos) would matter? A regex is going to be multiple orders of magnitude slower than a substring, so rather than making your regex more complicated just break it up. That would be my approach.

Andrew 2010-08-01 17:40:14

Answer 2

+5 A:

Matcher.find(int start) would be useful for you
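A minimal sketch of how this might look, assuming a three-digit pattern and a sample string; the class and method names here are hypothetical and only for illustration. Note that `find(int)` resets the matcher first, and the reported indices refer to the whole input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromOffset {
    // Returns "[match] [start,end)" for the first match at or after `from`,
    // or null if there is none. (Hypothetical helper for illustration.)
    static String firstMatchFrom(String text, String regex, int from) {
        Matcher m = Pattern.compile(regex).matcher(text);
        // find(int) resets the matcher and starts searching at index `from`.
        if (m.find(from)) {
            return String.format("[%s] [%d,%d)", m.group(), m.start(), m.end());
        }
        return null;
    }

    public static void main(String[] args) {
        // Index 5 is in the middle of "456", so the first full match is "890".
        System.out.println(firstMatchFrom("012 456 890 234", "\\d{3}", 5)); // [890] [8,11)
    }
}
```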

Marimuthu Madasamy 2010-08-01 17:41:59

Thanks, it looks like it's exactly what I need. I guess I should have looked harder before asking. The naikus/polygenelubricants answer is also suitable, but I don't need that much control. Thanks!

dispose 2010-08-01 18:37:19

@dispose, Welcome!

Marimuthu Madasamy 2010-08-01 18:45:51

A comment in case anyone has the same question: when using `find(int start)`, the `^` character still matches only at the beginning of the whole string, not at the position where the search starts. To anchor the match at the search start, use `region(start, end).lookingAt()` (explained below).
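To make the difference concrete, here is a small sketch, assuming a sample string and a three-digit pattern; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorAtOffset {
    // Searches from `from` with a pattern that starts with ^.
    static boolean findWithCaret(String text, int from) {
        Matcher m = Pattern.compile("^\\d{3}").matcher(text);
        // find(int) resets the region to the whole input, so ^ only matches at index 0.
        return m.find(from);
    }

    // Requires the match to begin exactly at `from`.
    static boolean lookingAtRegion(String text, int from) {
        Matcher m = Pattern.compile("\\d{3}").matcher(text);
        // lookingAt() must match starting at the beginning of the region.
        return m.region(from, text.length()).lookingAt();
    }

    public static void main(String[] args) {
        String text = "abc 456 xyz";
        System.out.println(findWithCaret(text, 4));   // false
        System.out.println(lookingAtRegion(text, 4)); // true
        System.out.println(lookingAtRegion(text, 3)); // false: index 3 is a space
    }
}
```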

dispose 2010-08-01 23:45:16

Answer 3

+1 A:

How about using Matcher.region(int start, int end)?

The javadoc says:

Sets the limits of this matcher's region. The region is the part of the input sequence that will be searched to find a match. Invoking this method resets the matcher, and then sets the region to start at the index specified by the start parameter and end at the index specified by the end parameter.
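As a rough illustration, assuming a sample string and a three-digit pattern (the class and method names are hypothetical), `matches()` then has to match the whole region rather than the whole input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionMatches {
    // True if the characters in [start, end) are exactly three digits.
    static boolean regionIsThreeDigits(String text, int start, int end) {
        Matcher m = Pattern.compile("\\d{3}").matcher(text);
        m.region(start, end); // resets the matcher, then restricts it to [start, end)
        return m.matches();   // must match the entire region
    }

    public static void main(String[] args) {
        String text = "012 456 890 234";
        System.out.println(regionIsThreeDigits(text, 4, 7)); // true: region is "456"
        System.out.println(regionIsThreeDigits(text, 3, 7)); // false: region is " 456"

        Matcher m = Pattern.compile("\\d{3}").matcher(text).region(4, 7);
        System.out.println(m.regionStart() + ".." + m.regionEnd()); // 4..7
    }
}
```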

naikus 2010-08-01 17:44:17

+1 great find; I've elaborated in my answer with examples.

polygenelubricants 2010-08-01 17:56:38

Answer 4

+4 A:

A java.util.regex.Matcher tries to find matches on a region, which defaults to the entire input, but may be explicitly set to a specific subrange.

From the documentation:

A matcher finds matches in a subset of its input called the region. By default, the region contains all of the matcher's input. The region can be modified via the region(int start, int end) method and queried via the regionStart and regionEnd methods. The way that the region boundaries interact with some pattern constructs can be changed. See useAnchoringBounds and useTransparentBounds for more details.

Remember that like many methods in Java library classes, the start index is inclusive but the end index is exclusive.

Snippet

Here's an example usage:

    String text = "012 456 890 234";
    Pattern ddd = Pattern.compile("\\d{3}");
    Matcher m = ddd.matcher(text).region(3, 12);
    while (m.find()) {
        System.out.printf("[%s] [%d,%d)%n",
            m.group(),
            m.start(),
            m.end()
        );
    }

The above prints (as seen on ideone.com):

    [456] [4,7)
    [890] [8,11)

On anchoring bounds and transparent bounds

As previously mentioned, when you specify a region, you can change the behavior of some pattern constructs depending on what you need.

An anchoring bound makes the boundary of the region match various boundary matchers (^, $, etc).

An opaque bound essentially cuts off the rest of the input from lookaheads, lookbehinds, and certain boundary matching constructs. On the other hand, in transparent mode, they are allowed to see characters outside of the region as necessary.

By default, a Matcher uses both anchoring and opaque bounds. This is applicable to most subregion matching scenarios, but you can set your own combination depending on your need.
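The four combinations can be tried out with a sketch like the following, assuming the input "foobar" with the region set to "bar"; the class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundsDemo {
    // Runs find() on the region [start, end) with the given bounds settings.
    static boolean findInRegion(String regex, String text, int start, int end,
                                boolean anchoring, boolean transparent) {
        Matcher m = Pattern.compile(regex).matcher(text);
        m.region(start, end)
         .useAnchoringBounds(anchoring)
         .useTransparentBounds(transparent);
        return m.find();
    }

    public static void main(String[] args) {
        String text = "foobar"; // the region [3, 6) covers "bar"

        // Anchoring bounds: ^ matches at the region start only when enabled.
        System.out.println(findInRegion("^bar", text, 3, 6, true, false));  // true
        System.out.println(findInRegion("^bar", text, 3, 6, false, false)); // false

        // Transparent bounds: the lookbehind can see "foo" only when enabled.
        System.out.println(findInRegion("(?<=foo)bar", text, 3, 6, true, false)); // false
        System.out.println(findInRegion("(?<=foo)bar", text, 3, 6, true, true));  // true
    }
}
```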

polygenelubricants 2010-08-01 17:56:13

Answer 5

A:

In Java 6 and earlier, String.substring is a constant time operation; the character data is not copied, but shared with the original string (from Java 7 update 6 onward, substring copies the characters instead). From the JDK 6 source code:

    // Package private constructor which shares value array for speed.
    String(int offset, int count, char value[]) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    public String substring(int beginIndex, int endIndex) {
        // error checking omitted
        return ((beginIndex == 0) && (endIndex == count)) ? this :
            new String(offset + beginIndex, endIndex - beginIndex, value);
    }

So there is nothing to worry about in terms of substring performance.

meriton 2010-08-01 17:58:17

It's true that `substring` is `O(1)` in time and space, but the match indices then need to be translated back to indices in the original string, which is a hassle and error-prone.
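A sketch of the bookkeeping this implies, assuming a sample string and pattern (the class and method names are hypothetical): every index reported by the matcher has to be shifted by the substring offset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubstringOffsets {
    // Returns {start, end} of the first match at or after `from`, expressed in
    // indices of the ORIGINAL string, or null if there is no match.
    static int[] findViaSubstring(String text, String regex, int from) {
        Matcher m = Pattern.compile(regex).matcher(text.substring(from));
        if (!m.find()) {
            return null;
        }
        // m.start() and m.end() are relative to the substring; add the offset back.
        return new int[] { m.start() + from, m.end() + from };
    }

    public static void main(String[] args) {
        int[] span = findViaSubstring("012 456 890 234", "\\d{3}", 5);
        System.out.println(span[0] + "," + span[1]); // 8,11
    }
}
```

Forgetting the `+ from` in even one place silently yields wrong positions, which is the hazard described above.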

polygenelubricants 2010-08-01 17:59:37

Answer 6

+1 A:

The region() method is what you're looking for. Each time you match something, you move the region's start position up to wherever that match ended. As far as the Matcher is concerned, that's now the beginning of the input.

If you set the useAnchoringBounds() option, you can treat the start of the region as if it were the beginning of the text (i.e., ^ or \A will match there), and if you set useTransparentBounds(), lookbehinds and word boundaries will still be able to "see" the preceding text. And you can use both options at once.

If you always want the next match to start precisely at the beginning of the region, you can even use lookingAt() instead of find()--the only good use I've ever found for that method. ;)
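One way this loop might be written is sketched below as a tiny tokenizer. The token pattern, class name, and error handling are assumptions made for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionTokenizer {
    // Splits text into runs of digits and runs of letters, skipping whitespace.
    static List<String> tokenize(String text) {
        // Every alternative consumes at least one character, so the loop always advances.
        Pattern token = Pattern.compile("\\d+|[a-zA-Z]+|\\s+");
        Matcher m = token.matcher(text);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length()); // the region now starts where the last match ended
            if (!m.lookingAt()) {         // the token must begin exactly at pos
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            if (!m.group().trim().isEmpty()) {
                tokens.add(m.group());    // keep everything except whitespace runs
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc 123 de4")); // [abc, 123, de, 4]
    }
}
```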

Alan Moore 2010-08-01 18:12:48
