Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?

For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:

"hello[world]this[[is]me"

The output should be:

token[0] = "world"

token[1] = "[is"

(Note: the second token has a 'start' string in it)

A: 

StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.

Charlie Martin
Sorry, which method is this? I don't see anything with something like 'include tokens' in the signature
digiarnie
I can't seem to find that in the docs either: http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html
Sev
It is in the three-argument constructor. Nonetheless, the result will be {"hello","[","world","]","this","[","[","is","]","me"}, so additional work is needed.
David Rabinowitz
A: 

A normal string tokenizer won't work for this requirement; you'll have to tweak it or write your own.

Rahul Garg
A: 

There's one way you can do this. It isn't particularly pretty. It involves going through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. Collect the tokens in a growable data structure such as a list rather than an array, since arrays have a fixed length.
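
The character-by-character scan described above could look roughly like this. It is only a sketch: the class name BracketScanner and the method scan are made up for illustration, and it assumes single-character delimiters and silently drops a token that is never closed.

```java
import java.util.ArrayList;
import java.util.List;

class BracketScanner {
    // Hypothetical helper: handles single-character delimiters only.
    static List<String> scan(String text, char open, char close) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = null; // null means "not inside a token"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (current == null) {
                if (c == open) {
                    current = new StringBuilder(); // start collecting a token
                }
            } else if (c == close) {
                tokens.add(current.toString()); // token is complete
                current = null;
            } else {
                current.append(c); // also keeps any extra open characters
            }
        }
        return tokens; // an unterminated token is dropped
    }

    public static void main(String[] args) {
        System.out.println(scan("hello[world]this[[is]me", '[', ']'));
    }
}
```

For the string in the question this yields the two tokens "world" and "[is".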

Another possible solution is to use a regex with String's split method. The only problem I have is coming up with a regex that would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.

indyK1ng
Yeah, I was thinking that character-by-character might have to be the solution, but I was hoping to steer clear of that if possible, especially if there was already an elegant API for what I want.
digiarnie
A: 

Try a regular expression like:

(.*?\[(.*?)\])

The second capture group should contain the text between each pair of []. It will not, however, work properly if the string contains nested [].
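
To try that expression from Java, something like the following should do. This is a sketch, not a polished utility: the class name LazyBracketRegex and the method extract are invented here, and the backslashes are doubled only because the regex sits in a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyBracketRegex {
    // The expression from the answer, written as a Java string literal.
    private static final Pattern TOKEN = Pattern.compile("(.*?\\[(.*?)\\])");

    // Hypothetical helper: collects capture group 2 of every match.
    static List<String> extract(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group(2)); // group 2 is the text between [ and ]
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extract("hello[world]this[[is]me"));
    }
}
```

On the sample string the collected groups are "world" and "[is", because the lazy quantifier stops at the first closing bracket it meets.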

Hawker
+1  A: 

Here is how I would do it to avoid a dependency on Commons Lang.

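import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Escapes regex metacharacters so the string is matched literally.
public static String escapeRegexp(String regexp){
 // Backslash comes first so the escapes added later are not escaped again.
 String specChars = "\\$.*+?|()[]{}^";
 String result = regexp;
 for (int i=0;i<specChars.length();i++){
  Character curChar = specChars.charAt(i);
  result = result.replaceAll(
   "\\"+curChar,
   "\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
 }
 return result;
}

// Collects the given capture group from every match of pattern in content.
public static List<String> findGroup(String content, String pattern, int group) {
 Pattern p = Pattern.compile(pattern);
 Matcher m = p.matcher(content);
 List<String> result = new ArrayList<String>();
 while (m.find()) {
  result.add(m.group(group));
 }
 return result;
}


// Returns the text between each firstToken and the next lastToken.
public static List<String> tokenize(String content, String firstToken, String lastToken){
 // A single-character end token uses a negated character class instead of a lazy match;
 // it is escaped so that characters such as \ or ] stay valid inside the class.
 String regexp = lastToken.length()>1
     ?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
     :escapeRegexp(firstToken) + "([^"+escapeRegexp(lastToken)+"]*)"+ escapeRegexp(lastToken);
 return findGroup(content, regexp, 1);
}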

Use it like this:

String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");
subtenante
Why reinvent the wheel though?
Jon
Because we live in a free world. And because you may not want to use a whole library for one method in it. And because I like it this way. Happy ?
subtenante
A: 

StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:

import java.util.ArrayList;
import java.util.List;

public List<String> extractTokens(String txt, String str, String end) {
    int                      so=0,eo;
    List<String>             lst=new ArrayList<String>();

    // find each start marker, then the first end marker that follows it
    while(so<txt.length() && (so=txt.indexOf(str,so))!=-1) {
        so+=str.length();
        if(so<txt.length() && (eo=txt.indexOf(end,so))!=-1) {
            lst.add(txt.substring(so,eo));
            so=eo+end.length();
            }
        }
    return lst;
    }
Software Monkey
+5  A: 

I think you can use the Apache Commons Lang feature that exists in StringUtils:

substringsBetween(java.lang.String str,
                  java.lang.String open,
                  java.lang.String close)

The API docs say:

Searches a String for substrings delimited by a start and end tag, returning all matching substrings in an array.

The Commons Lang substringsBetween API can be found here:

http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)

Jon
A: 

The regular expression \\[[\\[\\w]+\\] (written as a Java string literal) gives us [world] and [[is], with the enclosing brackets included.
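
Since the matches still carry the outer brackets, they need trimming to get the tokens from the question. A rough sketch follows; the names WordBracketRegex and extract are made up for illustration. Note the assumption built into the pattern: only word characters and "[" are allowed inside a token, so something like "[a b]" is not matched at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBracketRegex {
    // The expression from the answer: "[", then word characters or "[", then "]".
    private static final Pattern TOKEN = Pattern.compile("\\[[\\[\\w]+\\]");

    // Hypothetical helper: finds every match and strips the outer brackets.
    static List<String> extract(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String match = m.group(); // includes the outer [ and ]
            tokens.add(match.substring(1, match.length() - 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extract("hello[world]this[[is]me"));
    }
}
```

For the sample string this produces "world" and "[is".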

Babak Naffas