ansaurus

Question

Finding tokens in a Java String

Answer 1

A:

StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.

Charlie Martin 2009-06-19 05:25:30

Sorry, which method is this? I don't see anything with something like 'include tokens' in the signature

digiarnie 2009-06-19 05:30:53

I can't seem to find that in the docs either: http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

Sev 2009-06-19 05:36:20

It is in the 3-arguments constructor. Nonetheless, the result will be {"hello","[","world","]","this","[","[","is","]","me"} so additional work needs to take place.

David Rabinowitz 2009-06-19 05:45:26
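
For anyone following the comments above, here is a minimal sketch of the three-argument StringTokenizer constructor with the flag (named returnDelims in the JDK) set each way. The class name TokenizerDemo and the helper split are made up for illustration; the input string is the one used elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative class name; not part of the original thread.
class TokenizerDemo {

    // Collects every token StringTokenizer yields for the delimiter set "[]".
    // returnDelims controls whether the delimiters themselves come back as tokens.
    static List<String> split(String text, boolean returnDelims) {
        StringTokenizer st = new StringTokenizer(text, "[]", returnDelims);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String content = "hello[world]this[[is]me";
        System.out.println(split(content, true));   // delimiters included as tokens
        System.out.println(split(content, false));  // delimiters dropped
    }
}
```

Either way the tokenizer cannot tell text inside the brackets from text outside them, so some extra filtering would still be needed.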

Answer 2

A:

The standard StringTokenizer won't work for this requirement; you will have to tweak it or write your own.

Rahul Garg 2009-06-19 05:28:52

Answer 3

A:

There's one way you can do this. It isn't particularly pretty. It involves going through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. The tokens are best collected in a resizable data structure such as a List rather than an array, since arrays have a fixed length.
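
A rough sketch of that character-by-character scan, under a couple of assumptions of mine: a "[" seen while a token is already open is kept as an ordinary character, and a token that never gets its closing "]" is dropped. The names BracketScanner and scan are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative class name; not part of the original thread.
class BracketScanner {

    // Walks the text once and collects whatever sits between a '[' and the next ']'.
    static List<String> scan(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = null;          // null means we are outside any brackets
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (current == null) {
                if (c == '[') {
                    current = new StringBuilder();  // opening bracket: start a new token
                }
            } else if (c == ']') {
                tokens.add(current.toString());     // closing bracket: token is complete
                current = null;
            } else {
                current.append(c);                  // inside brackets (assumption: '[' is kept)
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("hello[world]this[[is]me"));
    }
}
```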

Another possible solution is to use a regex with the String's split method. The only problem I have is coming up with a regex which would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.

indyK1ng 2009-06-19 05:32:20

Yeah, I was thinking that character-by-character may have to be the solution, but I was hoping to steer clear of that if possible, especially if there was an elegant pre-existing API for what I want.

digiarnie 2009-06-19 05:34:35

Answer 4

A:

Try a regular expression like:

(.*?\[(.*?)\])

The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].

Hawker 2009-06-19 05:41:33
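
In case it helps, here is one way the pattern from this answer could be applied with java.util.regex, reading the second capture group on each match. The class and method names (LazyGroupDemo, extract) are placeholders of mine, not anything from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not part of the original thread.
class LazyGroupDemo {

    // The pattern from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern BRACKETED = Pattern.compile("(.*?\\[(.*?)\\])");

    // Returns the second capture group of every match, i.e. the text between [ and ].
    static List<String> extract(String text) {
        Matcher m = BRACKETED.matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("hello[world]this[[is]me"));
    }
}
```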

Answer 5

+1 A:

Here is how I would do it, to avoid a dependency on Commons Lang.

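import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Escapes every regex metacharacter in the given literal text.
// The backslash comes first in specChars so backslashes added later are not escaped again.
public static String escapeRegexp(String regexp){
    String specChars = "\\$.*+?|()[]{}^";
    String result = regexp;
    for (int i=0;i<specChars.length();i++){
        Character curChar = specChars.charAt(i);
        result = result.replaceAll(
            "\\"+curChar,
            "\\\\" + (i<2?"\\":"") + curChar); // \ and $ are also special in the replacement string
    }
    return result;
}

// Returns the requested capture group of every match of pattern in content.
public static List<String> findGroup(String content, String pattern, int group) {
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(content);
    List<String> result = new ArrayList<String>();
    while (m.find()) {
        result.add(m.group(group));
    }
    return result;
}

// Extracts the text between each firstToken and the next lastToken.
public static List<String> tokenize(String content, String firstToken, String lastToken){
    // A single-character end token can use a negated character class;
    // it is escaped there too, so that "]" or "\" cannot break the class.
    String regexp = lastToken.length()>1
        ?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
        :escapeRegexp(firstToken) + "([^"+escapeRegexp(lastToken)+"]*)"+ escapeRegexp(lastToken);
    return findGroup(content, regexp, 1);
}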

Use it like this:

String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");

subtenante 2009-06-19 05:43:45

Why reinvent the wheel though?

Jon 2009-06-20 21:33:06

Because we live in a free world. And because you may not want to pull in a whole library for one method. And because I like it this way. Happy?

subtenante 2009-06-20 22:27:03

Answer 6

A:

StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:

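// Requires: import java.util.ArrayList; import java.util.List;
public List<String> extractTokens(String txt, String str, String end) {
    int          so = 0, eo;                        // start offset, end offset
    List<String> lst = new ArrayList<String>();

    // Find each start delimiter, then the first end delimiter after it.
    while (so < txt.length() && (so = txt.indexOf(str, so)) != -1) {
        so += str.length();
        if (so < txt.length() && (eo = txt.indexOf(end, so)) != -1) {
            lst.add(txt.substring(so, eo));
            so = eo + end.length();
        }
    }
    return lst;
}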

Software Monkey 2009-06-19 05:44:38

Answer 7

+5 A:

I think you can use the Apache Commons Lang feature that exists in StringUtils:

substringsBetween(java.lang.String str,
                  java.lang.String open,
                  java.lang.String close)

The API docs describe it as follows:

Searches a String for substrings delimited by a start and end tag, returning all matching substrings in an array.

The Commons Lang substringsBetween API can be found here:

http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)

Jon 2009-06-19 05:58:34

Answer 8

A:

The regular expression \\[[\\[\\w]+\\] gives us [world] and [[is]

Babak Naffas 2009-06-19 20:59:17
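
A quick way to check that claim, sketched with java.util.regex; the class name WordClassDemo and the method matches are hypothetical. Note that \w only covers letters, digits and underscore, so bracketed text containing spaces or punctuation would not match this pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not part of the original thread.
class WordClassDemo {

    // The pattern from the answer: "[", then one or more "[" or word characters, then "]".
    private static final Pattern P = Pattern.compile("\\[[\\[\\w]+\\]");

    // Returns every full match, brackets included.
    static List<String> matches(String text) {
        Matcher m = P.matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches("hello[world]this[[is]me"));
    }
}
```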
