ansaurus

Question

What is the best way to find specific tokens in a string (in Java)?

Answer 1

+8 A:

Regular expressions should work wonderfully for this.

Refer to your JavaDoc for

java.lang.String.split()
java.util.regex package
java.util.Scanner

Note: StringTokenizer is not what you want since it splits around characters, not strings - the string delim is a list of characters, any one of which will split. It's good for the very simple cases like an unambiguous comma separated list.

Software Monkey 2009-01-06 10:11:54
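
A minimal sketch of the difference described in the note above, using a made-up input string. The class and method names (TokenizerVsSplit, tokenize, splitOn) are illustrative only and do not come from the answer. StringTokenizer treats every character of the delimiter string as its own delimiter, whereas split() with a quoted pattern splits only where the whole string occurs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

class TokenizerVsSplit {

    // Collects the tokens produced by StringTokenizer.
    // Each character in delims ('<', 'B', '>') is a delimiter on its own.
    static List<String> tokenize(String input, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, delims);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Splits on the delimiter as one literal string.
    // Pattern.quote keeps regex metacharacters in the delimiter from being interpreted.
    static List<String> splitOn(String input, String delimiter) {
        return Arrays.asList(input.split(Pattern.quote(delimiter)));
    }

    public static void main(String[] args) {
        String input = "a<b<B>c"; // hypothetical input containing a stray '<'
        System.out.println(tokenize(input, "<B>")); // prints [a, b, c]
        System.out.println(splitOn(input, "<B>"));  // prints [a<b, c]
    }
}
```

The stray '<' is swallowed by StringTokenizer but survives split(), which is why split() or java.util.regex is the safer choice for multi-character delimiters.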

Answer 2

+2 A:

The StringTokenizer will give you separate tokens when you want to separate the string by a specific string. Or you can use the split() method in String to get the separate Strings as an array; split() takes a regular expression as its argument.

Markus Lausberg 2009-01-06 10:16:33

Thanks Markus. For reference, I found this: "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."

pro 2009-01-06 10:28:47

StringTokenizer splits around *characters*, not strings - the string delim is a list of characters, any one of which will split. It's good for the very simple cases like an unambiguous comma separated list.

Software Monkey 2009-01-06 10:43:43

Answer 3

+1 A:

StringTokenizer takes the whole String as an argument, and is not really a good idea for big strings. You can also use StreamTokenizer.

You also need to look at Scanner.

opyate 2009-01-06 10:35:38
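
As a rough sketch of the Scanner suggestion above: Scanner accepts a regular expression as its delimiter, so the start and end tags can both be treated as separators. The class name ScannerTokens, the helper tokens, and the delimiter pattern "</?B>" are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTokens {

    // Returns the pieces of input that lie between matches of delimiterRegex.
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(delimiterRegex);
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "</?B>" matches both <B> and </B>.
        System.out.println(tokens("abc<B>def</B>ghi<B>j</B>kl", "</?B>"));
        // prints [abc, def, ghi, j, kl]
    }
}
```

Note that this yields the text both inside and outside the tags, so it does not by itself tell you which tokens were tagged.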

Answer 4

+1 A:

Given your example I think I'd use regex and particularly I'd look at the grouping functionality offered by Matcher.

Tom

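import java.util.regex.Matcher;
import java.util.regex.Pattern;

String inputString = "abc<B>def</B>ghi<B>j</B>kl";

// Group 1 is the start tag, group 2 the text between the tags, group 3 the end tag.
String stringPattern = "(<B>)([a-zA-Z]+)(<\\/B>)";

Pattern pattern = Pattern.compile(stringPattern);
Matcher matcher = pattern.matcher(inputString);

// find() advances to the next match inside the input; matches() would only
// succeed if the entire input matched the pattern, which it does not here.
while (matcher.find()) {

    String firstGroup  = matcher.group(1);
    String secondGroup = matcher.group(2);
    String thirdGroup  = matcher.group(3);
}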

Tom Duckering 2009-01-06 11:07:56

this is great - matching the whole segment rather than just the start/end tags individually

pro 2009-01-06 11:26:27

When I tried your regex in http://www.gskinner.com/RegExr/ it did not seem to match the segments. Something simple like .+ matches everything from the first <B> to the last </B>, so that's not the way.

pro 2009-01-06 11:36:41

Your . will match anything, so it needs to be more restrictive. Hence my use of [a-zA-Z]. I'm sure with a little tweaking and some understanding of what you can expect in between the <B> and </B> you should be able to nail it.

Tom Duckering 2009-01-06 11:43:19

ah - replace [a-zA-Z] with [a-zA-Z]+

Tom Duckering 2009-01-06 11:44:07

or.. [^<]+ .. thanks Tom

pro 2009-01-06 12:10:08

There shouldn't be any backslashes in that regex. By putting them in front of each parenthesis, you're telling the regex to match literal parentheses. The one in front of the forward-slash isn't hurting anything, but it isn't necessary.

Alan Moore 2009-01-06 13:00:19

Alan - you're right. I think you need them for some other regex implementations. I do need an extra backslash for the forward slash to tell Java that the backslash is literal. What a palaver. :)

Tom Duckering 2009-01-06 13:47:47
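
To make the comment thread above concrete, here is a small sketch comparing a greedy .+ with the more restrictive [^<]+ on the example input. The class name GreedyVsRestricted and the helper contents are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsRestricted {

    // Returns the text captured by group 1 for every match of regex in input.
    static List<String> contents(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (matcher.find()) {
            out.add(matcher.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "abc<B>def</B>ghi<B>j</B>kl";

        // Greedy: .+ runs from the first <B> all the way to the last </B>.
        System.out.println(contents("<B>(.+)</B>", input));
        // prints [def</B>ghi<B>j]

        // Restricted: [^<]+ cannot cross a '<', so each segment matches separately.
        System.out.println(contents("<B>([^<]+)</B>", input));
        // prints [def, j]
    }
}
```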

Answer 5

+1 A:

It is a bit 'Brute Force' and makes some assumptions but this works.

public class SegmentFinder
{

    public static void main(String[] args)
    {
        String string = "abc<B>def</B>ghi<B>j</B>kl";
        String startRegExp = "<B>";
        String endRegExp = "</B>";
        int segmentCounter = 0;
        int currentPos = 0;
        String[] array = string.split(startRegExp);
        for (int i = 0; i < array.length; i++)
        {           
            if (i > 0) // Ignore the first one
            {
                segmentCounter++;
                // This assumes that every start tag has exactly one end tag.
                String[] array2 = array[i].split(endRegExp);
                int elementLength = array2[0].length();
                // Positions are 1-based and count characters with the tags removed.
                System.out.println("segment[" + segmentCounter + "] = " + (currentPos + 1) + "," + (currentPos + elementLength));
                for(String s : array2)
                {
                    currentPos += s.length();  
                }
            }
            else
            {
                currentPos += array[i].length();                
            }
        }
    }
}

Ron Tuffin 2009-01-06 11:16:23

Answer 6

A:

Does your input look like your example, and you need to get the text between specific tags? Then a simple StringUtils.substringsBetween(yourString, "<B>", "</B>") using the Apache Commons Lang package (http://commons.apache.org/lang/) should do the job.

If you're up for a more general solution, for different and possibly nested tags, you might want to look at a parser that takes HTML input and creates an XML document out of it, such as NekoHTML, TagSoup, or JTidy. You can then use XPath on the XML document to access the contents.

lutzh 2009-01-06 14:47:52
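
A sketch of the XPath step only, using the XML and XPath APIs that ship with the JDK. It skips the HTML-cleaning parsers named above and instead assumes the input is already well-formed XML, which is why the example string is wrapped in a made-up root element. The class name XPathContents and the helper textOf are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathContents {

    // Parses well-formed XML and returns the text content of every node
    // selected by the XPath expression.
    static List<String> textOf(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, document, XPathConstants.NODESET);

        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the fragment is wrapped in a single root element so it parses as XML.
        String xml = "<root>abc<B>def</B>ghi<B>j</B>kl</root>";
        System.out.println(textOf(xml, "//B")); // prints [def, j]
    }
}
```

This approach handles nesting and attributes, but it gives you node contents rather than character offsets.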

I need to get the positions of the "<B>" and "</B>" in number of characters, not including those tags. See the example in the question.

pro 2009-01-12 16:38:44
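
One possible way to get such positions is to let Matcher report the raw offsets and subtract the tag characters seen so far. This is only a sketch: the class name SegmentPositions and the helper segments are hypothetical, and the convention used here (0-based start, exclusive end, measured in the text with tags removed) is an assumption, since the question's own example is not reproduced in this thread. It also assumes tags are properly paired and not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentPositions {

    // Returns {start, end} pairs for each tagged segment, where the offsets
    // refer to the input with all open/close tags removed.
    static List<int[]> segments(String input, String open, String close) {
        // Reluctant .*? stops at the first closing tag; '.' does not match line breaks here.
        Pattern pattern = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher matcher = pattern.matcher(input);

        List<int[]> result = new ArrayList<>();
        int tagChars = 0; // number of tag characters that precede the current position
        while (matcher.find()) {
            tagChars += open.length();
            int start = matcher.start(1) - tagChars; // 0-based, in tag-free text
            int end = matcher.end(1) - tagChars;     // exclusive
            result.add(new int[] {start, end});
            tagChars += close.length();
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] segment : segments("abc<B>def</B>ghi<B>j</B>kl", "<B>", "</B>")) {
            System.out.println(segment[0] + "," + segment[1]);
        }
        // prints 3,6 and then 9,10
    }
}
```

With the tags removed the text is "abcdefghijkl", so 3,6 covers "def" and 9,10 covers "j". Adding 1 to each start gives the 1-based, inclusive numbering printed by the SegmentFinder answer (4,6 and 10,10).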
