I have a string containing markup, and I need to find the positions of the marked-up segments using Java.

e.g.

string = abc<B>def</B>ghi<B>j</B>kl

Desired output:

segment [n] = start, end

segment [1] = 4, 6
segment [2] = 10, 10
+8  A: 

Regular expressions should work wonderfully for this.

Refer to your JavaDoc for

  • java.lang.String.split()
  • java.util.regex package
  • java.util.Scanner

Note: StringTokenizer is not what you want since it splits around characters, not strings - the string delim is a list of characters, any one of which will split. It's good for the very simple cases like an unambiguous comma separated list.
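To make the difference concrete, here is a minimal sketch comparing the two on a small input. The class name TokenizerVersusSplit and the sample string are made up for illustration; only StringTokenizer and String.split() come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerVersusSplit {

    // StringTokenizer treats every character of delims as a separate delimiter.
    static List<String> tokenize(String input, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, delims);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // String.split takes a regex, so the delimiter is matched as a whole string.
    static List<String> split(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        String input = "aBc<B>def</B>ghi"; // hypothetical sample with a stray capital B
        System.out.println(tokenize(input, "<B>"));  // prints [a, c, def, /, ghi]
        System.out.println(split(input, "</?B>"));   // prints [aBc, def, ghi]
    }
}
```

The tokenizer breaks "aBc" apart and leaves a stray "/" because '<', 'B' and '>' each act as a delimiter on their own, while split() only cuts at the complete tags.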

Software Monkey
+2  A: 

The StringTokenizer will give you separate tokens when you want to separate the string by a specific string. Or you can use the split() method in String to get the separate Strings. To get the separate pieces, you have to pass a regular expression to it.

Markus Lausberg
Thanks Markus. For reference, I found this: StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
pro
StringTokenizer splits around *characters*, not strings - the string delim is a list of characters, any one of which will split. It's good for the very simple cases like an unambiguous comma separated list.
Software Monkey
+1  A: 

StringTokenizer takes the whole String as an argument, and is not really a good idea for big strings. You can also use StreamTokenizer.

You also need to look at Scanner.
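As a rough sketch of the Scanner route, the following treats either tag as a delimiter and collects the text between them. The class name ScannerTokens and the delimiter regex "</?B>" are assumptions for the example input in the question, not something Scanner requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokens {

    // Returns the pieces of text found between <B> and </B> tags, in order.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // The delimiter is a regex: an opening or a closing B tag.
            scanner.useDelimiter("</?B>");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [abc, def, ghi, j, kl]
        System.out.println(tokens("abc<B>def</B>ghi<B>j</B>kl"));
    }
}
```

This yields the text pieces only; it does not say which pieces were inside a tag pair, so positions would still have to be tracked separately.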

opyate
+1  A: 

Given your example I think I'd use regex and particularly I'd look at the grouping functionality offered by Matcher.

Tom

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String inputString = "abc<B>def</B>ghi<B>j</B>kl";

String stringPattern = "(<B>)([a-zA-Z]+)(<\\/B>)";

Pattern pattern = Pattern.compile(stringPattern);
Matcher matcher = pattern.matcher(inputString);

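// find() steps through every match in the input; matches() succeeds only
// when the entire string is a single segment, so it would never fire here.
while (matcher.find()) {

    String firstGroup  = matcher.group(1); // opening tag "<B>"
    String secondGroup = matcher.group(2); // segment text, e.g. "def"
    String thirdGroup  = matcher.group(3); // closing tag "</B>"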
}
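
To get from the matched groups to the start/end numbers asked for in the question, one possible sketch is to use Matcher.start() and Matcher.end() and subtract the tag characters seen so far. The class name SegmentPositions is hypothetical, and the code assumes the input contains only well-formed, non-nested, non-empty <B>...</B> pairs and no other markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SegmentPositions {

    private static final String OPEN = "<B>";
    private static final String CLOSE = "</B>";
    // Group 1 is the text between the tags; [^<]+ stops at the next tag.
    private static final Pattern SEGMENT = Pattern.compile(OPEN + "([^<]+)" + CLOSE);

    // Returns {start, end} pairs, 1-based and inclusive, counted in the
    // text as it would read with all tags removed.
    static List<int[]> find(String input) {
        List<int[]> segments = new ArrayList<>();
        int tagChars = 0; // tag characters that precede the current position
        Matcher matcher = SEGMENT.matcher(input);
        while (matcher.find()) {
            tagChars += OPEN.length();                   // opening tag of this match
            int start = matcher.start(1) - tagChars + 1; // 0-based index to 1-based
            int end = matcher.end(1) - tagChars;         // end(1) is exclusive
            segments.add(new int[] {start, end});
            tagChars += CLOSE.length();                  // closing tag of this match
        }
        return segments;
    }

    public static void main(String[] args) {
        List<int[]> segments = find("abc<B>def</B>ghi<B>j</B>kl");
        for (int i = 0; i < segments.size(); i++) {
            int[] s = segments.get(i);
            System.out.println("segment [" + (i + 1) + "] = " + s[0] + ", " + s[1]);
        }
        // prints:
        // segment [1] = 4, 6
        // segment [2] = 10, 10
    }
}
```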
Tom Duckering
this is great - matching the whole segment rather than just the start/end tags individually
pro
when I tried your regex in http://www.gskinner.com/RegExr/ it did not seem to match the segments. Something simple like <B>.+</B> matches from the first <B> to the last </B>, so that's not the way
pro
Your . will match anything, so it needs to be more restrictive. Hence my use of [a-zA-Z]. I'm sure with a little tweaking and some understanding of what you can expect in between the <B> and </B> you should be able to nail it.
Tom Duckering
ah - replace [a-zA-Z] with [a-zA-Z]+
Tom Duckering
or.. <b>[^<]+</b> .. thanks Tom
pro
There shouldn't be any backslashes in that regex. By putting them in front of each parenthesis, you're telling the regex to match literal parentheses. The one in front of the forward-slash isn't hurting anything, but it isn't necessary.
Alan Moore
Alan - you're right. I think you need them for some other regex implementation. I do need an extra backslash for the forward slash to tell Java that the backslash is literal. What a palaver. :)
Tom Duckering
+1  A: 

It is a bit 'Brute Force' and makes some assumptions but this works.

public class SegmentFinder
{

    public static void main(String[] args)
    {
        String string = "abc<B>def</B>ghi<B>j</B>kl";
        String startRegExp = "<B>";
        String endRegExp = "</B>";
        int segmentCounter = 0;
        int currentPos = 0;
        String[] array = string.split(startRegExp);
        for (int i = 0; i < array.length; i++)
        {           
            if (i > 0) // Ignore the first one
            {
                segmentCounter++;
                //this assumes that every start will have exactly one end
                String[] array2 = array[i].split(endRegExp);
                int elementLength = array2[0].length();
                System.out.println("segment[" + segmentCounter + "] = " + (currentPos + 1) + "," + (currentPos + elementLength));
                for(String s : array2)
                {
                    currentPos += s.length();  
                }
            }
            else
            {
                currentPos += array[i].length();                
            }
        }
    }
}
Ron Tuffin
A: 

Does your input look like your example, and you need to get the text between specific tags? Then a simple StringUtils.substringsBetween(yourString, "<B>", "</B>") using the Apache Commons Lang package (http://commons.apache.org/lang/) should do the job.

If you're up for a more general solution, for different and possibly nested tags, you might want to look at a parser that takes HTML input and creates an XML document out of it, such as NekoHTML, TagSoup, or jTidy. You can then use XPath on the XML document to access the contents.
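
As a rough illustration of the XPath step only, the sketch below skips the HTML-cleaning parsers and uses the JDK's built-in XML parser instead. That swap assumes the markup is already well-formed XML once wrapped in a made-up <root> element; the class name XPathSegments is hypothetical too. It returns the tagged text, not the character positions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSegments {

    // Returns the text content of every B element found in the markup.
    static List<String> boldTexts(String markup) {
        try {
            // Assumption: wrapping in a single root element makes the input valid XML.
            String wrapped = "<root>" + markup + "</root>";
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            // "//B" selects B elements at any depth, so nested tags are found as well.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//B", document, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // prints [def, j]
        System.out.println(boldTexts("abc<B>def</B>ghi<B>j</B>kl"));
    }
}
```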

lutzh
I need to get the positions of the "<B>" and "</B>" in number of characters not including those tags. - see the example in the question
pro