I have a string with markup in it, and I need to find the positions of the marked-up segments using Java.
e.g.
string = abc<B>def</B>ghi<B>j</B>kl
Desired output:
segment [n] = start, end
segment [1] = 4, 6
segment [2] = 10, 10
Regular expressions should work wonderfully for this.
Refer to your JavaDoc for
Note: StringTokenizer is not what you want since it splits around characters, not strings - the string delim is a list of characters, any one of which will split. It's good for the very simple cases like an unambiguous comma separated list.
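To make the difference concrete, here is a small sketch comparing the two on a shortened version of the example string. The class and method names (TokenizerVsSplit, tokenize, split) are made up for illustration; only StringTokenizer and String.split come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of any library.
class TokenizerVsSplit {

    // StringTokenizer treats every character of delims as a separate delimiter.
    static List<String> tokenize(String input, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // String.split treats its argument as a regular expression matching a whole delimiter.
    static List<String> split(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        String input = "abc<B>def</B>ghi";
        // '<', 'B' and '>' each split on their own, so "</B>" leaves a stray "/" token.
        System.out.println(tokenize(input, "<B>")); // [abc, def, /, ghi]
        // The whole string "<B>" is the delimiter, so "</B>" is left untouched.
        System.out.println(split(input, "<B>"));    // [abc, def</B>ghi]
    }
}
```

Any capital B inside the text itself would also be swallowed by the tokenizer, which is another reason it does not fit this problem.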
StringTokenizer will give you separate tokens, but it splits on individual delimiter characters rather than on a specific string. Alternatively, you can use the split() method in String, which takes a regular expression and returns the separated Strings as an array.
StringTokenizer takes the whole String as an argument, and is not really a good idea for big strings. You can also use StreamTokenizer, which reads from a Reader instead.
You also need to look at Scanner.
Given your example I think I'd use regex and particularly I'd look at the grouping functionality offered by Matcher.
Tom
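import java.util.regex.Matcher;
import java.util.regex.Pattern;

String inputString = "abc<B>def</B>ghi<B>j</B>kl";
// Three groups: opening tag, tagged text, closing tag
String stringPattern = "(<B>)([a-zA-Z]+)(</B>)";
Pattern pattern = Pattern.compile(stringPattern);
Matcher matcher = pattern.matcher(inputString);
// find() moves to each occurrence in turn; matches() would only
// succeed if the whole input matched the pattern, which it does not here.
while (matcher.find()) {
    String firstGroup = matcher.group(1);  // "<B>"
    String secondGroup = matcher.group(2); // "def", then "j"
    String thirdGroup = matcher.group(3);  // "</B>"
}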
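To get the start and end positions the question asks for, Matcher also provides start() and end() for each group. Below is a minimal sketch, assuming the positions are 1-based and counted in the text with the tags stripped out (which is what the desired output 4, 6 and 10, 10 suggests), and that tags are never nested. BoldSegments and find are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class BoldSegments {

    // Reluctant quantifier so each <B> pairs with the nearest </B>.
    private static final Pattern BOLD = Pattern.compile("<B>(.*?)</B>");

    // Returns {start, end} pairs, 1-based and inclusive, measured in the tag-free text.
    static List<int[]> find(String input) {
        List<int[]> segments = new ArrayList<>();
        int removed = 0; // number of tag characters seen so far
        Matcher m = BOLD.matcher(input);
        while (m.find()) {
            removed += m.start(1) - m.start();   // length of the opening tag
            int start = m.start(1) - removed + 1; // convert to 1-based
            int end = m.end(1) - removed;         // end(1) is exclusive, so this is inclusive
            segments.add(new int[] { start, end });
            removed += m.end() - m.end(1);       // length of the closing tag
        }
        return segments;
    }

    public static void main(String[] args) {
        List<int[]> segments = find("abc<B>def</B>ghi<B>j</B>kl");
        for (int i = 0; i < segments.size(); i++) {
            int[] s = segments.get(i);
            System.out.println("segment [" + (i + 1) + "] = " + s[0] + ", " + s[1]);
        }
        // segment [1] = 4, 6
        // segment [2] = 10, 10
    }
}
```

An empty segment such as <B></B> would come out with end one less than start, so you may want to special-case that.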
It is a bit 'brute force' and makes some assumptions (every <B> has exactly one matching </B>, and positions are counted with the tags stripped out), but this works.
public class SegmentFinder
{
    public static void main(String[] args)
    {
        String string = "abc<B>def</B>ghi<B>j</B>kl";
        String startRegExp = "<B>";
        String endRegExp = "</B>";
        int segmentCounter = 0;
        int currentPos = 0; // characters of tag-free text consumed so far
        String[] array = string.split(startRegExp);
        for (int i = 0; i < array.length; i++)
        {
            if (i > 0) // Ignore the first one, it is the text before the first start tag
            {
                segmentCounter++;
                // this assumes that every start will have exactly one end
                String[] array2 = array[i].split(endRegExp);
                int elementLength = array2[0].length();
                System.out.println("segment[" + segmentCounter + "] = " + (currentPos + 1) + "," + (currentPos + elementLength));
                // advance past the segment and the plain text that follows it
                for (String s : array2)
                {
                    currentPos += s.length();
                }
            }
            else
            {
                currentPos += array[i].length();
            }
        }
    }
}
Does your input look like your example, and do you need to get the text between specific tags? Then a simple StringUtils.substringsBetween(yourString, "<B>", "</B>") from the Apache Commons Lang package (http://commons.apache.org/lang/) should do the job.
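If pulling in Commons Lang is not an option, roughly the same result can be had with plain indexOf. The following is only a sketch of the idea, not the library's implementation: the Between class is made up for this example, it returns a List rather than an array, and its handling of edge cases (null arguments, no match) may well differ from what StringUtils does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the Commons Lang implementation.
class Between {

    // Collects every piece of text found between an open marker and the next close marker.
    static List<String> substringsBetween(String str, String open, String close) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = str.indexOf(open, pos);
            if (start < 0) {
                break; // no further opening marker
            }
            start += open.length();
            int end = str.indexOf(close, start);
            if (end < 0) {
                break; // unterminated segment, ignore it
            }
            result.add(str.substring(start, end));
            pos = end + close.length();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(substringsBetween("abc<B>def</B>ghi<B>j</B>kl", "<B>", "</B>"));
        // [def, j]
    }
}
```

Either way you get the text of each segment rather than its position, so for the start/end output in the question you would still need to track offsets yourself.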
If you're up for a more general solution, for different and possibly nested tags, you might want to look at a parser that takes HTML input and creates an XML document out of it, such as NekoHTML, TagSoup or JTidy. You can then use XPath on the XML document to access the contents.
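For the XPath half of that idea, here is a sketch using only the JDK's built-in XML classes. It skips the clean-up step entirely: it assumes the fragment is already well-formed once wrapped in a root element, which real-world HTML often is not (that is the job NekoHTML, TagSoup or JTidy would do). The XPathBold class and the wrapper element name "root" are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; assumes the input is already well-formed markup.
class XPathBold {

    // Returns the text content of every B element in the fragment.
    static List<String> boldTexts(String fragment) {
        try {
            // Wrap the fragment so the parser sees a single root element.
            String xml = "<root>" + fragment + "</root>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//B", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() also includes the text of any nested elements.
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(boldTexts("abc<B>def</B>ghi<B>j</B>kl"));
        // [def, j]
    }
}
```

XML element names are case-sensitive, so "//B" will not match a lowercase b element; adjust the expression to whatever your tidying parser emits.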