Hi All,

I am new to Java regex, so please help me. Consider the paragraph below:

Paragraph :

            Name abc
            sadghsagh
            hsajdjah Name
            ggggggggg
            !!!
            Name ggg
            dfdfddfdf Name
            !!!
            Name hhhh
            sahdgashdg Name
            asjdhjasdh
            sadasldkalskd
            asdjhakjsdhja
            !!!

I need to split the above paragraph into blocks of text, each starting with Name and ending with !!!. I don't want to use !!! as the only delimiter to split the paragraph; the starting sequence (Name) must also be part of my regex.

I.e., my resulting API should look like SplitAsBlocks("Paragraph","startswith Name","endswith !!!")

How can I achieve this? Please, can anyone help me?

Now I want the same output as Brito has given, but here I have added Name after "hsajdjah". With that change, the text is split as below:

Name
ggggggggg
!!!

but I need

Name abc
sadghsagh
hsajdjah Name
ggggggggg
!!!

That is, I have to match only a Name that is at the start of a line, not one in the middle.

Please advise.

Bart, see the input case below for your code.

I need to split the following using your API with the parameters start => Name and end => !, but the output differs from what I expect. I have only 3 blocks that start with Name and end with !. I have attached the output as well.

String myInput = "Name hhhhh class0" + "\n" +
                 "HHHHHHHHHHHHHHHHHH" + "\n" +
                 "!" + "\n" +
                 "Name TTTTT TTTT" + "\n" +
                 "GGGGGG UUUUU IIII" + "\n" +
                 "!" + "\n" +
                 "Name JJJJJ WWWW" + "\n" +
                 "IIIIIIIIIIIIIIIIIIIII" + "\n" +
                 "!" + "\n" +
                 "RRRRRRRRRRR TTTTTTTT" + "\n" +
                 "HHHHHH" + "\n" +
                 "JJJJJ 1 Name class1" + "\n" +
                 "LLLLL 5 Name class5" + "\n" +
                 "!" + "\n" +
                 "OOOOOO HHHH FFFFFF" + "\n" +
                 "service 0 Name class12" + "\n" +
                 "!" + "\n" +
                 "JJJJJ YYYYYY 3/0" + "\n" +
                 "KKKKKKK" + "\n" +
                 "UUU UUU UUUUU" + "\n" +
                 "QQQQQQQ" + "\n" +
                 "!";
String[] tokens = tokenize(myInput, "Name", "!");
int n = 0;
for (String t : tokens) {
    System.out.println("---------------------------\n" + (++n) + "\n" + t);
}

Output:

---------------------------
1
Name hhhhh class0
HHHHHHHHHHHHHHHHHH
!
---------------------------
2
Name TTTTT TTTT
GGGGGG UUUUU IIII
!
---------------------------
3
Name JJJJJ WWWW
IIIIIIIIIIIIIIIIIIIII
!
---------------------------
4
Name class1
LLLLL 5 Name class5
!
---------------------------
5
Name class12
!

Here I need to match only a Name at the start of a line, not one in the middle. How do I add a regex for this?

+2  A: 

Try:

import java.util.*;
import java.util.regex.*;

public class Main { 

    public static String[] tokenize(String text, String start, String end) {
        // old line:
        //Pattern p = Pattern.compile("(?s)"+Pattern.quote(start)+".*?"+Pattern.quote(end));
        // new line:
        Pattern p = Pattern.compile("(?sm)^"+Pattern.quote(start)+".*?"+Pattern.quote(end)+"$");

        Matcher m = p.matcher(text);
        List<String> tokens = new ArrayList<String>();
        while(m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[]{});
    }

    public static void main(String[] args) {
        String text = "Name abc" + "\n" +
            "sadghsagh"          + "\n" +
            "hsajdjah Name"      + "\n" +
            "ggggggggg"          + "\n" +
            "!!!"                + "\n" +
            "Name ggg"           + "\n" +
            "dfdfddfdf Name"     + "\n" +
            "!!!"                + "\n" +
            "Name hhhh"          + "\n" +
            "sahdgashdg Name"    + "\n" +
            "asjdhjasdh"         + "\n" +
            "sadasldkalskd"      + "\n" +
            "asdjhakjsdhja"      + "\n" +
            "!!!";
        String[] tokens = tokenize(text, "Name", "!!!");
        int n = 0;
        for(String t : tokens) {
            System.out.println("---------------------------\n"+(++n)+"\n"+t);
        }
    }
}
Bart Kiers
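To see what the `(?sm)^...$` change does, here is a minimal sketch that runs the old and the new pattern from the answer above side by side on a shortened version of the question's second input. The class name `AnchoredBlocks` and the helper names `findAll`, `unanchored` and `anchored` are made up for this illustration; only the two regexes come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredBlocks {

    // Collects every match of the given regex in text.
    static List<String> findAll(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Old pattern: "start" may sit anywhere in a line.
    static List<String> unanchored(String text, String start, String end) {
        return findAll("(?s)" + Pattern.quote(start) + ".*?" + Pattern.quote(end), text);
    }

    // New pattern: (?m) makes ^ and $ match at line boundaries, so "start"
    // must begin a line and "end" must finish a line.
    static List<String> anchored(String text, String start, String end) {
        return findAll("(?sm)^" + Pattern.quote(start) + ".*?" + Pattern.quote(end) + "$", text);
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "Name hhhhh class0", "HHHHHH", "!",
                "RRRR", "JJJJJ 1 Name class1", "!",
                "Name TTTTT TTTT", "GGGG", "!");
        System.out.println(unanchored(text, "Name", "!").size()); // 3: the mid-line Name starts a block
        System.out.println(anchored(text, "Name", "!").size());   // 2: the mid-line Name is ignored
    }
}
```

Note that the anchored pattern still accepts an end delimiter that merely finishes a line (e.g. `foo!`), and a start such as `Namesake` at the beginning of a line; whether that matters depends on the real input.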
Shouldn't that be `/Name|!!!/`?
Kobi
Thanks Bart. I need Name and !!! within my final string. Also, my condition is to use both Name and !!! to split the string, not either Name or !!!.
Sidharth
@Kobi: no, in Java you don't use delimiters around your regex. Don't confuse Java with JavaScript in this matter!
Bart Kiers
@OP: see the edit.
Bart Kiers
Bart, thanks a lot for your help. You defined text as a single line, as follows: `String text = "Name abc sadghsagh hsajdjah !!! Name ggg dfdfddfdf !!! Name hhhh sahdgashdg asjdhjasdh sadasldkalskd asdjhakjsdhja !!!";` But in my case each word is on a separate line. Please suggest how to split this type of paragraph.
Sidharth
It works for strings with line breaks as well: try it.
Bart Kiers
Did you mean: `return tokens.toArray(new String[tokens.size()])`?
rsp
@Bart - sorry, of course Java doesn't support that. I was confused, and sure this was JavaScript. Sorry, don't know how that happened...
Kobi
Thanks all, sfussenegger and Kobi, and special thanks to Bart. Bart, your code works fine with newline characters too. Again, thanks a lot, Bart.
Sidharth
No problem Kobi. The JavaScript syntax looks a lot like Java's, so the mistake is easily made.
Bart Kiers
@rsp. No, I meant what I posted. Try it and see for yourself it works.
Bart Kiers
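On the `toArray` question in the comments above: `List.toArray(T[])` allocates a new array of the same runtime type when the one passed in is too small, so both spellings return the same contents. A small sketch to check this (the class and method names are invented for the example):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ToArrayDemo {

    // As posted in the answer: a zero-length array only conveys the element type.
    static String[] viaEmptyArray(List<String> tokens) {
        return tokens.toArray(new String[]{});
    }

    // As suggested in the comment: a pre-sized array that is filled in place.
    static String[] viaSizedArray(List<String> tokens) {
        return tokens.toArray(new String[tokens.size()]);
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>(Arrays.asList("Name a !!!", "Name b !!!"));
        System.out.println(Arrays.equals(viaEmptyArray(tokens), viaSizedArray(tokens))); // true
    }
}
```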
You're welcome `unknown (google)`.
Bart Kiers
Hi Bart, one more doubt. I have a situation where the "Name" delimiter may appear inside the script, but I want to ignore that kind of Name. For example, my paragraph looks like: `Name abc sadghsagh hsajdjah Name ---- here i want to ignore this Name !!! Name ggg dfdfddfdf !!! Name hhhh sahdgashdg asjdhjasdh sadasldkalskd asdjhakjsdhja !!!`
Sidharth
When using my suggestion, the String `Name abc sadghsagh hsajdjah Name ---- here i want to ignore this Name !!! Name ggg dfdfddfdf !!! Name hhhh sahdgashdg asjdhjasdh sadasldkalskd asdjhakjsdhja !!!` will be split into three parts: *1*: `Name abc sadghsagh hsajdjah Name ---- here i want to ignore this Name !!!`, *2*: `Name ggg dfdfddfdf !!!` and *3*: `Name hhhh sahdgashdg asjdhjasdh sadasldkalskd asdjhakjsdhja !!!`, so I don't see any problems with that. If you want different output, please edit your original question accordingly. Thanks.
Bart Kiers
Sure, I am doing that now.
Sidharth
+2  A: 
String s = "Name abc sadghsagh hsajdjah !!! Name ggg dfdfddfdf !!! Name hhhh sahdgashdg asjdhjasdh sadasldkalskd asdjhakjsdhja !!!!! ";
String startsWith = "Name";
String endsWith = "!!!";

// non-greedily get all groups starting with Name and ending with !!!
String pattern = String.format("(%s).*?(%s)", Pattern.quote(startsWith), Pattern.quote(endsWith));
System.out.println(pattern);

Matcher m = Pattern.compile(pattern, Pattern.DOTALL).matcher(s);
while (m.find()) 
  System.out.println(m.group());

output:

(\QName\E).*?(\Q!!!\E)
Name abc sadghsagh hsajdjah !!!
Name ggg dfdfddfdf !!!
Name hhhh sahdgashdg asjdhjasdh sadasldkalskd asdjhakjsdhja !!!
sfussenegger
Note that by default, the DOT does not match line breaks. Also, if the substrings the OP wants to "split" on contains regex meta characters, things will go wrong. Lastly, `group()` is the same as `group(0)`, but that's just eye-candy.
Bart Kiers
Thanks sfussenegger. The above solution works fine if String s is a single line, but in my case it is a paragraph, i.e., there is a newline character after each word. How can I split this? Please help.
Sidharth
You could add the `s` flag (single-line mode) so that the dot operator (.) also matches newlines. The relevant line would be `Matcher m = Pattern.compile(pattern, Pattern.DOTALL).matcher(s);`
squiddle
Special characters shouldn't be a problem as I've suggested using `Pattern.quote(startsWith)` that takes care of special characters. Additionally, I've edited my code to use `group()` and `Pattern.DOTALL`.
sfussenegger
I posted the remark about special characters before you edited your answer. Your first answer did not contain `Pattern.quote(...)`'s.
Bart Kiers
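The DOTALL point from these comments can be checked with a short sketch. It assumes a multi-line input like the one in the question; the class `DotallDemo` and the method `findBlocks` are names invented here, while `Pattern.quote` and `Pattern.DOTALL` are the calls discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {

    // Finds non-greedy blocks from "start" to "end"; flags is passed to Pattern.compile.
    static List<String> findBlocks(String text, String start, String end, int flags) {
        // Pattern.quote makes the delimiters literal, even if they contain regex meta characters.
        String regex = Pattern.quote(start) + ".*?" + Pattern.quote(end);
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        List<String> blocks = new ArrayList<>();
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String text = "Name abc\nsadghsagh\n!!!\nName ggg\ndfdfddfdf\n!!!";
        // Without DOTALL the dot stops at each line break, so no block is found.
        System.out.println(findBlocks(text, "Name", "!!!", 0).size());              // 0
        // With DOTALL the dot also matches "\n", so both blocks are found.
        System.out.println(findBlocks(text, "Name", "!!!", Pattern.DOTALL).size()); // 2
    }
}
```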
A: 

The following should also work if you want to keep both Name and !!! in the results.

String[] parts = string.split("(?=(Name|!!!))");

Edit: here's the corrected version:

String[] parts = string.split("(?<=!!!)\\s*(?=Name)");

This splits on any whitespace between !!! and Name and on nothing else, thereby keeping both parts. If you don't want to split on !!!Name, replace \\s* with \\s+ so that one or more whitespace characters are required instead of zero or more.
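A quick way to compare the two quantifiers is the sketch below. The class and method names are made up for the illustration; the two regexes are the ones from this answer.

```java
import java.util.Arrays;

class SplitWhitespaceDemo {

    // Zero or more whitespace characters between !!! and Name: also splits "!!!Name".
    static String[] splitZeroOrMore(String s) {
        return s.split("(?<=!!!)\\s*(?=Name)");
    }

    // One or more whitespace characters: leaves "!!!Name" in one piece.
    static String[] splitOneOrMore(String s) {
        return s.split("(?<=!!!)\\s+(?=Name)");
    }

    public static void main(String[] args) {
        String spaced = "Name a !!!\nName b !!!";
        String glued = "Name a !!!Name b !!!";
        System.out.println(Arrays.toString(splitZeroOrMore(spaced))); // [Name a !!!, Name b !!!]
        System.out.println(Arrays.toString(splitOneOrMore(spaced)));  // [Name a !!!, Name b !!!]
        System.out.println(splitZeroOrMore(glued).length);            // 2
        System.out.println(splitOneOrMore(glued).length);             // 1
    }
}
```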

Edit 2: attached an example of the input/output. The input is copied from the original question:

String string = "Name hhhhh class0" + "\n" + "HHHHHHHHHHHHHHHHHH" + "\n" + "!" + "\n"
    + "Name TTTTT TTTT" + "\n" + "GGGGGG UUUUU IIII" + "\n" + "!" + "\n"
    + "Name JJJJJ WWWW" + "\n" + "IIIIIIIIIIIIIIIIIIIII" + "\n" + "!" + "\n"
    + "RRRRRRRRRRR TTTTTTTT" + "\n" + "HHHHHH" + "\n" + "JJJJJ 1 Name class1" + "\n"
    + "LLLLL 5 Name class5" + "\n" + "!" + "\n" + "OOOOOO HHHH FFFFFF" + "\n"
    + "service 0 Name class12" + "\n" + "!" + "\n" + "JJJJJ YYYYYY 3/0" + "\n" + "KKKKKKK"
    + "\n" + "UUU UUU UUUUU" + "\n" + "QQQQQQQ" + "\n" + "!";

String[] parts = string.split("(?<=!)\\s*(?=Name)");
for (String part : parts) {
    System.out.println(part);
    System.out.println("---------------------------------");
}

Output:

Name hhhhh class0
HHHHHHHHHHHHHHHHHH
!
---------------------------------
Name TTTTT TTTT
GGGGGG UUUUU IIII
!
---------------------------------
Name JJJJJ WWWW
IIIIIIIIIIIIIIIIIIIII
!
RRRRRRRRRRR TTTTTTTT
HHHHHH
JJJJJ 1 Name class1
LLLLL 5 Name class5
!
OOOOOO HHHH FFFFFF
service 0 Name class12
!
JJJJJ YYYYYY 3/0
KKKKKKK
UUU UUU UUUUU
QQQQQQQ
!
---------------------------------

Looks fine?

BalusC
Hi BalusC, here you are using an OR condition, but I need an AND condition.
Sidharth
Sorry, I missed this crucial point. I'll update the answer asap.
BalusC
Thanks BalusC, I have tested your trick, but please test your code with the input I gave in my question. The output differs there; please have a look.
Sidharth
Works fine here?
BalusC
Yes, but look at the 3rd block of the output above, which is wrong in my case. I.e., I need to break at the first ! itself, but in the above case it shows all the lines up to the end. Please execute Bart's code and compare it with your output. Thanks BalusC.
Sidharth
So, even if the text potentially contains a "!", e.g. "class!0", the code should still break on it? That doesn't seem logical to me. You need to specify the requirements more clearly. E.g., is the "!" always preceded by a line break? And so on. You have to take everything into account.
BalusC
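If the answer to that last question is yes, i.e. the end delimiter always sits on a line of its own, one possible sketch is to anchor it on both sides. This is only an assumption about the requirements, not something the question confirms, and the class name `OwnLineDelimiterDemo` and method name `blocks` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OwnLineDelimiterDemo {

    // Assumption: "start" begins a line and "end" is the only text on its line.
    static List<String> blocks(String text, String start, String end) {
        Pattern p = Pattern.compile(
                "(?sm)^" + Pattern.quote(start) + ".*?^" + Pattern.quote(end) + "$");
        Matcher m = p.matcher(text);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "class!0" and "foo!" contain the delimiter but do not end a block.
        String text = "Name a\nclass!0\nfoo!\n!\nName b\n!";
        for (String block : blocks(text, "Name", "!")) {
            System.out.println(block);
            System.out.println("---------------------------------");
        }
    }
}
```

With this input the sketch yields two blocks, the first running from `Name a` through the lone `!` on the fourth line.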