ansaurus

Question

Get n Number of words using regex in Java

Answer 1

A:

(See below the break for the next go at this. Leaving this top part here because of thought process...)

Based on my reading of the split() javadoc, I think I know what's going on.

You want to split the string based on whitespace, up to n times.

String [] m = s.split("\\b", nWords);

Then stitch them back together with a single space between tokens, if you must:

StringBuffer strBuf = new StringBuffer();
for (int i = 0; i < nWords; i++) {
    strBuf.append(m[i]).append(" ");
}

Finally, chop that into five equal strings:

String [] out = new String[5];
String str = strBuf.toString();
int length = str.length();
int chopLength = length / 5;
for (int i = 0; i < 5; i++) {
    int startIndex = i * chopLength;
    out[i] = str.substring(startIndex, startIndex + chopLength);
}

It's late at night for me, so you might want to check that one yourself for correctness. I think I got it somewhere in the area code of correct.

OK, here's try number 3. Having run it through a debugger, I can verify that the only problem left is the integer math of slicing strings whose lengths aren't multiples of 5 into five pieces, and how best to deal with the remaining characters.

It ain't pretty, but it works.

String[] sliceAndDiceNTimes(String victim, int slices, int wordLimit) {
    // Add one to the wordLimit here, because the rest of the input string
    // (past the number of times split() does its magic) will be in the last
    // array member
    String [] words = victim.split("\\s", wordLimit + 1);
    StringBuffer partialVictim = new StringBuffer();

    // Stop early if the input has fewer words than wordLimit
    for (int i = 0; i < wordLimit && i < words.length; i++) {
        partialVictim.append(words[i]).append(' ');
    }

    String [] resultingSlices = new String[slices];
    String recycledVictim = partialVictim.toString().trim();
    int length = recycledVictim.length();
    int chopLength = length / slices;

    for (int i = 0; i < slices; i++) {
        int chopStartIdx = i * chopLength;
        resultingSlices[i] = recycledVictim.substring(chopStartIdx, chopStartIdx + chopLength);
    }

    return resultingSlices;
}

Important notes:

"\\s" is the correct regex (that is, \s written as a Java string literal). Using \b produces lots of extra splits, because there are word boundaries at both the beginning and the end of every word.
Added one to the number of times split runs, because the last array member in the String array is the remaining input string that wasn't split. You could also just split the entire string and just use the for loop as-is.
The integer division remainder is still an exercise left for the questioner. :-)
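
One possible way to handle that remainder, as a minimal sketch: hand the first (length % slices) pieces one extra character each, so no characters are dropped. The names EvenSlicer and sliceEvenly are made up for illustration and are not part of the code above.

```java
import java.util.Arrays;

class EvenSlicer {
    // Cuts text into `slices` pieces whose lengths differ by at most one.
    // The first (length % slices) pieces receive the extra character.
    static String[] sliceEvenly(String text, int slices) {
        String[] result = new String[slices];
        int base = text.length() / slices;
        int remainder = text.length() % slices;
        int start = 0;
        for (int i = 0; i < slices; i++) {
            int end = start + base + (i < remainder ? 1 : 0);
            result[i] = text.substring(start, end);
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        // 13 characters into 5 pieces: prints [abc, def, ghi, jk, lm]
        System.out.println(Arrays.toString(sliceEvenly("abcdefghijklm", 5)));
    }
}
```

Note that this still cuts in the middle of words; it only fixes the lost characters.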

WineSoaked 2010-05-08 07:59:39

`\b+` doesn't make sense. You can never follow `\b` with another `\b`. You also need to escape it to `"\\b"` in Java if that's what you wanted.

polygenelubricants 2010-05-08 12:32:29

Good point, I've edited to reflect that.

WineSoaked 2010-05-08 16:58:05

This isn't working properly. I've tried a million things from these suggestions, and this is the closest to what I want, but unfortunately it doesn't work.

Aymon Fournier 2010-05-08 18:41:03

Answer 2

+2 A:

I think the simplest and most efficient way is to repeatedly find a "word":

Pattern p = Pattern.compile("(\\w+)");
Matcher m = p.matcher(chapter);
while (m.find()) {
  String word = m.group();
  ...
}

You can vary the definition of "word" by modifying the regex. What I wrote just uses regex's notion of word characters, and I wonder if it might be more appropriate than what you're trying to do. But it won't for instance include quote characters, which you may need to allow within a word.
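
To stop after n words, the loop only needs a counter. Here is a rough sketch of that, assuming the same \w+ definition of a word; the class and method names (FirstWords, firstWords) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstWords {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns up to n matches of \w+, scanning from the start of text.
    static List<String> firstWords(String text, int n) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (words.size() < n && m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Lorem, ipsum, dolor]
        System.out.println(firstWords("Lorem ipsum dolor sit amet, consectetur", 3));
    }
}
```

With this pattern an apostrophe ends a word, so "don't" comes back as two matches, "don" and "t". That is the quote-character caveat mentioned above.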

Sean Owen 2010-05-08 08:07:21

Alternatively you could also just detect whitespaces and go from there.

Esko 2010-05-08 08:08:51

Answer 3

A:

I'm just going to guess what you need here; hopefully this is close:

public static void main(String[] args) {
    String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, " +
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
        "nisi ut aliquip ex ea commodo consequat. Rosebud.";

    String[] words = text.split("\\s+");
    final int N = words.length;
    final int C = 5;
    final int R = (N + C - 1) / C;
    for (int r = 0; r < R; r++) {
        for (int x = r, i = 0; (i < C) && (x < N); i++, x += R) {
            System.out.format("%-15s", words[x]);
        }
        System.out.println();
    }
}

This produces:

Lorem          sed            dolore         quis           ex             
ipsum          do             magna          nostrud        ea             
dolor          eiusmod        aliqua.        exercitation   commodo        
sit            tempor         Ut             ullamco        consequat.     
amet,          incididunt     enim           laboris        Rosebud.       
consectetur    ut             ad             nisi           
adipisicing    labore         minim          ut             
elit,          et             veniam,        aliquip

Another possible interpretation

This uses java.util.Scanner:

static String nextNwords(int n) {
    return "(\\S+\\s*){N}".replace("N", String.valueOf(n));
}   
static String[] splitFive(String text, final int N) {
    Scanner sc = new Scanner(text);
    String[] parts = new String[5];
    for (int r = 0; r < 5; r++) {
        parts[r] = sc.findInLine(nextNwords(N / 5 + (r < (N % 5) ? 1 : 0)));
    }
    return parts;
}
public static void main(String[] args) {
    String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, " +
      "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
      "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
      "nisi ut aliquip ex ea commodo consequat. Rosebud.";

    for (String part : splitFive(text, 23)) {
        System.out.println(part);
    }
}

This prints the first 23 words of text, divided into five parts:

Lorem ipsum dolor sit amet, 
consectetur adipisicing elit, sed do 
eiusmod tempor incididunt ut labore 
et dolore magna aliqua. 
Ut enim ad minim 

Or if 7:

Lorem ipsum 
dolor sit 
amet, 
consectetur 
adipisicing

Or if 3:

Lorem 
ipsum 
dolor 
<blank>
<blank>

polygenelubricants 2010-05-08 12:24:03

+1 for the second interpretation--that's how I read the question (except the OP's definition of "word" could be clearer). But I would have used `findWithinHorizon(<regex>, 0)`, to span line breaks.

Alan Moore 2010-05-08 21:43:53
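
For reference, here is roughly what the findWithinHorizon variant from the comment above could look like. findInLine stops at a line terminator, while findWithinHorizon with a horizon of 0 searches without a bound, so a part can run across line breaks. This is only a sketch; the class name HorizonSplit and the parameter name totalWords are made up.

```java
import java.util.Scanner;

class HorizonSplit {
    // Same regex idea as nextNwords above: n runs of non-whitespace,
    // each followed by optional whitespace (which may include newlines).
    static String nextNwords(int n) {
        return "(\\S+\\s*){" + n + "}";
    }

    // Splits the first totalWords words of text into 5 parts.
    static String[] splitFive(String text, int totalWords) {
        String[] parts = new String[5];
        try (Scanner sc = new Scanner(text)) {
            for (int r = 0; r < 5; r++) {
                // Earlier parts take one extra word when totalWords % 5 != 0.
                int count = totalWords / 5 + (r < totalWords % 5 ? 1 : 0);
                // Horizon 0 means "no bound", so a match may cross line breaks.
                parts[r] = sc.findWithinHorizon(nextNwords(count), 0);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "one two\nthree four five six\nseven";
        for (String part : splitFive(text, 7)) {
            System.out.println("[" + part.trim() + "]");
        }
    }
}
```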

Answer 4

+2 A:

There is a better alternative made just for this: BreakIterator. It would be the most correct way to parse for words in Java.

fuzzy lollipop 2010-05-08 17:00:47

Can you please write the code using the BreakIterator to get a certain amount of words, and then divide that part into 5 smaller parts?

Aymon Fournier 2010-05-08 18:43:43

+1. The hard part of this problem is sifting individual words out of real-world prose, and BreakIterator is the best tool for that in the Java standard library. The rest is just basic programming.

Alan Moore 2010-05-08 21:24:09
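
Since code was asked for, here is a hedged sketch of the BreakIterator route: take up to n words with java.text.BreakIterator, then group them into parts. The names BreakIteratorWords, firstWords and partition are hypothetical, and treating a token as a word when it starts with a letter or digit is a simplifying assumption, not something the answer specifies.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorWords {
    // Collects up to n words as delimited by BreakIterator's word instance.
    static List<String> firstWords(String text, int n) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE && words.size() < n) {
            String token = text.substring(start, end);
            // The iterator also reports whitespace and punctuation runs;
            // keep only tokens that start with a letter or digit (assumption).
            if (Character.isLetterOrDigit(token.charAt(0))) {
                words.add(token);
            }
            start = end;
            end = it.next();
        }
        return words;
    }

    // Groups words into `parts` lists; earlier lists take the leftover words.
    static List<List<String>> partition(List<String> words, int parts) {
        List<List<String>> result = new ArrayList<>();
        int base = words.size() / parts;
        int extra = words.size() % parts;
        int from = 0;
        for (int i = 0; i < parts; i++) {
            int to = from + base + (i < extra ? 1 : 0);
            result.add(new ArrayList<>(words.subList(from, to)));
            from = to;
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit";
        System.out.println(partition(firstWords(text, 7), 5));
    }
}
```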

Answer 5

A:

I have a really really ugly solution:

public static Object[] getNumberWords(String s, int nWords, int offset){
    Object[] os = new Object[2];
    Pattern p = Pattern.compile("(\\w+)");
    Matcher m = p.matcher(s);
    m.region(offset, m.regionEnd());
    int wc = 0;
    String total = "";
    while (wc < nWords && m.find()) {
      String word = m.group();
      total += word + " ";
      wc++;
    }
    os[0] = total;
    os[1] = total.lastIndexOf(" ") + offset;
    return os;
}

String foo(String s, int n){
    Object[] os = getNumberWords(s, n, 0);
    String a = (String) os[0];
    String m[] = new String[5];
    int indexCount = 0;
    int lastEndIndex = 0;
    for(int count = (n / 5); count <= n; count += (n/5)){
        if(a.length()<count){count = a.length();}
        os = getNumberWords(a, (n / 5), lastEndIndex);
        lastEndIndex = (Integer) os[1];
        m[indexCount] = (String) os[0];
        indexCount++;
    }
    return "Part One: \n" + m[0] + "\n\n" + 
    "Part Two: \n" + m[1] + "\n\n" + 
    "Part Three: \n" + m[2] + "\n\n" +
    "Part Four: \n" + m[3] + "\n\n" + 
    "Part Five: \n" + m[4];
}

Aymon Fournier 2010-05-08 19:07:24

Yeah, that's pretty ugly. I especially don't like the Object[] that you have to just know that os[0] is a String and os[1] is an Integer...

WineSoaked 2010-05-08 19:49:17
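
One way around the Object[] is a tiny result type, so callers no longer need to remember which slot holds what. This is just a sketch; TypedWords, WordRun and takeWords are invented names. It also records the end position reported by the Matcher, which is an index into the original string, rather than deriving it from the rebuilt text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TypedWords {
    // Carries what the Object[] carried: the collected text and where the scan stopped.
    static final class WordRun {
        final String text;
        final int endIndex; // index in the source string just past the last word taken

        WordRun(String text, int endIndex) {
            this.text = text;
            this.endIndex = endIndex;
        }
    }

    private static final Pattern WORD = Pattern.compile("\\w+");

    // Takes up to nWords words from s, starting the search at offset.
    static WordRun takeWords(String s, int nWords, int offset) {
        Matcher m = WORD.matcher(s);
        m.region(offset, s.length());
        StringBuilder total = new StringBuilder();
        int end = offset;
        for (int wc = 0; wc < nWords && m.find(); wc++) {
            if (total.length() > 0) {
                total.append(' ');
            }
            total.append(m.group());
            end = m.end();
        }
        return new WordRun(total.toString(), end);
    }

    public static void main(String[] args) {
        String s = "one two three four";
        WordRun first = takeWords(s, 2, 0);
        WordRun second = takeWords(s, 2, first.endIndex);
        System.out.println(first.text + " | " + second.text); // prints one two | three four
    }
}
```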
