I'm trying to just get rid of duplicate consecutive words from a text file, and someone mentioned that I could do something like this:

Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // lineOfWords is a List<String> holding each line read from the text file
    Matcher m = p.matcher(s.toUpperCase());
    // and then do something like
    while (m.find()) {
        // do something here
    }
}

I tried looking at m.end() to see if I could build a new string, or remove the item(s) where the matches are, but I wasn't sure how it works after reading the documentation. For example, as a test case to see how it worked, I did:

if (m.find()) {
    System.out.println(s.substring(i, m.end()));
}

I ran this on a text file that contains: This is an example example test test test.

Why is my output "This is"?

Edit:

Suppose I have an ArrayList lineOfWords that holds each line read from a .txt file, and I create a new ArrayList to hold the modified strings. For example:

List<String> newString = new ArrayList<String>();
for (String s : lineOfWords) {
    // regex and replacement taken from Kobi's answer
    s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    newString.add(s);
}

But then it doesn't give me the new s, only the original s. Is it because of shallow vs deep copy?

+1  A: 

The first match is "ThIS IS an example...", so m.end() points to the end of the second "is". I'm not sure why you use i for the start index; try m.start() instead.
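To see where each match falls, a small sketch like the one below lists the offsets that find() reports for the sample line. The class name MatchOffsets and the helper describeMatches are made up for illustration; the pattern is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOffsets {
    // Hypothetical helper: describes every match as "start-end:matched text".
    static List<String> describeMatches(String line) {
        Pattern p = Pattern.compile("(\\w+) \\1");
        Matcher m = p.matcher(line.toUpperCase());
        List<String> found = new ArrayList<>();
        while (m.find()) {
            // start() is the index of the first matched character,
            // end() is the index just past the last matched character.
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        for (String d : describeMatches(s)) {
            System.out.println(d);
        }
        // The first match starts at 2 and ends at 7, so substring(0, 7) is "This is".
        System.out.println(s.substring(0, 7));
    }
}
```

Because the first match begins inside "This", s.substring(m.start(), m.end()) would give "is is" rather than "This is".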

To improve your regex, use \b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b). Otherwise, as you're seeing, you'll get matches inside of words.
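As a rough illustration of the difference the boundaries make, the sketch below compares the first match of the original pattern with the bounded one on the sample line. BoundaryDemo and firstMatch are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: returns the text of the first match, or null if nothing matches.
    static String firstMatch(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        // Without boundaries the capture can start in the middle of "This".
        System.out.println(firstMatch("(\\w+) \\1", s));       // is is
        // With \b on both sides the capture has to be a whole word.
        System.out.println(firstMatch("(\\b\\w+\\b) \\1", s)); // example example
    }
}
```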

John Kugelman
+3  A: 

Try something like:

s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");

That regex is a bit stronger than yours - it checks for whole words (no partial matches), and gets rid of any number of consecutive repetitions.
The regex captures a first word: \b(\w+)\b, and then attempts to match spaces and repetitions of that word: (\s+\1)+. The final \b is to avoid partial matching of \1, as in "for formatting".
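A minimal sketch of that replacement applied to the sample line from the question, plus the "for formatting" case. The class and method names (RepeatRemover, removeRepeats) are invented for the example.

```java
class RepeatRemover {
    // Hypothetical helper: collapses runs of the same whole word into a single occurrence.
    static String removeRepeats(String s) {
        return s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        // "example example" and "test test test" each collapse to one word.
        System.out.println(removeRepeats("This is an example example test test test."));
        // prints: This is an example test.

        // The trailing \b keeps "for" from matching the start of "formatting".
        System.out.println(removeRepeats("for formatting"));
        // prints: for formatting
    }
}
```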

Kobi
That helped out a lot. Is there a way to check for things that are different case? Like "test Test"?
Crystal
@Crystal - Thanks! You can add `(?i)` at the beginning of the regex to make it case-insensitive; that seems to be the standard approach with `replaceAll`.
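For example, a sketch with the `(?i)` flag prepended to the same regex might look like this. The names are hypothetical; note that `$1` keeps the casing of the first occurrence.

```java
class CaseInsensitiveRepeats {
    // Hypothetical helper: same regex as above, with (?i) so that the
    // backreference \1 also matches when the repeated word differs in case.
    static String removeRepeatsIgnoreCase(String s) {
        return s.replaceAll("(?i)\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(removeRepeatsIgnoreCase("test Test TEST again")); // test again
        System.out.println(removeRepeatsIgnoreCase("Test test again"));      // Test again
    }
}
```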
Kobi
Another question, Kobi, if you have a second: if I am looping through an ArrayList that holds the lines of words from a text file, and I use a foreach loop to go through it, like for (String s : lineOfWords) { s = s.replaceAll..., then how would I add this new "s" to my new ArrayList to return? I think it has to do with shallow vs deep copy, but I'm not sure. I tried pseudo-coding it in my initial question above. Thx!
Crystal
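A minimal sketch of the loop described in that comment, assuming the goal is to return a new list and leave the original untouched. Java strings are immutable and replaceAll returns a new String, so reassigning s only changes the loop variable; adding s to the second list after the reassignment stores the modified text, and no copying (shallow or deep) is involved. The class and method names are hypothetical, and the `(?i)` variant of the regex is used here only as an example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DedupeLines {
    // Hypothetical helper: returns a new list with repeated words removed from each line.
    static List<String> removeRepeats(List<String> lineOfWords) {
        List<String> newString = new ArrayList<>();
        for (String s : lineOfWords) {
            // s now refers to the String returned by replaceAll;
            // the element stored in lineOfWords is not changed.
            s = s.replaceAll("(?i)\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
            newString.add(s);
        }
        return newString;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("This is an example example", "test Test test.");
        System.out.println(removeRepeats(lines)); // [This is an example, test.]
        System.out.println(lines);                // original lines, unchanged
    }
}
```

If the new list still appears to hold the original text, it may be worth checking that the regex actually matched something in those lines, for instance repeats that differ only in case.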