How to count the number of times each word appears in a String in Java using a regular expression?

A: 

Must you use a regex? If not, this might help:

public static int count(final String string, final String substring)
  {
     int count = 0;
     int idx = 0;

     while ((idx = string.indexOf(substring, idx)) != -1)
     {
        idx++;
        count++;
     }

     return count;
  }
fredley
That would count two `abba`'s in the string `abbabba`, which can't be correct, IMO.
Bart Kiers
My actual requirement is: for "hi hi this", hi --> 2 and this --> 1. They are separate words.
rgksugan
Make it `idx += substring.length()` to fix the abbabba issue. To match whole words: does indexOf take a regex?
Amarghosh
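
For reference, a minimal sketch of the fix suggested in the comment above: advance by the substring's length so that matches cannot overlap. The class name `SubstringCounter` and the guard for an empty substring are assumptions added for illustration; they are not part of the original answer.

```java
public class SubstringCounter {
    // Counts non-overlapping occurrences of substring in string.
    public static int count(final String string, final String substring) {
        if (substring.isEmpty()) {
            // Assumption: treat an empty search string as "no matches"
            // so the loop below cannot run forever.
            return 0;
        }
        int count = 0;
        int idx = 0;
        while ((idx = string.indexOf(substring, idx)) != -1) {
            idx += substring.length(); // skip past the whole match
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("abbabba", "abba"));  // 1
        System.out.println(count("hi hi this", "hi")); // 3, because "this" contains "hi"
    }
}
```

Note that this still counts the `hi` inside `this`, so on its own it does not meet the whole-word requirement.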
+6  A: 

I don't think a regex can solve your problem completely.

You want to

  1. Split a string into words. A regular expression can do this only for a very simple definition of a word, "parts of a string separated by whitespace or punctuation", which is not a very good definition even if you stick to English text.

  2. Count the number of occurrences of each word derived from step 1. To do that you must store some kind of mapping, and regexes neither store nor count.

A workable approach could be to

  • split the input string (by regex or other means) into an array of word strings
  • iterate over the array and build a Map that keeps a count for each word
  • iterate over the map to output a list of words and the number of occurrences of each.
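
The three steps above could look roughly like this. It is only a sketch under some assumptions: a "word" is taken to be a run of `\w` characters, Java 8+ is available for `Map.merge`, and the names `WordCountSketch` and `countWords` are made up for the example. A `TreeMap` is used only so the output order is repeatable.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Steps 1 and 2: split the input into words, then tally each word in a map.
    public static Map<String, Integer> countWords(String input) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : input.split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split yields a leading empty string if the input starts with a separator
            }
            counts.merge(word, 1, Integer::sum); // add 1, or start at 1 for a new word
        }
        return counts;
    }

    public static void main(String[] args) {
        // Step 3: iterate over the map and print each word with its count.
        for (Map.Entry<String, Integer> e : countWords("hi hi this").entrySet()) {
            System.out.println(e.getKey() + " --> " + e.getValue());
        }
    }
}
```

For the input `"hi hi this"` this prints `hi --> 2` and `this --> 1`.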

If your input is limited to English, you still have to decide how your algorithm should treat things like "they're" vs. "they are", as well as compound words. Add other languages to the mix for additional headaches: different ways of writing the same word, words split into parts, spelling that depends on where in a sentence the word occurs, and so on.

alfirin
+1 for also mentioning the linguistic issue, which is indeed a little complex.
Neil Coffey
+1  A: 

I would split your task into a) identifying words and b) counting the occurrences of each unique word in the text.

a) could be solved by splitting the text with a regex. b) could be solved by building a map from the result of a).

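import java.util.HashMap;
import java.util.Map;

String text = "I like good mules. Mules are good :)";
// Split on runs of non-word characters (whitespace and punctuation)
String[] words = text.split("([\\W\\s]+)");
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String word : words) {
    // Increment the existing count, or start at 1 for a new word
    if (counts.containsKey(word)) {
        counts.put(word, counts.get(word) + 1);
    } else {
        counts.put(word, 1);
    }
}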

result (HashMap iteration order is not guaranteed): {Mules=1, are=1, good=2, mules=1, like=1, I=1}

deadsven
`\W` also matches everything `\s` matches, so there's no need to include `\s` in the character class.
Bart Kiers
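
A quick way to check the point made in the comment above, as a small sketch. The class and method names (`SplitEquivalence`, `sameSplit`) are made up for illustration.

```java
import java.util.Arrays;

public class SplitEquivalence {
    // Returns true if both patterns split the text into the same pieces.
    public static boolean sameSplit(String text) {
        String[] withS = text.split("([\\W\\s]+)");
        String[] withoutS = text.split("\\W+");
        return Arrays.equals(withS, withoutS);
    }

    public static void main(String[] args) {
        String text = "I like good mules. Mules are good :)";
        System.out.println(sameSplit(text));                       // true
        System.out.println(Arrays.toString(text.split("\\W+")));   // [I, like, good, mules, Mules, are, good]
    }
}
```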
A: 
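import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \b matches only at word boundaries, so the "abba" inside "abbabba" is not counted
Pattern p = Pattern.compile("\\babba\\b");
Matcher m = p.matcher("abba is abba with abbabba and abba doing abba");
int count = 0;
while (m.find()) {
    count++;
}
System.out.println(count); // 4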
Amarghosh
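
The same `Matcher.find()` loop can be stretched to count every word rather than one fixed word. This is a sketch, not a definitive solution: it assumes a word is any run of `\w` characters, uses `Map.merge` (Java 8+), and the names `RegexWordCount` and `countWords` are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexWordCount {
    // Assumption: a "word" is one or more word characters (letters, digits, underscore).
    private static final Pattern WORD = Pattern.compile("\\w+");

    public static Map<String, Integer> countWords(String text) {
        // LinkedHashMap keeps words in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("abba is abba with abbabba and abba doing abba"));
        // {abba=4, is=1, with=1, abbabba=1, and=1, doing=1}
    }
}
```

The regex only finds the words; the counting still has to happen in the map.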
A: 
Michael D