I am using MS-DOS to pipe in a file. I am trying to write a program that counts how many times each word pair appears in a text file. A word pair consists of two consecutive words (i.e. a word and the word that directly follows it). In the first sentence of this paragraph, the words “counts” and “how” are a word pair.

What I want the program to do is take this input:

abc def abc ghi abc def ghi jkl abc xyz abc abc abc ---

and produce this output:

abc:
abc, 2
def, 2
ghi, 1
xyz, 1

def:
abc, 1
ghi, 1

ghi:
abc, 1
jkl, 1

jkl:
abc, 1

xyz:
abc, 1

BTW: I am excluding "a", "the", and "and", which have nothing to do with the word pairs.

What is the best way to do this? Please be nice, I am new to Java. This is what I have so far:

import java.util.Scanner;
import java.util.ArrayList;
import java.util.TreeSet;
import java.util.Iterator;
import java.util.HashSet;

public class Project1
{
    public static void main(String[] args)
    {
        Scanner sc = new Scanner(System.in); 
        String word;
        String grab;
        int number;

        // ArrayList<String> a = new ArrayList<String>();
        // TreeSet<String> words = new TreeSet<String>();
        HashSet<String> uniqueWords = new HashSet<String>();

        System.out.println("project 1\n");

        while (sc.hasNext())
        {
            word = sc.next();
            word = word.toLowerCase();

            // "---" marks the end of the input, so stop before storing it
            if (word.equals("---"))
            {
                break;
            }

            // skip the excluded words "a", "and" and "the"
            if (!(word.equals("a") || word.equals("and") || word.equals("the")))
            {
                uniqueWords.add(word);
            }
        }

        System.out.println("size");
        System.out.println(uniqueWords.size());

        System.out.println("unique words");
        System.out.println(uniqueWords);

        System.out.println("\nbye...");
    }
}

Sorry about the formatting. It's hard to get it right in here...

A: 

Does it have to be Java? This is really much more straightforward in Perl.

(Also, is this a homework problem? :) )

phatmanace
+1  A: 

That code looks like a fragment of something which counts unique words, which isn't your problem. The structure I suggest you need is a Map whose key is a "word pair" (make a class for this) and whose value is the number of times that "word pair" appears in the input.
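
As a rough illustration of that structure, here is a minimal sketch. The names WordPair, PairCountSketch and countPairs are made up for this example and do not come from the question. It assumes the words are already in an array and simply counts each pair of neighbours.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class PairCountSketch {
    /** Counts every pair of consecutive entries in words. */
    static Map<WordPair, Integer> countPairs(String[] words) {
        Map<WordPair, Integer> counts = new HashMap<WordPair, Integer>();
        for (int i = 0; i + 1 < words.length; i++) {
            WordPair pair = new WordPair(words[i], words[i + 1]);
            Integer old = counts.get(pair);
            // first sighting starts at 1, otherwise add one to the old count
            counts.put(pair, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "abc def abc ghi abc def".split(" ");
        Map<WordPair, Integer> counts = countPairs(words);
        System.out.println(counts.get(new WordPair("abc", "def"))); // prints 2
    }
}

/** Hypothetical key class: two consecutive words, compared by value. */
final class WordPair {
    final String first;
    final String second;

    WordPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    // equals and hashCode must agree so that equal pairs hit the same map entry
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordPair)) {
            return false;
        }
        WordPair other = (WordPair) o;
        return first.equals(other.first) && second.equals(other.second);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first, second);
    }
}
```

Without equals and hashCode on the key class, two WordPair objects holding the same words would be treated as different keys and every count would stay at 1.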

Paul Clapham
If I use something like this: Map<String, List<String>> words = new HashMap<String, List<String>>(); how would I add to the HashMap? I understand it takes two arguments: String = key and List = value? I am just not sure how to go about adding to that. if (word.contains("abc")) word.put("abc", ?) something like that?
Corey
+3  A: 

What about using a Map:

Map<String, List<String>> words = new HashMap<String, List<String>>();

The keys in the map would be unique words, and the values would each be lists of words that followed that unique word. The data structure might look like:

Key    |    Value
--------------------------
abc    |    def, ghi, jkl
def    |    jkl, mno
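
A map like this could be filled roughly as follows. This is only a sketch: the class name FollowerListSketch and the method followers are invented for the example, and it assumes the words have already been read into an array.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FollowerListSketch {
    /** Maps each word to the list of words that directly followed it. */
    static Map<String, List<String>> followers(String[] words) {
        Map<String, List<String>> map = new HashMap<String, List<String>>();
        for (int i = 0; i + 1 < words.length; i++) {
            List<String> list = map.get(words[i]);
            if (list == null) {
                // first time this word is seen: start it off with an empty list
                list = new ArrayList<String>();
                map.put(words[i], list);
            }
            list.add(words[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = followers("abc def abc ghi abc def".split(" "));
        // HashMap does not guarantee any particular key order
        for (Map.Entry<String, List<String>> e : map.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each list keeps duplicates, so a count for one pair can be obtained afterwards with java.util.Collections.frequency(list, word).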
darren
+1 for the nicely formatted table :)
Carl Smotricz
haha thanks, except my "def"'s turned out to be keywords :)
darren
@darren: That's because Python (among others) uses def as a keyword.
R. Bemrose
That looks like what I want to do. How would I print out each key's value, and possibly do a count of each? I'm familiar with HashMap but not Map. Or a combination of the two. Thank you.
Corey
I tried this: Map<String, List<String>> Uwords = new HashMap<String, List<String>>(); It won't let me add or put: Uwords.add(word); or Uwords.put(word); How should I add to the HashMap?
Corey
You must keep track of what your map is mapping. To add stuff to this kind of map, you need to add a String as the key and a List as the value. The list can be empty at first, and then you can add to it as you find words that should be in it. E.g. to add a blank list for abc: Uwords.put("abc", new ArrayList<String>())
darren
A couple of pesky side notes: use lowercase names for your variables. Also, if you are used to HashMap then you are used to Map. Map is just the interface that the specific HashMap implementation implements.
darren
I'm still confused about how to add or find words to put in it. Do I do an if word.contains("abc") then Uwords.put("abc", new ArrayList<String>())? I understand HashMap<String, List<String>>(); this takes 2 arguments: String = key, List = the value? Do I have to do that with "def", "ghi" etc.? I wish you could give me an example of what you're talking about! Also, in Uwords.put("abc", new ArrayList<String>()), ArrayList<String>() contains nothing? Why are we adding it?
Corey
I understand HashMap<String, List<String>>(); takes 2 arguments: String = the key, List = value? How do we add the value?
Corey
Keep in mind the .txt file I'm piping in isn't in abc, def format. It contains regular words.
Corey
A: 

One possible approach would be to take your uniqueWords Set and wrap it in a List (to get direct access by index). You could then create a matrix of ints; think of it as a table that has all words in both the columns and the rows. Now run through your text and, for each word, get the position of this word and its successor in the table, and count that up, something like:

table[words.indexOf(currentWord)][words.indexOf(nextWord)]++;

In the end your table will contain the frequencies of every word-word pair. Also, to find further help on your problem, it might help to search for bigrams, which is the common name for this problem.
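
A small sketch of how the table idea might look in full. The names BigramTableSketch, countBigrams and vocabulary are made up for this example, and it assumes every word in the text is also in the vocabulary list (otherwise indexOf returns -1 and the array access fails).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class BigramTableSketch {
    /** Builds table[i][j] = how often vocabulary word j directly follows vocabulary word i. */
    static int[][] countBigrams(List<String> text, List<String> vocabulary) {
        int[][] table = new int[vocabulary.size()][vocabulary.size()];
        for (int i = 0; i + 1 < text.size(); i++) {
            int row = vocabulary.indexOf(text.get(i));
            int col = vocabulary.indexOf(text.get(i + 1));
            table[row][col]++;
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("abc def abc ghi abc def".split(" "));
        // sorted list of the distinct words, so every word has a fixed index
        List<String> vocabulary = new ArrayList<String>(new TreeSet<String>(text));
        int[][] table = countBigrams(text, vocabulary);
        System.out.println(table[vocabulary.indexOf("abc")][vocabulary.indexOf("def")]); // prints 2
    }
}
```

The table needs one cell per possible pair, so its size grows with the square of the number of distinct words, and each indexOf call scans the list. That is fine for small inputs but may be a poor fit for a large text.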

Fabian Steeg
A: 

Various hints:

  • You could read a file in directly by using

    Scanner sc = new Scanner(new File("file.name"));

  • You could put your so-called "stop words", i.e. "a", "and", "the" into a Set, such as a java.util.HashSet, and then simply test for it by saying something simple like

    if (stopWords.contains(word)) ...

  • For the data structure: This is fairly sophisticated for a "project 1"! Given pairs of words in variables called first and second, I guess what I would use is a HashMap keyed on words in first, and containing as values a second HashMap keyed on words in second. The values of the second hashmap would be the counts for that pair of words, stored as Integer values.

  • You need to watch out for the corner case where you're seeing a second word for the first time; in that case, you need to store in the second hashmap your second word and Integer.valueOf(1). Otherwise, you need to replace the value with an Integer that's 1 bigger than the previous one.

  • There's a way you can "cheat" a little and dramatically simplify your data structure: If you "glue" your first and second words together using a separator character, e.g.

    String key = first + "_" + second;

then you have a key that contains both words, and you only need a single hashmap to store keys and counts in. However, this makes for a little work later on, when you'll need a collection of first words (hint: you can store those in a Set as you're processing the input) and will have to split those keys up again (hint: use key.split("_")).
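
The glued-key idea could be sketched like this. GluedKeySketch and countPairs are names invented for the example, and the sketch assumes no word contains the "_" separator itself.

```java
import java.util.HashMap;
import java.util.Map;

class GluedKeySketch {
    /** Counts consecutive pairs under a single "first_second" key. */
    static Map<String, Integer> countPairs(String[] words) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i + 1 < words.length; i++) {
            String key = words[i] + "_" + words[i + 1];
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countPairs("abc def abc ghi abc def".split(" "));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // split is an instance method on String, called on the key itself
            String[] pair = e.getKey().split("_");
            System.out.println(pair[0] + " -> " + pair[1] + ", " + e.getValue());
        }
    }
}
```

The lines come out in no particular order here, because HashMap does not sort its keys.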

If you want your words to be automatically sorted in ascending order, you'll probably do well to use TreeMap rather than HashMap.
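
Putting the hints together, a map-of-maps version might look like the sketch below. NestedMapSketch, countPairs and STOP_WORDS are names made up for this example. It assumes that a stop word is skipped entirely, so the words on either side of it end up forming a pair; the question does not say whether that is the intended behaviour.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeMap;

class NestedMapSketch {
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList("a", "and", "the"));

    /** Reads words until "---" or end of input and counts each consecutive pair. */
    static Map<String, Map<String, Integer>> countPairs(Scanner sc) {
        Map<String, Map<String, Integer>> counts = new TreeMap<String, Map<String, Integer>>();
        String first = null;
        while (sc.hasNext()) {
            String second = sc.next().toLowerCase();
            if (second.equals("---")) {
                break;
            }
            if (STOP_WORDS.contains(second)) {
                continue; // assumption: stop words are dropped before pairing
            }
            if (first != null) {
                Map<String, Integer> inner = counts.get(first);
                if (inner == null) {
                    inner = new TreeMap<String, Integer>();
                    counts.put(first, inner);
                }
                Integer old = inner.get(second);
                inner.put(second, old == null ? 1 : old + 1);
            }
            first = second;
        }
        return counts;
    }

    public static void main(String[] args) {
        // sample input from the question; swap in new Scanner(System.in) to read piped input
        Scanner sc = new Scanner("abc def abc ghi abc def ghi jkl abc xyz abc abc abc ---");
        for (Map.Entry<String, Map<String, Integer>> outer : countPairs(sc).entrySet()) {
            System.out.println(outer.getKey() + ":");
            for (Map.Entry<String, Integer> inner : outer.getValue().entrySet()) {
                System.out.println(inner.getKey() + ", " + inner.getValue());
            }
            System.out.println();
        }
    }
}
```

Because both levels are TreeMaps, the first words and the words following them come out in ascending order, which matches the grouping shown in the question's sample output.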

Carl Smotricz