ansaurus

Question

Collections framework to count a file

Answer 1

A:

Does it have to be Java? This is really much more straightforward in Perl.

(also - is this a homework problem? :) )

phatmanace 2010-01-13 21:24:34

Answer 2

+1 A:

That code looks like a fragment of something which counts unique words, which isn't your problem. The structure I suggest you need is a Map whose key is a "word pair" (make a class for this) and whose value is the number of times that "word pair" appears in the input.

Paul Clapham 2010-01-13 21:27:36
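A minimal sketch of the "word pair" Map suggested above. The names WordPairCount, WordPair and countPairs are made up for illustration and do not come from the thread; the important part is that the key class overrides equals and hashCode, otherwise two pairs with the same words would be treated as different keys.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class WordPairCount {
    /** Hypothetical key class: two adjacent words, usable as a HashMap key. */
    static final class WordPair {
        final String first;
        final String second;

        WordPair(String first, String second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof WordPair)) return false;
            WordPair other = (WordPair) o;
            return first.equals(other.first) && second.equals(other.second);
        }

        @Override
        public int hashCode() {
            return Objects.hash(first, second);
        }

        @Override
        public String toString() {
            return first + " " + second;
        }
    }

    /** Counts how often each pair of adjacent words appears in the sequence. */
    static Map<WordPair, Integer> countPairs(String[] words) {
        Map<WordPair, Integer> counts = new HashMap<WordPair, Integer>();
        for (int i = 0; i + 1 < words.length; i++) {
            WordPair pair = new WordPair(words[i], words[i + 1]);
            Integer old = counts.get(pair);
            // First sighting stores 1, later sightings store old + 1.
            counts.put(pair, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "the cat sat on the cat".split(" ");
        for (Map.Entry<WordPair, Integer> e : countPairs(words).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

HashMap does not guarantee any iteration order, so the lines may print in any order.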

If I use something like this: Map<String, List<String>> words = new HashMap<String, List<String>>(); how would I add to the HashMap? I understand it takes two arguments: String = key and List = value? I am just not sure how to go about adding to that. Something like if (word.contains("abc")) words.put("abc", ? )?

Corey 2010-01-13 23:18:09

Answer 3

+3 A:

What about using a Map:

Map<String, List<String>> words = new HashMap<String, List<String>>();

The keys in the map would be unique words, and the values would each be lists of words that followed that unique word. The data structure might look like:

Key    |    Value
--------------------------
abc    |    def, ghi, jkl
def    |    jkl, mno

darren 2010-01-13 21:28:27
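A rough sketch of how a map like the one in the table could be filled, assuming the input has already been split into an array of words. The class and method names (FollowerLists, buildFollowers) and the sample sentence are made up for illustration; the sample is chosen so that the entries for abc and def match the two table rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FollowerLists {
    /** Maps each word to the list of words that directly follow it, in input order. */
    static Map<String, List<String>> buildFollowers(String[] words) {
        Map<String, List<String>> followers = new HashMap<String, List<String>>();
        for (int i = 0; i + 1 < words.length; i++) {
            List<String> list = followers.get(words[i]);
            if (list == null) {
                // First time this word is seen as a key: start with an empty list.
                list = new ArrayList<String>();
                followers.put(words[i], list);
            }
            list.add(words[i + 1]);
        }
        return followers;
    }

    public static void main(String[] args) {
        String[] words = "abc def jkl abc ghi abc jkl def mno".split(" ");
        Map<String, List<String>> followers = buildFollowers(words);
        System.out.println("abc | " + followers.get("abc"));
        System.out.println("def | " + followers.get("def"));
    }
}
```

Note that a follower that occurs several times after the same key appears several times in its list, so the list length is the number of occurrences, not the number of distinct followers.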

+1 for the nicely formatted table :)

Carl Smotricz 2010-01-13 21:44:54

haha thanks, except my "def"'s turned out to be keywords :)

darren 2010-01-13 21:52:24

@darren: That's because Python (among others) uses def as a keyword.

R. Bemrose 2010-01-13 22:08:30

That looks like what I want to do. How would I print out each key's values, and possibly do a count of each? I'm familiar with HashMap but not Map, or a combination of the two. Thank you.

Corey 2010-01-13 22:43:45

I tried this: Map<String, List<String>> Uwords = new HashMap<String, List<String>>(); It won't let me add or put: Uwords.add(word); or Uwords.put(word); How should I add to the HashMap?

Corey 2010-01-13 22:54:49

You must keep track of what your map is mapping. To add stuff to this kind of map, you need to add a String as the key and a List as the value. The list can be empty at first, and then you can add to it as you find words that should be in it. E.g., to add a blank list for abc: Uwords.put("abc", new ArrayList<String>());

darren 2010-01-13 22:58:34
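To make the put-then-add step concrete, here is a small sketch that stores an empty list under a key, adds followers to it, and then prints every key with its values and a count. The class name PutAndPrint and the helper describe are hypothetical; the map is copied into a TreeMap only so that the keys print in sorted order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PutAndPrint {
    /** Formats each entry as "key (count): [values]", one per line, keys sorted. */
    static String describe(Map<String, List<String>> words) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e
                : new TreeMap<String, List<String>>(words).entrySet()) {
            sb.append(e.getKey())
              .append(" (").append(e.getValue().size()).append("): ")
              .append(e.getValue())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> words = new HashMap<String, List<String>>();

        // put takes two arguments: the key and the value (here an empty list).
        words.put("abc", new ArrayList<String>());

        // The empty list is a container; fetch it back and add to it later.
        words.get("abc").add("def");
        words.get("abc").add("ghi");

        System.out.print(describe(words));
    }
}
```

The empty list is stored first so that later code can always call words.get(key).add(...) without checking for null.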

A couple of pesky side notes: use lowercase names for your variables. Also, if you are used to HashMap then you are used to Map. Map is just the interface that the specific HashMap implementation implements.

darren 2010-01-13 22:59:53

I'm still confused about how to add or find words to put in it. Do I do an if (word.contains("abc")) and then Uwords.put("abc", new ArrayList<String>())? I understand HashMap<String, List<String>> takes 2 arguments: String = the key, List = the value? Do I have to do that with "def", "ghi", etc.? I wish you could give me an example of what you're talking about! Also, in Uwords.put("abc", new ArrayList<String>()), the ArrayList<String>() contains nothing? Why are we adding it?

Corey 2010-01-13 23:11:07

I understand HashMap<String, List<String>> takes 2 arguments: String = the key, List = the value? How do we add the value?

Corey 2010-01-13 23:13:38

Keep in mind the .txt file I'm piping in isn't in abc, def format; it contains regular words.

Corey 2010-01-13 23:21:02

Answer 4

A:

One possible approach would be to take your uniqueWords Set and wrap it in a List (to get direct access by index). You could then create a matrix of ints; think of it as a table that has all words in both the columns and the rows. Now run through your text and, for each word, get the positions of this word and its successor in the table, and increment that cell, something like:

table[words.indexOf(currentWord)][words.indexOf(nextWord)]++;

In the end your table will contain the frequencies of every word-word pair. Also, to find further help on your problem, it might help to search for bigrams, which is the common name for this problem.

Fabian Steeg 2010-01-13 21:30:38
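The table idea above might look roughly like this. This is only a sketch: BigramTable, vocabulary and countBigrams are invented names, and a LinkedHashSet is assumed for the unique words so that each word gets a stable index in first-seen order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class BigramTable {
    /** Returns the distinct words in first-seen order, so each has a fixed index. */
    static List<String> vocabulary(String[] text) {
        Set<String> uniqueWords = new LinkedHashSet<String>();
        for (String w : text) {
            uniqueWords.add(w);
        }
        return new ArrayList<String>(uniqueWords);
    }

    /** table[i][j] is how often word j directly follows word i in the text. */
    static int[][] countBigrams(String[] text, List<String> words) {
        int[][] table = new int[words.size()][words.size()];
        for (int i = 0; i + 1 < text.length; i++) {
            table[words.indexOf(text[i])][words.indexOf(text[i + 1])]++;
        }
        return table;
    }

    public static void main(String[] args) {
        String[] text = "a b a b c".split(" ");
        List<String> words = vocabulary(text);
        int[][] table = countBigrams(text, words);
        for (int row = 0; row < words.size(); row++) {
            for (int col = 0; col < words.size(); col++) {
                if (table[row][col] > 0) {
                    System.out.println(words.get(row) + " " + words.get(col)
                            + " : " + table[row][col]);
                }
            }
        }
    }
}
```

Two costs are worth keeping in mind: indexOf on a List is a linear search, and the table needs one cell for every possible pair of words, so this fits small vocabularies better than large texts.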

Answer 5

A:

Various hints:

You could read a file in directly by using

Scanner sc = new Scanner(new File("file.name"));

You could put your so-called "stop words", i.e. "a", "an", "the", into a Set, such as a java.util.HashSet, and then simply test for them by saying something simple like

if (stopWords.contains(word)) ...

For the data structure: This is fairly sophisticated for a "project 1"! Given pairs of words in variables called first and second, I guess what I would use is a HashMap keyed on words in first, and containing as values a second HashMap keyed on words in second. The values of the second hashmap would be the counts for that pair of words, stored as Integer values.

You need to watch out for the corner case where you're seeing a second word for the first time; in that case, you need to store in the second hashmap your second word and Integer.valueOf(1). Otherwise, you need to replace the value with an Integer that's 1 bigger than the previous one.

There's a way you can "cheat" a little and dramatically simplify your data structure: If you "glue" your first and second words together using a separator character, e.g.

String key = first + "_" + second;

then you have a key that contains both words, and you only need a single hashmap to store keys and counts in. However, this makes for a little work later on, when you'll have to have a collection of first words (hint: you can store those in a Set as you're processing the input) and split those keys up again (hint: use key.split("_")).

If you want your words to be automatically sorted in ascending order, you'll probably do well to use TreeMap rather than HashMap.

Carl Smotricz 2010-01-13 21:42:55
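Putting several of these hints together, a sketch of the nested-map version could look like the following. Everything named here (NestedPairCount, STOP_WORDS, count) is illustrative. It also makes assumptions the thread does not settle: words are lowercased, punctuation is not stripped, and a stop word is simply skipped, so the words on either side of it are counted as a pair. TreeMap is used for both levels so the output is sorted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeMap;

class NestedPairCount {
    // Assumed stop-word list, taken from the examples in the answer.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the"));

    /** Reads whitespace-separated words and counts each (first, second) pair. */
    static Map<String, Map<String, Integer>> count(Scanner sc) {
        Map<String, Map<String, Integer>> counts =
                new TreeMap<String, Map<String, Integer>>();
        String first = null;
        while (sc.hasNext()) {
            String second = sc.next().toLowerCase(); // assumption: ignore case
            if (STOP_WORDS.contains(second)) {
                continue; // assumption: stop words are dropped entirely
            }
            if (first != null) {
                Map<String, Integer> inner = counts.get(first);
                if (inner == null) {
                    inner = new TreeMap<String, Integer>();
                    counts.put(first, inner);
                }
                Integer old = inner.get(second);
                // Corner case: a second word seen for the first time starts at 1.
                inner.put(second, old == null ? Integer.valueOf(1)
                                              : Integer.valueOf(old + 1));
            }
            first = second;
        }
        return counts;
    }

    public static void main(String[] args) {
        // For a real file: new Scanner(new java.io.File("file.name")),
        // which throws FileNotFoundException if the file is missing.
        Scanner sc = new Scanner("the cat sat on the mat the cat sat");
        for (Map.Entry<String, Map<String, Integer>> e : count(sc).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The single-map "glued key" variant would replace the inner map with one put per pair on the combined key, at the price of splitting the keys again when reporting.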
