ansaurus

Question

Answer 1

A:

If your code works on a smaller data set, it might just be that the default 64 MB heap that the JVM uses is not enough. Does it work when you pass -Xmx512m as an argument on the Java command line?

rsp 2010-10-03 08:29:04
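To confirm that an `-Xmx` setting actually took effect, a small check like the following can help. This is only a sketch: the class name `HeapCheck` is made up for illustration, and the reported value can differ slightly from the number passed on the command line.

```java
// Hypothetical helper (not part of the original thread): reports the heap limit
// the running JVM was started with.
final class HeapCheck {

    /** Returns the maximum amount of memory the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run as, for example: java -Xmx512m HeapCheck
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it once without `-Xmx` and once with it shows whether the flag reached the JVM that executes the program.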

Thanks. I tried passing -Xmx1024m and it worked, but the computation is very slow: it took 8 minutes to process a file containing 211 lines of numbers. Is there another data structure or algorithm I could use for this?

starcaller 2010-10-03 18:22:00

@starcaller, you are creating a large dataset and comparing derived information. You could instead compare the source dataset and construct the difference dataset from that, which would take much less memory.

rsp 2010-10-04 08:04:57

Answer 2

+2 A:

Okay, suppose you have a Pair class as follows:

public class Pair {

    private final int value1;
    private final int value2;

    public Pair(int value1, int value2) {
        this.value1 = value1;
        this.value2 = value2;
    }

    public int value1() {
        return value1;
    }

    public int value2() {
        return value2;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + value1;
        result = prime * result + value2;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Pair other = (Pair) obj;
        if (value1 != other.value1)
            return false;
        if (value2 != other.value2)
            return false;
        return true;
    }

    @Override
    public String toString() {
        return "(" + value1 + ", " + value2 + ")";
    }

}

Note that it's important to properly implement the equals(Object) and hashCode() methods if you expect instances of the Pair class to behave properly when used in hash-based data structures (for example: Hashtable, HashMap, HashSet, HashMultimap, HashMultiset).
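As a rough illustration of why this matters, the sketch below puts two equal-looking keys into a `HashSet`, once without and once with value-based `equals`/`hashCode`. The classes `IdentityPair` and `ValuePair` are hypothetical stand-ins written for this example; `ValuePair` mirrors the hashing scheme of the `Pair` class above.

```java
import java.util.HashSet;
import java.util.Set;

final class EqualsHashCodeDemo {

    /** Hypothetical key that inherits Object's identity-based equals/hashCode. */
    static final class IdentityPair {
        final int a;
        final int b;
        IdentityPair(int a, int b) { this.a = a; this.b = b; }
    }

    /** Hypothetical key with value-based equals/hashCode, like Pair above. */
    static final class ValuePair {
        final int a;
        final int b;
        ValuePair(int a, int b) { this.a = a; this.b = b; }

        @Override
        public int hashCode() {
            return 31 * (31 + a) + b;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof ValuePair)) return false;
            ValuePair other = (ValuePair) obj;
            return a == other.a && b == other.b;
        }
    }

    /** Adds two separately created (3, 7) keys that lack equals/hashCode overrides. */
    static int distinctIdentityPairs() {
        Set<IdentityPair> set = new HashSet<IdentityPair>();
        set.add(new IdentityPair(3, 7));
        set.add(new IdentityPair(3, 7));
        return set.size();
    }

    /** Adds two separately created (3, 7) keys that compare by value. */
    static int distinctValuePairs() {
        Set<ValuePair> set = new HashSet<ValuePair>();
        set.add(new ValuePair(3, 7));
        set.add(new ValuePair(3, 7));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctIdentityPairs()); // prints 2: treated as different keys
        System.out.println(distinctValuePairs());    // prints 1: recognised as the same key
    }
}
```

Without the overrides, every occurrence of the same pair would become its own map entry, so no pair would ever be seen on more than one line.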

Now, this code will read in a file ~~(requires the Guava libraries)~~:

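    // Needs these imports: java.io.BufferedReader, java.io.File, java.io.FileReader, java.io.Reader,
    // java.util.ArrayList, java.util.Collection, java.util.Collections, java.util.Comparator,
    // java.util.HashMap, java.util.HashSet, java.util.List, java.util.Map
    File file = ...;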

    final Map<Pair, Collection<Integer>> lineNumbersByPair = new HashMap<Pair, Collection<Integer>>();

    /*
     * Step 1: Read in the lines, one by one.
     */
    Reader reader = new FileReader(file);
    try {
        BufferedReader bufferedReader = new BufferedReader(reader);
        try {
            String line;

            int lineNumber = 0;
            while ((line = bufferedReader.readLine()) != null) {
                lineNumber++;

                String[] tokens = line.split("\\s+");
                int[] values = new int[tokens.length];

                for (int i = 0; i < tokens.length; i++) {
                    values[i] = Integer.parseInt(tokens[i]);
                }

                for (int i = 0; i < values.length; i++) {
                    for (int j = i + 1; j < values.length; j++) {
                        Pair pair = new Pair(values[i], values[j]);

                        Collection<Integer> lineNumbers;
                        if (lineNumbersByPair.containsKey(pair)) {
                            lineNumbers = lineNumbersByPair.get(pair);
                        } else {
                            lineNumbers = new HashSet<Integer>();
                            lineNumbersByPair.put(pair, lineNumbers);
                        }
                        lineNumbers.add(lineNumber);
                    }
                }
            }
        } finally {
            bufferedReader.close();
        }
    } finally {
        reader.close();
    }

    /*
     * Step 2: Identify the unique pairs. Sort them according to how many lines they appear on (most number of lines to least number of lines).
     */
    List<Pair> pairs = new ArrayList<Pair>(lineNumbersByPair.keySet());
    Collections.sort(
            pairs,
            new Comparator<Pair>() {
                @Override
                public int compare(Pair pair1, Pair pair2) {
                    Integer count1 = lineNumbersByPair.get(pair1).size();
                    Integer count2 = lineNumbersByPair.get(pair2).size();
                    return count1.compareTo(count2);
                }
            }
        );
    Collections.reverse(pairs);

    /*
     * Step 3: Print the pairs and their line numbers.
     */
    for (Pair pair : pairs) {
        Collection<Integer> lineNumbers = lineNumbersByPair.get(pair);
        if (lineNumbers.size() > 1) {
            System.out.println(pair + " appears on the following lines: " + lineNumbers);
        }
    }

In a test, the code read in a file with 20,000 lines, each line containing 10 numbers ranging anywhere between 0 and 1000.

Adam Paynter 2010-10-03 21:29:30
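For anyone who wants to try a similar test, an input file of that shape could be produced along the following lines. This is a sketch under assumptions, not the generator used in the answer: the class name, the file name `pairs-test.dat`, the fixed seed, and treating the range 0 to 1000 as inclusive are all choices made for this example.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class TestFileGenerator {

    /** Builds lineCount lines, each holding numbersPerLine random integers in [0, maxValue]. */
    static List<String> generateLines(int lineCount, int numbersPerLine, int maxValue, long seed) {
        Random random = new Random(seed);
        List<String> lines = new ArrayList<String>(lineCount);
        for (int i = 0; i < lineCount; i++) {
            StringBuilder line = new StringBuilder();
            for (int j = 0; j < numbersPerLine; j++) {
                if (j > 0) {
                    line.append(' ');
                }
                line.append(random.nextInt(maxValue + 1));
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "pairs-test.dat" is a made-up file name for this example.
        PrintWriter out = new PrintWriter("pairs-test.dat");
        try {
            for (String line : generateLines(20000, 10, 1000, 42L)) {
                out.println(line);
            }
        } finally {
            out.close();
        }
    }
}
```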

Thanks, but when I tried to compile the file, the compiler reported an error on the line `final Multimap<Pair, Integer> lineNumbersByPair = HashMultimap.create();`, saying it can't find the symbol.

starcaller 2010-10-03 23:10:14

That's because the program relies on the Guava libraries (as mentioned in the answer). The libraries can be obtained from here: http://code.google.com/p/guava-libraries/downloads/list Unzip the file and look for a file named `guava-r07.jar`. Include the JAR file when you compile the program.

Adam Paynter 2010-10-03 23:13:06

I downloaded it along with the Apache Commons Collections library, and have already added these imports: `import org.apache.commons.collections.BidiMap;import org.apache.commons.collections.Factory;import org.apache.commons.collections.MultiHashMap;import org.apache.commons.collections.MultiMap;import org.apache.commons.collections.bidimap.DualHashBidiMap;import org.apache.commons.collections.map.LazyMap;` What else should I import? Thanks.

starcaller 2010-10-03 23:19:00

@starcaller: Okay, I have revised my answer to no longer depend on the Guava libraries. It should compile using only the standard Java library now.

Adam Paynter 2010-10-03 23:21:29

It is amazing what using the collections does for this job; the running time improved so much. Thanks so much.

starcaller 2010-10-03 23:47:12

@starcaller: You're welcome. I'm glad to help!

Adam Paynter 2010-10-04 11:24:25

@Adam Paynter: Hi Adam, I have another problem. When I run the program on a .dat file that is 14.7 MB, with the JVM memory set to 1500 MB, it throws an exception: `OutOfMemoryError: Java heap space`. When I tried to set a higher limit, such as 1600 MB, the JVM reported that the value is invalid because the specified size exceeds the maximum representable size.

starcaller 2010-10-07 23:04:50

@starcaller: I'm not sure how to properly configure the heap size. You may want to check for a question regarding that. If not, perhaps you should ask a separate question.

Adam Paynter 2010-10-07 23:13:14

@starcaller: Also, you may want to try reverting the code back to the version that uses the Guava libraries (that is, the version that uses `Multimap` and `HashMultimap`). The classes you need to import are `com.google.common.collect.Multimap` and `com.google.common.collect.HashMultimap`. This may prevent excessive numbers of collections from being created.

Adam Paynter 2010-10-07 23:15:06

@Adam: Can you post the code using HashMultimap again, please? I tried to write it but got a bit lost. Thanks.

starcaller 2010-10-07 23:21:48

@starcaller: Check my answer's [revisions](http://stackoverflow.com/posts/3851674/revisions) (hopefully you have enough reputation to see them). It should be the first revision.

Adam Paynter 2010-10-07 23:24:02

@Adam: Thanks, now I have to add the Google package to the path.

starcaller 2010-10-07 23:48:41

@Adam: I wrote the code in JCreator and it compiled fine, but when I tried to run it from the command prompt, it failed with an error saying it can't find the Google package. Why does this happen, given that I have added the path in JCreator?

starcaller 2010-10-07 23:55:09

@starcaller: Java needs access to third party libraries both at *compile time* and at *run time*. By adding it to the path in JCreator, you have given Java access to the libraries at *compile time*. When you run the program, you must also make sure that Java has access to the same libraries. If you're running it via the command line, you will probably have to add the `-cp` command line argument. For example: `java -cp .;guava.jar your.awesome.Program` (the semicolon separates the paths - use a colon (`:`) if you're not on Windows)

Adam Paynter 2010-10-08 09:07:48
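One way to see what the running JVM can actually reach is a small diagnostic like the sketch below. The class name `ClasspathCheck` is invented for this example; the Guava class name it probes is the one mentioned earlier in the thread.

```java
import java.io.File;

final class ClasspathCheck {

    /** Returns true if the named class can be located at run time, without initializing it. */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ';' on Windows, ':' on most other platforms
        System.out.println("Path separator: " + File.pathSeparator);
        // The class path the JVM was started with (what -cp ends up setting)
        System.out.println("Class path: " + System.getProperty("java.class.path"));
        System.out.println("Guava available: "
                + isAvailable("com.google.common.collect.HashMultimap"));
    }
}
```

If the last line reports `false` when the program is started from the command line, the JAR is most likely missing from the `-cp` argument.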

@Adam: I tried using the Google package, but it still throws an exception: `OutOfMemoryError: Java heap space`.

starcaller 2010-10-13 04:36:07

@Adam: Is there any better way to store the pairs?

starcaller 2010-10-13 05:01:32

@starcaller: I don't know a solution immediately off the top of my head (other than trying a heap space larger than `-Xmx1024m`). You *may* find some help from the [Colt library](http://acs.lbl.gov/software/colt/). It's meant for "High Performance Scientific and Technical Computing in Java".

Adam Paynter 2010-10-13 10:58:38

@Adam: If I don't need to track which lines a pair appears on, and only need to record how many lines it appears on, how should I modify the code? Thanks.

starcaller 2010-10-13 14:54:22

@Adam: I am thinking of creating the pair collection up front, based on the numbers that can occur in the file regardless of which line they are on (the numbers range from 1 to 999), then comparing the pairs from each line against this premade collection and updating an int value that records how many times each pair appears. ` final Multimap<Pair, int> allPair = HashMultimap.create(); for(int i=0;i<999;i++) { for(int j=i+1;j<1000;j++) { Pair pair = new Pair(i, j); if(countItem[i]!=0 } }`

starcaller 2010-10-13 14:59:21

@Adam: I got it working by pre-building the map, and now it takes 110 s to read a 15 MB file. So happy, thanks!

starcaller 2010-10-13 15:39:29
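The final code was not posted, but the premade-collection idea described in the comments above could look roughly like the sketch below. It makes several assumptions that go beyond the thread: values are integers from 0 to 999, a pair is treated as unordered, a number repeated within a line counts only once for that line, and the names `PairLineCounter` and `countLines` are invented here. A fixed `int[1000][1000]` table takes about 4 MB and avoids creating a `Pair` object and a collection per pair.

```java
import java.util.Arrays;
import java.util.List;

final class PairLineCounter {

    // Assumption taken from the comments: every number lies between 1 and 999.
    static final int MAX_VALUE = 999;

    /**
     * Returns a table where counts[a][b], for a < b, is the number of lines
     * on which both a and b appear. Values outside 0..MAX_VALUE are not
     * handled and would cause an ArrayIndexOutOfBoundsException.
     */
    static int[][] countLines(Iterable<String> lines) {
        int[][] counts = new int[MAX_VALUE + 1][MAX_VALUE + 1];
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Mark which values occur on this line (duplicates collapse here).
            boolean[] present = new boolean[MAX_VALUE + 1];
            for (String token : line.split("\\s+")) {
                present[Integer.parseInt(token)] = true;
            }
            // Bump the counter of every unordered pair of values on this line.
            for (int a = 0; a <= MAX_VALUE; a++) {
                if (!present[a]) {
                    continue;
                }
                for (int b = a + 1; b <= MAX_VALUE; b++) {
                    if (present[b]) {
                        counts[a][b]++;
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("1 2 3", "2 3 4", "3 2 2");
        int[][] counts = countLines(sample);
        System.out.println("(2, 3) appears on " + counts[2][3] + " lines"); // 3 lines
        System.out.println("(1, 2) appears on " + counts[1][2] + " lines"); // 1 line
    }
}
```

To use it on a real file, the lines would be read with a `BufferedReader` as in the answer's Step 1 and passed in one by one or as a list.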


compare pairs stored in hashtable java
