views:

65

answers:

2

Hi guys, I am doing an application that will compute all 2 size frequent itemset from a set of transactions. That is the application will have as input a data file (space delimited text file - with the items encoded as integers) and a percentage, given as an integer (e.g. input 2 represents 2%). The application will output in a distinct file each pair of numbers that appears together in the same transaction (a transaction is represented by one line in the file) in more than 2% of all transactions (where 2% is the percentage given in the input). The output file will contain each pair of items in a line together with their support (the number of transactions where they appear) also the application will output (on the screen on in a file) the duration (the time needed to execute the task).

the data file will be like

55 22 33 123 231 414

21 43 432 435 231 4324 534

22 21 33 123 231 534 666 222

...

each line is called a transaction and the input file contains thousands of transactions. I am thinking about using the data mining rule first to find all the single numbers whose appear frequency is larger than 2% in each transaction, and then form pairs for each transaction and at last compare each pair and generate the output file.

anyone has some ideas or code for this please help, if you have code(better in java) for this that will be very helpful Thanks a lot.

A: 

Do each transaction one at a time. For each transaction, find all numbers that are paired. Put them in a HashTable<Integer,Integer> with the number as the key and a value of 1. If there is already an entry, increment the value.

Once you have processed all transactions, go through the HashMap and look for values greater than 2% the total number of transactions. These are your winners.

They can be output directly to a file, or stored in another data structure for sorting first.

Erick Robertson
but how can you compare the pairs in the hashtable, I mean in your way, you are saying store all the pairs of numbers in a hashtable and set key value to indicate the appearence time
starcaller
I am thinking using a pair object to store the pairs generated from each transaction, in that way, you can easily say which pairs are identical and record the show time.but i am not sure how can do it in hash table, can you be more detailed please
starcaller
+1  A: 

Here's one way to count the integers.

public class IntCount {

    public static void main(String[] args) {
        count("123 234 456 678 789 234 234 123");

    }

    public static void count(String transactionLine) {
        String[] parts = transactionLine.split(" ");

        Map<String, Integer> hashTable = new HashMap<String, Integer>();
        // Count duplicates
        for (String s : parts) {
            if (hashTable.get(s) == null) hashTable.put(s, 1);
            else hashTable.put(s, hashTable.get(s) + 1);
        }

        for (String s : hashTable.keySet()) {
            System.out.println("s: " + s + " count: " + hashTable.get(s));
        }
    }
}

Now you can start working through determining the 2% part.

Tony Ennis
do you think it will be better to get all the numbers stored first and then stored all possible pairs in an array of pair objects? i have written a pair class to represent the object and it works fine for the test data, but i am not sure it is a efficient solution
starcaller
It sounds like we can do it one line/transaction at a time which will conserve resources. However, I can't say I understand the 2% part. If a transaction had 100 integers, and 3 of them were the same, would this be 3%?
Tony Ennis