views:

50

answers:

4

I am scanning over a large amount of data and looking for common trends in it. Every time I meet a recurrence of a unit, I want to increment the count of it. What is the best data structure or way to hold this data. I need to be able to search it quickly, and also have a count with each unit of data.

+1  A: 

Dictionary/Hash table would be the best if you need fast look up.

Jesse Weigert
+4  A: 

You didn't specify a language, but a hash (associative array) is your best data structure.

It can sometimes be called a map/hashmap depending on a language (Java has HashMaps, Perl hash hashes, .

A hash/associative array/map data structure consists of a set of key-value pairs, with the values settable/gettable by the key. In you case, the key will be a string representing a word, a byte, or a double word (separate 3 hashmaps) and the value will be the count of the frequency.

DVK
+1  A: 

As has been mentioned, dictionaries/hash tables are your best bet. But your question is a little clear and I noticed that you mentioned compression in your tags; you may want to look at Huffman trees also.

kerkeslager
+1  A: 

As others have noted, a hash is an obvious candidate for your data structure.

For development and testing purposes, however, I would want that structure to be richer than a simple tally for each matched item. Rather, I would want store information that could be used to confirm the correctness of the code.

For starters, that information might contain the line number and some indication of the position where the match occurred. Here is an illustration in Perl:

use strict;
use warnings;

my %regexes= (
    rep_letter => qr/ ([a-z])         (\1   )+ /x,
    rep_word   => qr/ (\b \w+ \b) \W* (\1\W*)+ /x,
    doub_word  => qr/ (\b \w+   ) \W+  \1      /x,
);

my %ds;

while (my $line = <>){
    for my $r (keys %regexes){
        while ( $line =~ /$regexes{$r}/g ){
            # Data structure:
            #   $ds{REGEX_TYPE}{REPEATED_ITEM} = [
            #       [LINE_NO, pos_VALUE_OF_MATCH],
            #       etc. for each match
            #   ]
            #
            # For example:
            #   $ds{rep_word}{foo} = [
            #       [ 3, 11],
            #       [12, 88],
            #       ...
            #   ]
            push @{$ds{$r}{$1}}, [$., pos($line)];
        }
    }
}
FM