tags:
views: 83
answers: 5

In C, I want to process a file that contains 10^8 16-character alphanumeric strings and determine whether each one is unique within the file. How can I do that?

+1  A: 

Given that you're talking about ~16 megabytes of data, the obvious way to do it would be to just load the data into a hash table (or something on that order) and count the occurrences of each string.

I can't quite imagine doing this in C though -- most other languages will supply a reasonable data structure (some sort of map), making the job substantially easier.
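A minimal sketch of what that could look like in plain C without library support: a small chaining hash table keyed on the 16-character string, assuming one string per line of the input file. The bucket count and hash function below are illustrative, not tuned.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define KEY_LEN 16
#define NBUCKETS (1u << 24)   /* illustrative; ~16M chains for ~10^8 keys */

struct node {
    char key[KEY_LEN];
    struct node *next;
};

static struct node **buckets;

/* 64-bit FNV-1a over the fixed-length key */
static uint64_t hash(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < KEY_LEN; i++) {
        h ^= (unsigned char)s[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 1 if the key was already in the table, 0 if it was just inserted. */
static int seen_before(const char *key)
{
    uint64_t b = hash(key) % NBUCKETS;
    for (struct node *n = buckets[b]; n; n = n->next)
        if (memcmp(n->key, key, KEY_LEN) == 0)
            return 1;
    struct node *n = malloc(sizeof *n);
    if (!n) { fprintf(stderr, "out of memory\n"); exit(1); }
    memcpy(n->key, key, KEY_LEN);
    n->next = buckets[b];
    buckets[b] = n;
    return 0;
}

int main(int argc, char **argv)
{
    char line[64];
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror("fopen"); return 1; }

    buckets = calloc(NBUCKETS, sizeof *buckets);
    if (!buckets) { fprintf(stderr, "out of memory\n"); return 1; }

    while (fgets(line, sizeof line, fp)) {
        if (strlen(line) < KEY_LEN)
            continue;                     /* skip short/malformed lines */
        if (seen_before(line))
            printf("duplicate: %.16s\n", line);
    }
    fclose(fp);
    return 0;
}
```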

Jerry Coffin
Hmm, it's a bit more than 16 meg according to my math :)
Nikolai N Fetissov
16*10^8 is ~16 GB
Luther Blissett
Can you count, people, for Turing's sake?
Nikolai N Fetissov
@Luther, isn't it [1.5 GB](http://www.google.com/search?q=10^8+*+16+bytes)?
Matthew Flaschen
Sorry, yes 1.6. Dammit.
Luther Blissett
See [GHashTable](http://library.gnome.org/devel/glib/unstable/glib-Hash-Tables.html).
Matthew Flaschen
@Nikolai (etc.) Oops, yes, I got the number of zeros wrong. Of course I can't count -- that's why I use a computer! Nonetheless, it's a small enough amount that fitting it in memory is pretty trivial, so it's mostly a matter of parsing the input file (the format of which remains unclear).
Jerry Coffin
A: 

You'll need to sort the file.

Just load it into a single memory block, run qsort from the C runtime library on that block, and finally scan sequentially over all strings to check whether any two consecutive strings are the same.
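A rough sketch of this approach, assuming the file consists of fixed-width records of 16 characters plus a newline; error handling is kept to a minimum.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16
#define REC_LEN 17          /* 16 characters plus the newline */

static int cmp(const void *a, const void *b)
{
    return memcmp(a, b, KEY_LEN);    /* compare only the 16-char key */
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }

    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    char *buf = malloc(size);
    if (!buf || fread(buf, 1, size, fp) != (size_t)size) {
        fprintf(stderr, "could not read file\n");
        return 1;
    }
    fclose(fp);

    size_t n = size / REC_LEN;
    qsort(buf, n, REC_LEN, cmp);

    /* After sorting, equal keys are adjacent. */
    for (size_t i = 1; i < n; i++)
        if (memcmp(buf + (i - 1) * REC_LEN, buf + i * REC_LEN, KEY_LEN) == 0)
            printf("duplicate: %.16s\n", buf + i * REC_LEN);

    free(buf);
    return 0;
}
```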

Codo
I have to disagree. Sorting an n-element array is O(n log n), while filling a hash table or hash set is amortized O(n). On data of this size, that makes a practical difference.
Matthew Flaschen
@Matthew: However, we are talking about roughly 1.6GB here, and there may be limited space. Sorting the data could be done in place with just a little extra memory, while hashing is going to take a lot more memory. If there's enough available memory (like a well-filled-out 64-bit system not under heavy memory load), do the hashing. Otherwise, sorting in place may well be faster than a solution that involves disk caching (or may not; log 1.6G is fairly big).
David Thornley
@David, that's a valid point. Constant factors (including swapping to disk) can always affect running times. But then again, `log2(10^8)` is 26. Anyway, it's certainly not true that you 'need' to sort it. Hashing at least requires consideration.
Matthew Flaschen
A: 

Use a library that provides set/map functions, e.g. a ready-made hash set or balanced-tree map.

@all I do not know any data structures; I only know the basics. I want to check each and every ID to see whether it is unique among the 10 million strings.
venkat
+2  A: 

As other people have said, the most straightforward method is to simply load the entire file and use something like qsort to sort it.

If you can't load that much into memory at once, another option is to load the data in several passes. On your first pass, read the file and only load in lines that start with A. Sort those and find the unique lines. For the next pass, load all the lines that start with B, sort, and find unique lines. Repeat this process for every alphanumeric character that a line might start with. Using this technique, you should only have to load a fraction of the file into memory at a time and it shouldn't cause you to mis-classify any lines.
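A sketch of what one of those passes could look like, assuming newline-terminated 16-character lines; the alphabet and initial capacity are assumptions for illustration, not part of the original answer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16

static int cmp(const void *a, const void *b)
{
    return memcmp(a, b, KEY_LEN);
}

/* One pass over the file: keep only lines starting with 'first',
 * sort them, and report adjacent duplicates. */
static void check_prefix(const char *path, char first)
{
    FILE *fp = fopen(path, "r");
    if (!fp) { perror("fopen"); exit(1); }

    size_t cap = 1024, n = 0;
    char *keys = malloc(cap * KEY_LEN);
    char line[64];

    while (fgets(line, sizeof line, fp)) {
        if (line[0] != first || strlen(line) < KEY_LEN)
            continue;
        if (n == cap) {
            cap *= 2;
            keys = realloc(keys, cap * KEY_LEN);
        }
        if (!keys) { fprintf(stderr, "out of memory\n"); exit(1); }
        memcpy(keys + n * KEY_LEN, line, KEY_LEN);
        n++;
    }
    fclose(fp);

    qsort(keys, n, KEY_LEN, cmp);
    for (size_t i = 1; i < n; i++)
        if (memcmp(keys + (i - 1) * KEY_LEN, keys + i * KEY_LEN, KEY_LEN) == 0)
            printf("duplicate: %.16s\n", keys + i * KEY_LEN);

    free(keys);
}

int main(int argc, char **argv)
{
    /* Assumed alphabet; extend with lowercase letters if the data uses them. */
    const char *alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    for (const char *c = alphabet; *c; c++)
        check_prefix(argv[1], *c);
    return 0;
}
```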

bta
+1  A: 

Do a bucket sort (using a hash function) into multiple files, one file per bucket. Then process each bucket's file to determine whether all strings within that bucket are unique.
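A sketch of the first (partitioning) pass, assuming newline-terminated lines; the bucket count, file names, and hash function are illustrative. Each resulting bucket file can then be checked on its own, for example with the in-memory sort-and-scan approach from the earlier answer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16
#define NBUCKETS 64

int main(int argc, char **argv)
{
    FILE *in, *out[NBUCKETS];
    char name[32], line[64];

    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    in = fopen(argv[1], "r");
    if (!in) { perror("fopen"); return 1; }

    for (int i = 0; i < NBUCKETS; i++) {
        snprintf(name, sizeof name, "bucket%02d.txt", i);
        out[i] = fopen(name, "w");
        if (!out[i]) { perror("fopen"); return 1; }
    }

    while (fgets(line, sizeof line, in)) {
        /* Simple multiplicative hash over the 16-character key. */
        unsigned h = 0;
        for (int i = 0; i < KEY_LEN && line[i]; i++)
            h = h * 31 + (unsigned char)line[i];
        fputs(line, out[h % NBUCKETS]);
    }

    for (int i = 0; i < NBUCKETS; i++)
        fclose(out[i]);
    fclose(in);
    return 0;
}
```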

Mike