In C, I want to process a file that contains 108 16-digit alphanumeric strings and determine if each one is unique in the file. How can I do that?
Given that you're talking about ~16 megabytes of data, the obvious way to do it would be to just load the data into a hash table (or something on that order) and count the occurrences of each string.
I can't quite imagine doing this in C though -- most other languages will supply a reasonable data structure (some sort of map), making the job substantially easier.
You'll need to sort the file.
Just load it into a single memory block, run qsort
from the C runtime library on the memory block and the finally run sequentially over all strings to check for two consecutive strings that are the same.
As other people have said, the most straightforward method is to simply load the entire file and use something like qsort
to sort it.
If you can't load that much into memory at once, another option is to load the data in several passes. On your first pass, read the file and only load in lines that start with A
. Sort those and find the unique lines. For the next pass, load all the lines that start with B
, sort, and find unique lines. Repeat this process for every alphanumeric character that a line might start with. Using this technique, you should only have to load a fraction of the file into memory at a time and it shouldn't cause you to mis-classify any lines.
Do a bucket sort(Hash function) into multiple files, one file for each bucket. Then process each bucket's file to determine if all strings are unique within the bucket.