tags:
views: 83
answers: 5

In C, I want to process a file that contains 10^8 16-character alphanumeric strings and determine whether each one is unique within the file. How can I do that?

+1  A: 

Given that you're talking about ~16 megabytes of data, the obvious way to do it would be to just load the data into a hash table (or something on that order) and count the occurrences of each string.

I can't quite imagine doing this in C though -- most other languages will supply a reasonable data structure (some sort of map), making the job substantially easier.
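A minimal sketch of what that could look like in plain C without library support: a small chaining hash table keyed on the 16-character string, assuming one string per line of the input file. The bucket count and hash function below are illustrative, not tuned.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define KEY_LEN 16
#define NBUCKETS (1u << 24)   /* illustrative; ~16M chains for ~10^8 keys */

struct node {
    char key[KEY_LEN];
    struct node *next;
};

static struct node **buckets;

/* 64-bit FNV-1a over the fixed-length key */
static uint64_t hash(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < KEY_LEN; i++) {
        h ^= (unsigned char)s[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 1 if the key was already in the table, 0 if it was just inserted. */
static int seen_before(const char *key)
{
    uint64_t b = hash(key) % NBUCKETS;
    for (struct node *n = buckets[b]; n; n = n->next)
        if (memcmp(n->key, key, KEY_LEN) == 0)
            return 1;
    struct node *n = malloc(sizeof *n);
    if (!n) { fprintf(stderr, "out of memory\n"); exit(1); }
    memcpy(n->key, key, KEY_LEN);
    n->next = buckets[b];
    buckets[b] = n;
    return 0;
}

int main(int argc, char **argv)
{
    char line[64];
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror("fopen"); return 1; }

    buckets = calloc(NBUCKETS, sizeof *buckets);
    if (!buckets) { fprintf(stderr, "out of memory\n"); return 1; }

    while (fgets(line, sizeof line, fp)) {
        if (strlen(line) < KEY_LEN)
            continue;                     /* skip short/malformed lines */
        if (seen_before(line))
            printf("duplicate: %.16s\n", line);
    }
    fclose(fp);
    return 0;
}
```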

Jerry Coffin
Hmm, it's a bit more than 16 meg according to my math :)
Nikolai N Fetissov
16*10^8 is ~16 GB
Luther Blissett
Can you count, people, for Turing's sake?
Nikolai N Fetissov
@Luther, isn't it [1.5 GB](http://www.google.com/search?q=10^8+*+16+bytes)?
Matthew Flaschen
Sorry, yes 1.6. Dammit.
Luther Blissett
See [GHashTable](http://library.gnome.org/devel/glib/unstable/glib-Hash-Tables.html).
Matthew Flaschen
@Nikolai (etc.) Oops, yes, I got the number of zeros wrong. Of course I can't count -- that's why I use a computer! Nonetheless, it's a small enough amount that fitting it in memory is pretty trivial, so it's mostly a matter of parsing the input file (the format of which remains unclear).
Jerry Coffin
A: 

You'll need to sort the file.

Just load it into a single memory block, run qsort from the C runtime library on that block, and finally scan sequentially over all strings to check whether any two consecutive strings are the same.
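A rough sketch of this approach, assuming the file consists of fixed-width records of 16 characters plus a newline; error handling is kept to a minimum.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16
#define REC_LEN 17          /* 16 characters plus the newline */

static int cmp(const void *a, const void *b)
{
    return memcmp(a, b, KEY_LEN);    /* compare only the 16-char key */
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }

    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    char *buf = malloc(size);
    if (!buf || fread(buf, 1, size, fp) != (size_t)size) {
        fprintf(stderr, "could not read file\n");
        return 1;
    }
    fclose(fp);

    size_t n = size / REC_LEN;
    qsort(buf, n, REC_LEN, cmp);

    /* After sorting, equal keys are adjacent. */
    for (size_t i = 1; i < n; i++)
        if (memcmp(buf + (i - 1) * REC_LEN, buf + i * REC_LEN, KEY_LEN) == 0)
            printf("duplicate: %.16s\n", buf + i * REC_LEN);

    free(buf);
    return 0;
}
```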

Codo
I have to disagree. Sorting an n-element array is O(n log n), while filling a hash table or hash set is amortized O(n). On data of this size, that makes a practical difference.
Matthew Flaschen
@Matthew: However, we are talking about roughly 1.6GB here, and there may be limited space. Sorting the data could be done in place with just a little extra memory, while hashing is going to take a lot more memory. If there's enough available memory (like a well-filled-out 64-bit system not under heavy memory load), do the hashing. Otherwise, sorting in place may well be faster than a solution that involves disk caching (or may not; log 1.6G is fairly big).
David Thornley
@David, that's a valid point. Constant factors (including swapping to disk) can always affect running times. But then again, `log2(10^8)` is 26. Anyway, it's certainly not true that you 'need' to sort it. Hashing at least requires consideration.
Matthew Flaschen
A: 

Use a library that provides set/map functions, e.g. a ready-made hash set or balanced-tree map.

@all I do not know any data structures; I only know the basics. I want to check each and every ID to see whether it is unique among the 10 million strings.
venkat
+2  A: 

As other people have said, the most straightforward method is to simply load the entire file and use something like qsort to sort it.

If you can't load that much into memory at once, another option is to load the data in several passes. On your first pass, read the file and only load in lines that start with A. Sort those and find the unique lines. For the next pass, load all the lines that start with B, sort, and find unique lines. Repeat this process for every alphanumeric character that a line might start with. Using this technique, you should only have to load a fraction of the file into memory at a time and it shouldn't cause you to mis-classify any lines.
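A sketch of what one of those passes could look like, assuming newline-terminated 16-character lines; the alphabet and initial capacity are assumptions for illustration, not part of the original answer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16

static int cmp(const void *a, const void *b)
{
    return memcmp(a, b, KEY_LEN);
}

/* One pass over the file: keep only lines starting with 'first',
 * sort them, and report adjacent duplicates. */
static void check_prefix(const char *path, char first)
{
    FILE *fp = fopen(path, "r");
    if (!fp) { perror("fopen"); exit(1); }

    size_t cap = 1024, n = 0;
    char *keys = malloc(cap * KEY_LEN);
    char line[64];

    while (fgets(line, sizeof line, fp)) {
        if (line[0] != first || strlen(line) < KEY_LEN)
            continue;
        if (n == cap) {
            cap *= 2;
            keys = realloc(keys, cap * KEY_LEN);
        }
        if (!keys) { fprintf(stderr, "out of memory\n"); exit(1); }
        memcpy(keys + n * KEY_LEN, line, KEY_LEN);
        n++;
    }
    fclose(fp);

    qsort(keys, n, KEY_LEN, cmp);
    for (size_t i = 1; i < n; i++)
        if (memcmp(keys + (i - 1) * KEY_LEN, keys + i * KEY_LEN, KEY_LEN) == 0)
            printf("duplicate: %.16s\n", keys + i * KEY_LEN);

    free(keys);
}

int main(int argc, char **argv)
{
    /* Assumed alphabet; extend with lowercase letters if the data uses them. */
    const char *alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    for (const char *c = alphabet; *c; c++)
        check_prefix(argv[1], *c);
    return 0;
}
```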

bta
+1  A: 

Do a bucket sort (using a hash function) into multiple files, one file per bucket. Then process each bucket's file to determine whether all strings within that bucket are unique.
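A sketch of the first (partitioning) pass, assuming newline-terminated lines; the bucket count, file names, and hash function are illustrative. Each resulting bucket file can then be checked on its own, for example with the in-memory sort-and-scan approach from the earlier answer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16
#define NBUCKETS 64

int main(int argc, char **argv)
{
    FILE *in, *out[NBUCKETS];
    char name[32], line[64];

    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    in = fopen(argv[1], "r");
    if (!in) { perror("fopen"); return 1; }

    for (int i = 0; i < NBUCKETS; i++) {
        snprintf(name, sizeof name, "bucket%02d.txt", i);
        out[i] = fopen(name, "w");
        if (!out[i]) { perror("fopen"); return 1; }
    }

    while (fgets(line, sizeof line, in)) {
        /* Simple multiplicative hash over the 16-character key. */
        unsigned h = 0;
        for (int i = 0; i < KEY_LEN && line[i]; i++)
            h = h * 31 + (unsigned char)line[i];
        fputs(line, out[h % NBUCKETS]);
    }

    for (int i = 0; i < NBUCKETS; i++)
        fclose(out[i]);
    fclose(in);
    return 0;
}
```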

Mike