We have a requirement of reading/writing more than 10 million strings into a file. Also we do not want duplicates in the file. Since the strings would be flushed to a file as soon as they are read we are not maintaining it in memory.
We cannot use hashcode because of collisions in the hash code due to which we might miss a string as duplicate. Two other approaches i found in my googling:
1.Use a message digest algorithm like MD5 - but it might be too costly to calculate and store.
2.Use a checksum algorithm. [i am not sure if this produces a unique key for a string- can someone please confirm]
Is there any other approach avaiable. Thanks.