views: 105

answers: 8
I have a list of 50,000 IDs in a flat file and need to remove any duplicate IDs. Is there an efficient/recommended algorithm for this problem?

Thanks.

+3  A: 

Read the IDs into a dictionary line by line, discarding duplicates as you go. When everything has been read, write the keys out to a new file.
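A minimal PHP sketch of this approach (the file names and the one-ID-per-line layout are assumptions):

// Collect IDs as array keys; assigning to an existing key is a no-op,
// so duplicates are discarded automatically.
$seen = array();
$in = fopen('ids.txt', 'r');
while (($line = fgets($in)) !== false) {
    $id = trim($line);
    if ($id !== '') {
        $seen[$id] = true;
    }
}
fclose($in);

// Write the unique IDs to a new file, one per line.
file_put_contents('unique_ids.txt', implode("\n", array_keys($seen)) . "\n");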

Mitch Wheat
A: 

I guess if you have a large enough memory allowance, you can put all these IDs in an array:

$array[$id] = $id;

This would automatically weed out the dupes, since array keys are unique.
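In context, a minimal sketch (assuming the IDs have already been read into $ids):

$array = array();
foreach ($ids as $id) {
    $array[$id] = $id; // a repeated ID just overwrites the same key
}
$ids = array_values($array); // back to a plain list of unique IDs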

Sabeen Malik
A: 

You can do:

file_put_contents($file, implode("\n", array_unique(file($file, FILE_IGNORE_NEW_LINES))));

How does it work?

  • Read the file using the function file, which returns an array of lines; the FILE_IGNORE_NEW_LINES flag strips the trailing newline from each line.
  • Get rid of the duplicate lines using array_unique.
  • implode the unique lines with "\n" to get a single string.
  • Write the string back to the file using file_put_contents.

This solution assumes that you've got one ID per line in the flat file.

codaddict
+2  A: 

I did some experiments once, and the fastest solution I could get in PHP was sorting the items and then manually removing the duplicates.

If performance isn't that much of an issue for you (which I suspect it isn't; 50,000 items is not that many), then you can use array_unique(): http://php.net/array_unique
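A minimal sketch of the sort-and-scan approach described above (assuming the IDs are already in an array $ids):

sort($ids);
$unique = array();
$prev = null;
foreach ($ids as $id) {
    if ($id !== $prev) {   // after sorting, duplicates are adjacent
        $unique[] = $id;
        $prev = $id;
    }
}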

WoLpH
I ran this and it completed in about a second. I thought it would take longer. Thanks.
Jamie Redmond
@Jamie: The IO will probably dwarf the processing, because generating a unique set has a maximum algorithmic complexity of O(n log n), which is fairly fast, while most disk IO is very slow.
Merlyn Morgan-Graham
A: 

If you can use a terminal (or native unix execution), the easiest way (assuming there is nothing else in the file) is:

sort < ids.txt | uniq > filteredIds.txt

zebediah49
uniq will only work if the IDs are sorted so that duplicates are next to each other.
thetaiko
Good point; I thought uniq auto-sorted. Fixed: sort < ids.txt | uniq > filteredIds.txt
zebediah49
+4  A: 

You can use the command-line sort program to order and filter the list of IDs. It is a very efficient program and scales well too.

sort -u ids.txt > filteredIds.txt
thetaiko
A: 

You can do it via array / array_unique. In this example I guess your IDs are separated by line breaks; if that's not the case, just change it:

$file  = file_get_contents('/path/to/file.txt'); // read the whole file into a string
$array = explode("\n", $file);                   // split into one ID per array entry
$array = array_unique($array);                   // drop the duplicate IDs
$file  = implode("\n", $array);                  // join back into a single string
file_put_contents('/path/to/file.txt', $file);   // overwrite the original file
Hannes
Thanks for the complete answer. I selected an answer before I saw this one.
Jamie Redmond
@Jamie Redmond don't sweat it :)
Hannes
A: 

If you can just explode the contents of the file on a comma (or any other delimiter), then array_unique will produce the least (and cleanest) code, as in the sketch below. Otherwise, if you are parsing the file line by line, going with $array[$id] = $id is the fastest and cleanest solution.
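A one-line sketch of the comma-delimited case (the file path and delimiter are assumptions):

$ids = array_unique(explode(',', file_get_contents('ids.txt')));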

gwagner