I have a list of 50,000 IDs in a flat file and need to remove any duplicate IDs. Is there an efficient/recommended algorithm for this problem?
Thanks.
Read the file into a dictionary line by line, discarding duplicates as you go. When everything has been read, write it out to a new file.
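A minimal PHP sketch of that approach, assuming one ID per line (the file names ids.txt and unique_ids.txt are just placeholders):

$seen = array();
$in = fopen('ids.txt', 'r');
$out = fopen('unique_ids.txt', 'w');
while (($line = fgets($in)) !== false) {
    $id = trim($line);
    if ($id !== '' && !isset($seen[$id])) {
        $seen[$id] = true;           // remember the ID so later duplicates are skipped
        fwrite($out, $id . "\n");
    }
}
fclose($in);
fclose($out);

Because it reads one line at a time, this only needs memory for the set of unique IDs, not the whole file.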
I guess if you have a large enough memory allowance, you can put all these IDs in an array:
$array[$id] = $id;
This would automatically weed out the dupes.
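For example, a sketch of that idea (again assuming one ID per line, with assumed file names):

$ids = array();
foreach (file('ids.txt', FILE_IGNORE_NEW_LINES) as $id) {
    $ids[$id] = $id;   // assigning by key silently overwrites duplicates
}
file_put_contents('unique_ids.txt', implode("\n", $ids));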
You can do:
file_put_contents($file, implode("\n", array_unique(file($file, FILE_IGNORE_NEW_LINES))));
How does it work? file() reads the file into an array of lines (FILE_IGNORE_NEW_LINES strips the trailing newlines), array_unique() removes the duplicate entries, and file_put_contents() writes the result back.
This solution assumes that you've got one ID per line in the flat file.
I did some experiments once, and the fastest solution I could get in PHP was to sort the items and then remove the duplicates manually.
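Roughly, that approach looks like this (a sketch, with file names assumed):

$ids = file('ids.txt', FILE_IGNORE_NEW_LINES);
sort($ids);                        // duplicates are now adjacent
$unique = array();
$prev = null;
foreach ($ids as $id) {
    if ($id !== $prev) {           // keep only the first of each run of equal IDs
        $unique[] = $id;
        $prev = $id;
    }
}
file_put_contents('unique_ids.txt', implode("\n", $unique));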
If performance isn't that much of an issue for you (which I suspect, since 50,000 is not that much), then you can use array_unique(): http://php.net/array_unique
If you can use a terminal (or native unix execution), the easiest way (assuming that there is nothing else in the file) is:
sort < ids.txt | uniq > filteredIds.txt
You can use the command-line sort program to order and filter the list of IDs. This is a very efficient program and scales well too.
sort -u ids.txt > filteredIds.txt
You can do it via an array and array_unique. In this example I assume your IDs are separated by line breaks; if that's not the case, just change the delimiter:
$file = file_get_contents('/path/to/file.txt');   // read the whole file into a string
$array = explode("\n", $file);                    // split into one ID per element
$array = array_unique($array);                    // drop the duplicates
$file = implode("\n", $array);                    // rebuild the file contents
file_put_contents('/path/to/file.txt', $file);    // overwrite the original file
If you can just explode the contents of the file on a comma (or any delimiter), then array_unique will produce the least (and cleanest) code; otherwise, if you are parsing the file, going with $array[$id] = $id is the fastest and cleanest solution.
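For the comma-delimited case, a sketch might look like this (delimiter and file name are assumptions):

$ids = explode(',', file_get_contents('ids.txt'));
$ids = array_unique(array_map('trim', $ids));     // trim whitespace, then drop duplicates
file_put_contents('ids.txt', implode(',', $ids));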