I have written a function that checks whether two files are duplicates. Its signature is:
int check_dup_memmap(char *f1_name, char *f2_name)
It returns:
- (-1) - If something went wrong;
- (0) - If the two files are identical (duplicates);
- (+1) - If the two files are different.
The next step is to write a function that iterates through all the files in a given directory, applies the previous function to them, and reports every duplicate it finds.
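What I am after is roughly the pairwise loop below. This is only a sketch: `report_dups` and `name_cmp_stub` are hypothetical names, the file names are assumed to be already collected in an array, and the comparator is passed in as a function pointer that follows the same return convention as `check_dup_memmap`.

```c
#include <stdio.h>
#include <string.h>

/* Comparator with the same convention as check_dup_memmap:
 * -1 on error, 0 if the files are duplicates, +1 if they differ. */
typedef int (*dup_cmp_fn)(char *f1_name, char *f2_name);

/* Stand-in comparator for trying the loop out: it compares the names only,
 * not the file contents. The real program would pass check_dup_memmap. */
int name_cmp_stub(char *f1_name, char *f2_name)
{
    if (f1_name == NULL || f2_name == NULL)
        return -1;
    return strcmp(f1_name, f2_name) == 0 ? 0 : 1;
}

/* Hypothetical helper: compares every unordered pair in names[0..n-1]
 * and writes one line per duplicate pair to out.
 * Returns the number of duplicate pairs, or -1 if any comparison failed. */
int report_dups(char **names, size_t n, dup_cmp_fn cmp, FILE *out)
{
    int dups = 0;
    int failed = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int r = cmp(names[i], names[j]);
            if (r == 0) {
                fprintf(out, "%s == %s\n", names[i], names[j]);
                dups++;
            } else if (r < 0) {
                fprintf(stderr, "could not compare %s and %s\n",
                        names[i], names[j]);
                failed = 1;
            }
        }
    }
    return failed ? -1 : dups;
}
```

This performs n(n-1)/2 comparisons for n files, which is why the way the names are stored and re-read matters to me.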
Initially I thought of writing a function that generates a file listing all the filenames in a directory, and then reading that file again and again to compare every pair of files. Here is that version of the function, which collects all the filenames in a directory (including its subdirectories):
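#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Recursively writes the path of every non-directory entry under dirname to f.
 * MFILE_LEN and util_get_cwd() are defined elsewhere in my program. */
void build_dir_tree(char *dirname, FILE *f)
{
    DIR *cdir = NULL;
    struct dirent *ent = NULL;
    struct stat buf;

    if (f == NULL) {
        fprintf(stderr, "NULL file submitted. [build_dir_tree].\n");
        exit(-1);
    }
    if (dirname == NULL) {
        fprintf(stderr, "NULL dirname submitted. [build_dir_tree].\n");
        exit(-1);
    }
    if ((cdir = opendir(dirname)) == NULL) {
        char emsg[MFILE_LEN];
        /* snprintf cannot overflow emsg, whatever the length of dirname */
        snprintf(emsg, sizeof emsg, "Cannot open dir: %s [build_dir_tree]\t", dirname);
        perror(emsg);
        return; /* cdir is NULL, so there is nothing to iterate over */
    }
    if (chdir(dirname) != 0) {
        perror("chdir [build_dir_tree]");
        closedir(cdir);
        return;
    }
    while ((ent = readdir(cdir)) != NULL) {
        /* Skip the self and parent entries before touching the file system */
        if (strcmp(".", ent->d_name) == 0 ||
            strcmp("..", ent->d_name) == 0) {
            continue;
        }
        /* lstat does not follow symlinks, so linked directories are not entered */
        if (lstat(ent->d_name, &buf) != 0) {
            perror(ent->d_name);
            continue;
        }
        if (S_ISDIR(buf.st_mode)) {
            build_dir_tree(ent->d_name, f);
        }
        else {
            fprintf(f, "/%s/%s\n", util_get_cwd(), ent->d_name);
        }
    }
    chdir("..");
    closedir(cdir);
}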
Still, I consider this approach rather inefficient, as I have to parse the file again and again.
In your opinion, which of these other approaches should I follow?
- Write a data structure that holds the filenames instead of writing them to a file? I think that for a directory with a lot of files, the memory will become very fragmented.
- Hold all the filenames in an auto-expanding array, so that I can easily access every file by its index, because they will be in a contiguous memory location.
- Map this file into memory using mmap()? But mmap may fail if the file gets too big.
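To make the second option concrete, here is a minimal sketch of what I have in mind. The names `name_list`, `name_list_push` and `name_list_free` are hypothetical, and the doubling growth policy with an initial capacity of 16 is just an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable list of file names; every name is a private copy. */
struct name_list {
    char **items; /* contiguous array of pointers to the names */
    size_t len;   /* number of names stored */
    size_t cap;   /* number of slots allocated */
};

/* Appends a copy of name, doubling the capacity when the list is full.
 * Returns 0 on success, -1 if memory could not be allocated. */
int name_list_push(struct name_list *l, const char *name)
{
    if (l->len == l->cap) {
        size_t ncap = l->cap ? l->cap * 2 : 16;
        char **tmp = realloc(l->items, ncap * sizeof *tmp);
        if (tmp == NULL)
            return -1; /* the old array is still valid */
        l->items = tmp;
        l->cap = ncap;
    }

    size_t n = strlen(name) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return -1;
    memcpy(copy, name, n);
    l->items[l->len++] = copy;
    return 0;
}

/* Releases every stored name and the pointer array itself. */
void name_list_free(struct name_list *l)
{
    for (size_t i = 0; i < l->len; i++)
        free(l->items[i]);
    free(l->items);
    l->items = NULL;
    l->len = 0;
    l->cap = 0;
}
```

The list starts out zero-initialized (`struct name_list l = {0};`). Note that only the pointer array is contiguous in this sketch; each name is still a separate allocation, so it does not fully address my fragmentation concern.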
Any opinions on this? I want to choose the most efficient path and use as few resources as possible. This is a requirement of the program...
EDIT: Is there a way to get the number of files in a certain directory without iterating through it?
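For reference, the iterating version I would like to avoid looks roughly like this. It is only a sketch; `count_entries` is a hypothetical name, and it counts every entry (files and subdirectories alike) except "." and "..".

```c
#include <dirent.h>
#include <string.h>

/* Hypothetical helper: counts the entries of dirname, skipping "." and "..".
 * Returns -1 if the directory cannot be opened. */
long count_entries(const char *dirname)
{
    DIR *d = opendir(dirname);
    if (d == NULL)
        return -1;

    long n = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        n++;
    }
    closedir(d);
    return n;
}
```

Knowing the count up front would let me allocate the array of names once instead of growing it.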