Hey guys, I'm currently trying to implement a function in C that takes two file names as command line arguments and compares the files lexicographically.

The function will return -1 if the contents of the first file are less than the contents of the second file, 1 if the contents of the second file are less than the contents of the first file, and 0 if the files are identical.

Please give me some advice on how I should start with this.

[EDIT]

Hey guys, sorry if any part of the question is unclear; here's the link to the assignment: Original question. It's a uni assignment, so we're expected to do it using only basic C features, probably including only stdio.h, stdlib.h, and string.h. Sorry for the trouble caused. Here's the code I already have. My main problem now is that the function doesn't handle the case where file1.txt (refer to the link) has a longer first line than file2.txt but is actually lexicographically less:

int filecmp(char firstFile[], char secondFile[])
{
    int similarity = 0;
    FILE *file1 = fopen(firstFile, "r");
    FILE *file2 = fopen(secondFile, "r");
    char line1[BUFSIZ];
    char line2[BUFSIZ];

    while (similarity == 0)
    {
        if (fgets(line1, sizeof line1, file1) != NULL)
        {
            if (fgets(line2, sizeof line2, file2) != NULL)
            {
                int length;

                if (strlen(line1) > strlen(line2))
                {
                    length = strlen(line1);
                }
                else
                {
                    length = strlen(line2);
                }

                for (int i = 0; i < length; i++)
                {
                    if (line1[i] < line2[i]) similarity = -1;
                    if (line1[i] > line2[i]) similarity = 1;
                }
            }
            else
            {
                similarity = 1; //As file2 is empty
            }
        }
        else
        {
            if (fgets(line2, sizeof line2, file2) != NULL)
            {
                similarity = -1; // As file1 is empty
            }
            else break;
        }
    }

    fclose(file1);
    fclose(file2);

    return similarity;
}

[END EDIT]

Many thanks,
Jonathan Chua

+1  A: 

Are you allowed to use strcmp?

If so (untested):

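/* Assumes file1 and file2 are FILE pointers already opened for reading,
   and MAX_LINE_LEN is a constant large enough for the longest line. */
int ret = 0;
while (ret == 0)
{
    char line1[MAX_LINE_LEN];
    char line2[MAX_LINE_LEN];
    if (fgets(line1, MAX_LINE_LEN, file1) != NULL)
    {
        if (fgets(line2, MAX_LINE_LEN, file2) != NULL)
        {
            /* strcmp only guarantees the sign of its result, so clamp it to -1/0/1. */
            int cmp = strcmp(line1, line2);
            ret = (cmp > 0) - (cmp < 0);
        }
        else
        {
            ret = 1;    /* file2 ended first */
        }
    }
    else
    {
        if (fgets(line2, MAX_LINE_LEN, file2) != NULL)
        {
            ret = -1;   /* file1 ended first */
        }
        else
        {
            break;      /* both files ended together: identical */
        }
    }
}
return ret;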
Vicky
I think he means byte for byte comparison.
zvrba
I have now tested this and it does work for the test cases I tried.
Vicky
@zvrba: That's what the strcmp line does, if one file didn't terminate before the other one.
Vicky
You're almost there: replace `fgets(line,LEN,file)` with `fread(buf,1,LEN,file)`, and use `memcmp(buf1,buf2,LEN)` instead of `strcmp()`. Don't forget to zero-fill the buffers before reading, or at least add a zero at the position returned by `fread()`.
Javier
Using line-based buffers means that a short line would compare as 'less' than a longer line, even if it has a 0x03 where the other line has the newline. Your approach is valid only for pure ASCII files.
Javier
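For readers who want to see the block-based idea from the comment above written out, here is a minimal sketch. The function name `compare_blocks` and the `CHUNK` size are made up for illustration and come from nobody in this thread. It also differs slightly from the suggestion: instead of zero-filling the buffers, it compares only the bytes `fread()` actually returned, which should also cope with files that contain zero bytes.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 4096  /* arbitrary buffer size, chosen only for this sketch */

/* Hypothetical helper: compares two already-open streams block by block.
   Returns -1, 0, or 1. A read error is treated like end of file. */
int compare_blocks(FILE *a, FILE *b)
{
    unsigned char buf1[CHUNK];
    unsigned char buf2[CHUNK];

    for (;;) {
        size_t n1 = fread(buf1, 1, sizeof buf1, a);
        size_t n2 = fread(buf2, 1, sizeof buf2, b);
        size_t common = n1 < n2 ? n1 : n2;

        /* Compare only the bytes both reads actually produced. */
        int cmp = memcmp(buf1, buf2, common);
        if (cmp != 0)
            return cmp < 0 ? -1 : 1;

        /* Same prefix but different lengths: the shorter file is smaller. */
        if (n1 != n2)
            return n1 < n2 ? -1 : 1;

        /* Both streams are exhausted and nothing differed. */
        if (n1 == 0)
            return 0;
    }
}
```

Because `memcmp()` compares bytes as unsigned char, a 0x03 byte sorts before a newline (0x0A) here, which is the case the line-based version may not handle as intended.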
@Javier, fair point. I kind of assumed that "lexicographically" implied ASCII but I guess it doesn't have to.
Vicky
Yup, I'm allowed to use strcmp, and for the scope of this function "lexicographically" does mean ASCII only. Sorry for the confusion. Thanks for showing the alternative though, Javier. Anyway, Vicky, what test files did you use? I used the ones from the link I posted and your function doesn't work for me. The first problem is that strcmp checks a string at a time, whereas your function assumes it's going to check the entire string. Secondly, there's still the problem where one file is lexicographically less yet has a longer first line, making it seem greater to the function.
jon2512chua
@jon2512chua: I created some test files that exercised various aspects of the issue but didn't keep them around, so I can't tell you exactly what I tested. However I've just checked and it works fine for me when file1 contains a longer but lexicographically previous string than file2 in their first lines. `strcmp` checks the strings character by character so at the first difference it returns -1 in that scenario.
Vicky
@jon2512chua: As a specific example I created File1: `ABCDEF\r\nXYZ` and File2: `RST\r\nXYZ`. This returned -1 as I expected. `strcmp` would have bailed out after checking the first character - A < R, so return -1.
Vicky
@jon2512chua: I also just tested it with the example files in the tutorial you linked to, and it works for me!
Vicky
Oh, it works now. I mistakenly thought strcmp stops at whitespace, because of previous experience with C++. Sorry for the trouble and thanks for your help.
jon2512chua
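To make the point from this comment thread concrete, here is a small sketch of how `strcmp()` behaves on the strings discussed above. The helper name `strcmp_sign` is hypothetical. It is there because the C standard only promises that `strcmp()` returns a negative, zero, or positive value, so code that must return exactly -1, 0, or 1 probably wants to normalise the result.

```c
#include <string.h>

/* Hypothetical helper: collapses strcmp's result to exactly -1, 0, or 1. */
int strcmp_sign(const char *s1, const char *s2)
{
    int cmp = strcmp(s1, s2);
    return (cmp > 0) - (cmp < 0);
}
```

With this helper, `strcmp_sign("ABCDEF\r\n", "RST\r\n")` gives -1 because 'A' < 'R' decides the result at the first character, even though the first string is longer. `strcmp_sign("AB CD", "AB CE")` also gives -1: the comparison runs past the space and only stops at the terminating null character or the first difference.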
+2  A: 

Take a look at the source code of the UNIX cmp utility, e.g. here. The relevant file is regular.c. If you can't use mmap, the principle of implementation through fgetc() is the same: keep reading a single character from each of the two files as long as they compare the same. When (if!) you find a difference, return the result of the comparison. The borderline case of one file being a proper prefix of the other (e.g. "ABC" and "ABCCC") can be resolved by treating EOF as an infinitely small value. This is already neatly solved in C: fgetc() returns a negative value (EOF) only at end of file or on a read error, while proper characters are returned as values >= 0.
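A minimal sketch of the fgetc() approach described above, in case it helps. The function name `compare_streams` is made up for this illustration, and the sketch does not tell a read error apart from end of file.

```c
#include <stdio.h>

/* Hypothetical helper: compares two already-open streams one character
   at a time. Returns -1, 0, or 1. */
int compare_streams(FILE *a, FILE *b)
{
    int ca, cb;

    /* Keep reading while the characters match and neither stream has ended. */
    do {
        ca = fgetc(a);
        cb = fgetc(b);
    } while (ca == cb && ca != EOF);

    /* EOF is negative and every real character is >= 0, so the stream that
       ends first compares as smaller without any special-casing. */
    return (ca > cb) - (ca < cb);
}
```

Line lengths never enter into it: the first differing character decides the result, so a file whose first line is longer but starts with a smaller character still compares as less. For "ABC" against "ABCCC", the fourth read yields EOF for the first file and 'C' for the second, so the result is -1.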

zvrba
Hey, sorry, but I don't think I fully understood your explanation. From what I can gather, you're saying to step through each file one character at a time and compare them, and not to worry about EOF because fgetc already handles it by returning a negative value, right? If that's the case, it's rather similar to what I've done, except that it doesn't solve the problem where a file is lexicographically smaller but has a longer first line. Sorry if I've misinterpreted any part of your explanation.
jon2512chua