ansaurus

Question

How to compare 2 files lexicographically using C

Answer 1

+1 A:

Are you allowed to use strcmp?

If so (untested):

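#include <stdio.h>
#include <string.h>

#define MAX_LINE_LEN 1024  /* assumed value; the original post left it undefined */

/* Compares two open text files line by line.
   Returns <0, 0 or >0, following the strcmp() convention.
   The function name and signature are assumed; the post showed only the body. */
int compare_files(FILE *file1, FILE *file2)
{
    char line1[MAX_LINE_LEN];
    char line2[MAX_LINE_LEN];
    int ret = 0;

    while (ret == 0)
    {
        char *p1 = fgets(line1, MAX_LINE_LEN, file1);
        char *p2 = fgets(line2, MAX_LINE_LEN, file2);

        if (p1 != NULL && p2 != NULL)
        {
            ret = strcmp(line1, line2);  /* both files still have data */
        }
        else if (p1 != NULL)
        {
            ret = 1;                     /* file2 ended first */
        }
        else if (p2 != NULL)
        {
            ret = -1;                    /* file1 ended first */
        }
        else
        {
            break;                       /* both ended together: equal */
        }
    }
    return ret;
}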

Vicky 2010-09-28 10:07:08

I think he means byte for byte comparison.

zvrba 2010-09-28 10:24:06

I have now tested this and it does work for the test cases I tried.

Vicky 2010-09-28 10:26:19

@zvrba: That's what the strcmp line does, if one file didn't terminate before the other one.

Vicky 2010-09-28 10:30:16

you're almost there; replace `fgets(line,LEN,file)` with `fread(buf,1,LEN,file)`, and use `memcmp(buf1,buf2,LEN)` instead of `strcmp()`. ah, don't forget to zero-fill the buffers before reading, or at least add a zero at the position returned by `fread()`.

Javier 2010-09-28 10:31:08
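
A minimal sketch of the `fread()`/`memcmp()` variant described in the comment above. The function name `compare_files_chunked` and the `CHUNK_SIZE` value are hypothetical, not from the thread. Instead of zero-filling the buffers, this version compares only the bytes that `fread()` actually returned and treats the shorter read as the smaller file, which is one possible way to handle the final partial chunk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096  /* hypothetical buffer size, not from the thread */

/* Compares two open streams chunk by chunk.
   Returns <0, 0 or >0, like memcmp(). Open the files in binary mode ("rb"). */
int compare_files_chunked(FILE *f1, FILE *f2)
{
    unsigned char buf1[CHUNK_SIZE];
    unsigned char buf2[CHUNK_SIZE];

    assert(f1 != NULL && f2 != NULL);

    for (;;) {
        size_t n1 = fread(buf1, 1, sizeof buf1, f1);
        size_t n2 = fread(buf2, 1, sizeof buf2, f2);
        size_t common = n1 < n2 ? n1 : n2;

        /* Compare only the bytes both reads actually delivered. */
        int diff = memcmp(buf1, buf2, common);
        if (diff != 0)
            return diff;

        /* Same prefix but different lengths: the shorter file sorts first. */
        if (n1 != n2)
            return n1 < n2 ? -1 : 1;

        /* Both streams are exhausted and no difference was found. */
        if (n1 == 0)
            return 0;
    }
}
```

Because `memcmp()` compares bytes as `unsigned char`, this also copes with embedded zero bytes, which a `strcmp()`-based loop would stop at.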

using line-based buffers means that a short line would compare as 'less' than a longer line, even if it has a 0x03 where the other line has the newline. your approach is valid only for pure ASCII files

Javier 2010-09-28 10:33:05

@Javier, fair point. I kind of assumed that "lexicographically" implied ASCII but I guess it doesn't have to.

Vicky 2010-09-28 11:09:30

Yup I'm allowed to use strcmp, and also for the scope of this function "lexicographically" does mean ASCII only. Sorry for the confusion. Thanks for showing the alternative though Javier. Anyways Vicky, what were the test files you used? Cause I used the one found in the link I posted and your function doesn't work. The first problem is that strcmp checks a string at a time whereas your function assumed that it's gonna check the entire string. Secondly there's still the problem where one file is lexicographically less but yet has a longer first line, making it seem longer to the function.

jon2512chua 2010-09-28 18:20:31

@jon2512chua: I created some test files that exercised various aspects of the issue but didn't keep them around, so I can't tell you exactly what I tested. However I've just checked and it works fine for me when file1 contains a longer but lexicographically previous string than file2 in their first lines. `strcmp` checks the strings character by character, so at the first difference it returns a negative value in that scenario.

Vicky 2010-09-28 20:40:45

@jon2512chua: As a specific example I created File1: `ABCDEF\r\nXYZ` and File2: `RST\r\nXYZ`. This returned -1 as I expected. `strcmp` would have bailed out after checking the first character - A < R, so return -1.

Vicky 2010-09-28 20:42:18

@jon2512chua: I also just tested it with the example files in the tutorial you linked to, and it works for me!

Vicky 2010-09-28 20:48:06

Oh it works now, I misunderstood the function strcmp to terminate at a white space due to previous experience with C++. Sorry for the trouble and thanks for your help.

jon2512chua 2010-09-29 12:08:49
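
A small sketch of the `strcmp` behaviour discussed in the comments above, assuming plain ASCII input. The helper name `strcmp_sign` is hypothetical. The C standard only guarantees the sign of `strcmp`'s result, so the helper normalises it to -1, 0 or 1. `strcmp` stops at the first differing character or at the terminating zero byte, not at whitespace.

```c
#include <assert.h>
#include <string.h>

/* Normalises strcmp()'s result to -1, 0 or 1.
   strcmp() itself only promises a negative, zero or positive value. */
int strcmp_sign(const char *a, const char *b)
{
    int r;

    assert(a != NULL && b != NULL);

    r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

With the first lines from the example above, `strcmp_sign("ABCDEF\r\n", "RST\r\n")` gives -1 because 'A' sorts before 'R', even though the first string is longer. `strcmp_sign("A B", "A C")` also gives -1: the comparison continues past the space and is decided by 'B' versus 'C'.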

Answer 2

+2 A:

Take a look at the source code of the UNIX cmp utility, e.g. here. The relevant file is regular.c. If you can't use mmap, the principle of implementation through fgetc() is the same: keep reading a single character from each of the two files as long as they compare the same. When (if!) you find a difference, return the result of the comparison. The borderline case of one file being a proper prefix of the other (e.g. "ABC" "ABCCC") can be resolved by treating EOF as a value smaller than any character. This is already neatly solved in C, as fgetc() returns a negative value ONLY for EOF; proper characters are >= 0.

zvrba 2010-09-28 10:29:38
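
A minimal sketch of the fgetc() approach described in this answer. The function name `compare_files_bytewise` is hypothetical, and this is not the code of the cmp utility. It relies on the fact stated above: fgetc() returns each byte as a non-negative value, and EOF is negative, so a file that ends first compares as smaller.

```c
#include <assert.h>
#include <stdio.h>

/* Compares two open streams one byte at a time.
   Returns -1, 0 or 1. Open the files in binary mode ("rb"). */
int compare_files_bytewise(FILE *f1, FILE *f2)
{
    int c1, c2;

    assert(f1 != NULL && f2 != NULL);

    /* Advance both streams while the bytes match and neither has ended. */
    do {
        c1 = fgetc(f1);
        c2 = fgetc(f2);
    } while (c1 == c2 && c1 != EOF);

    /* EOF is negative and real bytes are >= 0, so the shorter file
       (a proper prefix of the other) sorts first without a special case. */
    return (c1 > c2) - (c1 < c2);
}
```

A longer first line does not matter here: the loop stops at the first differing byte, so "ABCDEF..." still compares as less than "RST..." because 'A' is less than 'R'.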

Hey sorry but I don't think I totally understood your explanation. But from what I can gather you're saying to step through one character at a time for each file and compare them, and not to worry about EOF as it is already handled by fgetc, which returns a negative value when it happens, right? If that's the case it's rather similar to what I've done, just that it doesn't solve the problem when a file is lexicographically smaller but has a longer first line. Sorry if I've interpreted your explanation wrong in any part.

jon2512chua 2010-09-28 18:14:15
