I am attempting to parse a text (CSS) file using fscanf and pull out all statements that match this pattern:

@import "some/file/somewhere.css";

To do this, I have the following loop set up:

FILE *file = fopen(pathToSomeFile, "r");
char *buffer = (char *)malloc(sizeof(char) * 9000);

while(!feof(file))
{
    // %*[^@] : Read and discard all characters up to a '@'
    // %8999[^;] : Read up to 8999 characters starting at '@' to a ';'.
    if(fscanf(file, "%*[^@] %8999[^;]", buffer) == 1)
    {
        // Do stuff with the matching characters here.
        // This code is long and not relevant to the question.
    }
}

This works perfectly SO LONG AS the VERY FIRST character in the file is not a '@'. (Literally, a single space before the first '@' character in the CSS file will make the code run fine.)

But if the very first character in the CSS file is a '@', then what I see in the debugger is an infinite loop -- execution enters the while loop, hits the fscanf statement, but does not enter the 'if' statement (fscanf fails), and then continues through the loop forever.

I believe my fscanf formatters may need some tweaking, but am unsure how to proceed. Any suggestions or explanations for why this is happening?

Thank you.

+2  A: 

I'm not an expert on scanf pattern syntax, but my interpretation of yours is:

  • Match a non-empty sequence of non-'@' characters, then
  • Match a non-empty sequence of up to 8999 non-';' characters

So yes, if your string starts with a '@', then the first part will fail.
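
To see this failure in isolation, here is a minimal sketch that applies the question's format string to an in-memory string with sscanf instead of a file. The helper name try_original_format is made up for illustration, and the caller is assumed to pass a buffer of at least 9000 bytes.

```c
#include <stdio.h>

/* Hypothetical helper: runs the question's format against a string and
 * returns sscanf's result, i.e. the number of conversions stored.
 * 'out' is assumed to point to at least 9000 bytes. */
int try_original_format(const char *input, char *out)
{
    out[0] = '\0';
    /* %*[^@] must match at least one non-'@' character, so input that
     * begins with '@' stops the scan before anything is stored. */
    return sscanf(input, "%*[^@] %8999[^;]", out);
}
```

Called with "@import \"a.css\";" this returns 0, while the same text with one leading space returns 1 and stores everything from the '@' up to (but not including) the ';'.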

I think if you start your format string with some whitespace, then fscanf will eat any leading whitespace in your data string, i.e. simply " %8999[^;]".

Oli Charlesworth
You are correct - `" %8999[^;]"` will cause leading whitespace to be discarded.
caf
Oli, thanks. I believe this is already captured in the current formatter: "%*[^@] %8999[^;]" On a file that does not start with a '@', fscanf properly discards whitespace.
Bryan
@Bryan: Indeed. But your current formatter doesn't work with a line that does start with `'@'`, whereas my suggestion does!
Oli Charlesworth
+1  A: 

Oli already said why fscanf failed. Since failure is a normal outcome for fscanf, your busy loop is caused not by the fscanf failure itself but by the missing handling for it.

You have to handle an fscanf failure even if your format were correct for your particular case, because you cannot be sure that the input will always match the format. In fact, far more non-matching inputs exist than matching ones.
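
One way to handle the failure is to let fscanf's return value drive the loop instead of feof. The following is only a sketch under that assumption; the function name count_import_matches is invented for illustration, and the format string is the unchanged one from the question, so a file that begins with '@' still yields no match. The difference is that the loop now terminates.

```c
#include <stdio.h>

/* Hypothetical helper: counts how many times the question's format
 * matches, stopping as soon as fscanf reports anything other than
 * one stored conversion (a matching failure or end of input). */
int count_import_matches(FILE *file)
{
    char buffer[9000];
    int matches = 0;

    while (fscanf(file, "%*[^@] %8999[^;]", buffer) == 1) {
        /* buffer now holds the text from '@' up to the next ';'. */
        matches++;
    }
    return matches;
}
```

Because the loop exits on the first failed call, input that cannot be matched ends the scan instead of spinning forever on the same unread character.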

Tilo Prütz
+1: This is a very good point. `fscanf` is not what you want unless you can *absolutely* guarantee that your input files are 100% well-formed. You're better off with a custom parser.
Oli Charlesworth
Thanks Tilo. I know I can handle the failure. But my trouble was that the input-file IS 100% well-formed; it just fails when there are zero characters before the first one I'm looking for with fscanf (the '@' symbol).
Bryan
@Bryan: The point of my answer (which did not answer your original question ;)) was that if you use fscanf in a loop like the one in your question, then you not only “can” handle the error, you “must” handle it. As reality has shown, you cannot guarantee that fscanf will not fail, even if you think it should not fail on your input. Had you understood this when you asked, a question like “Why does this fscanf fail on that input line?” without mentioning the loop might have been clearer.
Tilo Prütz
A: 

Your format string does the following actions:

  • Read (and discard) 1 or more non-@ characters
  • Read (and discard) 0 or more whitespace characters (due to the space in the format string)
  • Read and store 1 to 8999 non-; characters

Unfortunately, there is no format specifier for reading "zero or more" characters from a user-defined set.

If you don't care about multiple @import statements on a line, you could change your code to read a single line (with fgets) and then extract the @import statement from that line: if the first character is not '@', you can use your current format string with sscanf; otherwise, you could use sscanf(line, "%8999[^;]", buffer).
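
The line-based approach could look roughly like the sketch below. The function name extract_at_statement is hypothetical, 'line' is assumed to be a NUL-terminated line such as fgets would return, and 'out' is assumed to hold at least 9000 bytes.

```c
#include <stdio.h>

/* Hypothetical helper: copies the first '@'-statement on a line (up to,
 * but not including, the ';') into 'out'. Returns 1 on success, 0 if
 * the line contains no '@'. */
int extract_at_statement(const char *line, char *out)
{
    if (line[0] == '@') {
        /* Nothing to skip, so read straight from the '@'. */
        return sscanf(line, "%8999[^;]", out) == 1;
    }
    /* At least one non-'@' character precedes the statement, so the
     * original format string can be used as-is. */
    return sscanf(line, "%*[^@] %8999[^;]", out) == 1;
}
```

Only the first statement on each line is found, and the function does not check that the text after '@' is actually "import"; that check is left to the caller.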

If multiple @import statements on a line should be handled correctly, you could inspect the next character to be read with getc and then put it back with ungetc.
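
A rough sketch of the getc/ungetc idea follows. The function name next_at_statement is made up for illustration, and 'out' is assumed to point to at least 9000 bytes. The skip step runs only when the next character is not '@', which works around the "one or more" requirement of %*[^@].

```c
#include <stdio.h>

/* Hypothetical helper: reads the next '@'-statement from 'file' into
 * 'out' (up to, but not including, the ';'). Returns 1 on success and
 * 0 when no further statement can be read. */
int next_at_statement(FILE *file, char *out)
{
    int c = getc(file);          /* peek at the next character */
    if (c == EOF) {
        return 0;
    }
    ungetc(c, file);             /* put it back for fscanf */

    if (c != '@') {
        /* At least one non-'@' character is waiting, so the skip
         * conversion has something to match. */
        if (fscanf(file, "%*[^@]") == EOF) {
            return 0;
        }
    }
    /* Either we are at '@' now or at end of file. */
    return fscanf(file, "%8999[^;]", out) == 1;
}
```

Calling this in a loop such as while (next_at_statement(file, buffer)) { ... } picks up several statements on one line, because the ';' left in the stream after each match is skipped by the next call.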

Bart van Ingen Schenau
Thanks Bart. Unfortunately, I do care about multiple @import statements on a single line and 90% of the time, the line would open with a '@'. If there really is no way to have fscanf recognize the first character as a matching one, it looks like I'll have to do it manually.
Bryan