tags:

views:

40

answers:

2

I feel like this is a pretty common problem but I wasn't really sure what to search for.

I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.

So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick brown! ~fox $a $12a".

In my current approach I have the following problems:

  • What happens when the control strings falls on a buffer boundary?
  • If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.

I'm writing this in straight C so I don't have streams to help me.

Would an alternating double buffer approach work and if so how does one manage the current locations etc.

+2  A: 

If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.

Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.

You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).

Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.

The regular expressions would be:

 [^$]

Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.

$[0-9]{1,3};

This would recognize a dollar sign followed by up 1 to 3 digits followed by a semicolon.

[$]

This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".

Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.

In lex you might have

[^$]{1,1024}          { write_string(yytext); }
$[0-9]{1,3};          { write_char(atoi(yytext)); }
[$]                   { write_char(*yytext); }

and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.

nategoose
+1, quite a long post but answers the question well.
casablanca
That is exactly what I needed to get started. Thanks. I ended up finding a light weight lexer http://www.complang.org/ragel/
joegtp
A: 

The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.

Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

Blrfl