Hi,

I'm trying to create a program that takes a text file of C++ code and outputs another file with that code, minus any comments it contains.

Assuming that rFile and wFile are defined as follows:

ifstream rFile; // File stream object for read only
ofstream wFile; // File stream object for write only

rFile.open("input.txt", ios::in);
wFile.open("output.txt", ios::out);

My first thought was simply to go through the text and do the equivalent of "pen up" (a Logo reference) when a (slightly improved) peek() identifies /*, and "pen down" when it sees */. Of course, after seeing // it would "pen up" until it reaches \n.

The problem with this approach is that the output.txt doesn't include any of the original spaces or newlines.

This was the code (I didn't even try removing comments at this stage):

while (!rFile.eof())
{
rFile>>first;  //first is a char
wFile<<first;
}

So then I tried getting each line of code separately with getline() and then adding an endl to the wFile. It works so far, but makes things so much more complicated, less elegant and the code less readable.

So, I'm wondering if anyone out there has any pointers for me. (no pun intended!)

N.B. This is part of a larger homework assignment that I've been given and I'm limited to using only C++ functions and not C ones.

+5  A: 

UPDATE:

Someone else mentioned this, but I think "get" is probably a better function to use than ">>".

Original post:

The solution is to read the input character-by-character, rather than using "getline()".

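You can read the characters in using ">>" and output them using "<<", but ">>" skips whitespace by default, so you must first turn that off with the std::noskipws manipulator (rFile >> std::noskipws). That way you don't have to use "endl" at all: the line terminator and space characters are read in as individual characters.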

When you see the start of a comment, you can then just stop outputting characters until you eat the appropriate comment terminator.

You also need to make sure to treat "\r\n" as a single terminator when processing the end of a "//" token.

Scott Wisniewski
+2  A: 

Have you considered using the C++ library for regular expressions to find the comment strings? After locating them you could obviously just replace them with empty strings.

anderstornvig
This may cause problems when dealing with comments within uncommented lines of code.
Sev
I disagree. This is a problem where a little state machine, like the one the asker has proposed, is much more usable than a regex, performance-wise as well as readability-wise. A regex might have to match the whole file (it could be one giant comment, after all) or many tiny comments. The state machine reads each character exactly once, makes a decision, and moves on.
balpha
Considering that this is a homework assignment, I am not sure the asker's teacher is okay with an external library :P
m3rLinEz
Regular expressions are the wrong tool for this kind of problem.
hlovdal
Regexes *are* state machines.
Paul Nathan
+1  A: 

Your problem is similar to using fstream to read every character including spaces and newline. If you want to read the file character by character, including new lines and spaces, try istream::get.

m3rLinEz
A: 

If you don't want to or can't use regular expressions, you should use the std::string member functions such as:

find_last_of

find_first_of

to identify the interval of the string you're trying to remove. "\n" marks the end of the line, but that's a bit more complex.

BUT you should follow anderstornvig's advice: regular expressions are now part of TR1, so they are a C++ tool (if you use Visual C++ 2008, including the Express edition, or a recent version of G++; if not, use Boost).

See the third link below for where to start.

For your example:

You should look for "//" after ";" and match all the text after "//" until the end of the line ($ in regex terms).

Also, you should think about comments after curly braces, /* comments, etc. There are plenty of special cases.

Getting started with C++ TR1 regular expressions

Regular Expression Tutorial

Finding Comments in (C) Source Code Using Regular Expressions

anno
+1  A: 

Read each char, and keep several bool variables: one for strings, one for character literals, one for escaping, one for single-line comments and one for block comments.

Only output your char when both the single-line comment and block comment flags are "false".

If you find a // or /* sequence and it's not within a string (so that "/*Abc*/" won't be cropped), set the appropriate boolean.

Oh, I almost forgot. Line breaks and */ sequences should set the respective comment bool back to false.

luiscubal
+1  A: 

In our compiler design class, we used flex and bison to do something similar.

We wrote basic regular expressions to "tokenize" the file, and then we simply manipulated the tokens.

Ape-inago
A: 

I guess it is a little off-topic since you specifically said C++, but I think Perl or Python would be much easier to use. C and C++ are pains in the ass for string stuff.

You could:

  1. Replace ' *\/\/.*' with an empty string to get rid of // comments, and
  2. Read through the file, keeping a flag indicating whether you are inside a /* comment, and not writing anything if you are. Keep in mind that /* comments don't nest.

Edit: Be careful with number 1. I forgot that you have to make sure you aren't inside quotes. Don't use that regex.
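To make the warning in the edit concrete, here is a small sketch that applies the same pattern with std::regex_replace. The function name naiveStripLineComment and the sample lines are illustrative assumptions; the point is only to show the failure on a string literal that contains "//".

```cpp
#include <regex>
#include <string>

// naiveStripLineComment is a hypothetical name for the rejected approach:
// it deletes optional spaces, "//", and everything after it on the line.
std::string naiveStripLineComment(const std::string& line)
{
    static const std::regex lineComment(" *//.*");
    return std::regex_replace(line, lineComment, "");
}
```

On a line such as `int x = 0; // counter` this leaves `int x = 0;`, but on `const char* url = "http://example.com";` it cuts the string literal in half and leaves `const char* url = "http:`, which no longer compiles.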

c4757p
C++ and even C are not regular enough for something this simple to work. Just consider line concatenation with \
MSalters
+1  A: 

I would use the istreambuf_iterator:
This allows you to iterate through the file one character at a time.

This also allows you to separate the processing logic from the looping logic that takes you through the file.

#include <iterator>
#include <iostream>
#include <algorithm>


class CommentFilter
{
    public:
        CommentFilter(std::ostream& output)
            :m_commentOn(false)
            ,m_output(output)
        {}

        // For each character we find, call this method.
        // It is not const: tracking the comment state has to modify m_commentOn.
        void operator()(char c)
        {
            // Check for a change in the comment state (i.e. pen up / pen down).
            // Leaving this for you to do.


            // Now print the stuff you want.
            if (!m_commentOn)
            {
                // If the commentOn is true then we don't print.
                // Otherwise we do.
                m_output << c;
            }
        }
    private:
        bool            m_commentOn;
        std::ostream&    m_output;
};

int main()
{
    CommentFilter   filter(std::cout);

    // The istreambuf_iterator allows you to iterate through a stream one object at a time.
    // In this case we define the object to be a char.
    //
    // So for each object (char) we find, we call the functor filter with that object.
    // This means filter must have a method so that it can be called like this: filter('a')
    // To do this we define operator() (see above).
    std::for_each(  std::istreambuf_iterator<char>(std::cin),
                    std::istreambuf_iterator<char>(),
                    filter
                );
}
Martin York
A: 

I tried to keep it simple and short :-)..

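#include <stdio.h>

FILE *rfd, *wfd;

/* Called after a '/' has been read. Skips a comment if one starts here;
   otherwise writes the '/' and pushes the following character back.
   Note: string and character literals are not handled. */
void end(void)
{
    int ch = fgetc(rfd);   /* int, so that EOF can be told apart from a char */
    int prev = 0;

    switch (ch)
    {
    case '/':
        /* Line comment: skip to the end of the line, but keep the newline. */
        while ((ch = fgetc(rfd)) != EOF && ch != '\n')
            ;
        if (ch == '\n')
            fputc('\n', wfd);
        return;

    case '*':
        /* Block comment: skip until the closing star-slash (or EOF). */
        while ((ch = fgetc(rfd)) != EOF)
        {
            if (prev == '*' && ch == '/')
                break;
            prev = ch;
        }
        return;

    default:
        /* Not a comment: emit the slash and re-read the next character. */
        fputc('/', wfd);
        if (ch != EOF)
            ungetc(ch, rfd);
        return;
    }
}

int main(void)
{
    int ch;

    rfd = fopen("read.txt", "r");
    wfd = fopen("write.txt", "w");
    if (rfd == NULL || wfd == NULL)
    {
        perror("fopen");
        return 1;
    }

    while ((ch = fgetc(rfd)) != EOF)
    {
        if (ch == '/')
            end();
        else
            fputc(ch, wfd);
    }

    fclose(rfd);
    fclose(wfd);
    printf("\ndone\n");
    return 0;
}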
Aman
A: 

The >> operator isn't a complete solution. As you've found out, it likes to skip whitespace. Use the get() member function to get characters, getline() for lines.

Once you've done that, the fun begins.

The pen-up, pen-down method looks good to me. Then comes the problem of what's a comment.

You will want to keep track of quoted strings and character constants to make sure you're not pulling comment markers out of them. ('//' is legal, although implementation-defined, and doesn't start a comment.) You may want to note that a \" or ??/" inside a quoted string doesn't close the string, and similarly for character constants. You may want to note end-of-line subtleties: an end-of-line immediately preceded by \ or ??/ isn't actually an end-of-line. (Or you could ignore the trigraphs; almost everybody else does.)

If you're somewhat familiar with finite state machines (aka deterministic finite automata), you might want to use that approach. Essentially, you're in some state at all times, and on reading a character you perform an action that depends on state and character, and possibly change to another state.

For example, say you're in state READING_ALONG, and you encounter a /. You write nothing and change to the SAW_A_SLASH state. If the next character is *, you enter the C_STYLE_COMMENT state; if it's /, you enter the CPP_STYLE_COMMENT state; otherwise you print "/" and the current character, and go back to READING_ALONG.

David Thornley
A: 

You have a number of states to consider:

  • the state that you are inside of a single quoted string
  • the state that you are in a double quoted string
  • the state where you have found a //
  • the state where you have found a /*
  • and lastly, a single \ on the end of a line,

which can cause some pretty messed up formatting where the compiler and the text highlighter disagree:

include <stdio>;
INT someVariable = 0;
/* where does this comment end? *\
///  I don't know
someVariable = 6;  
// most text editors don't think it ends until here --> */\
   but someVariable = 6;  shouldnt actually be commented out, and this line should be! \
this is also part of the comment ,   a "3 line " one line comment? WTF!
std::cout << someVariable << std::endl;
// even though "someVariable=6" appears to be commented out, it shouldn't be.
// so this will print "6"

// /* \
*/this text should be commented out aswell

Running that code through a comment stripper should return:

include <stdio>;
INT someVariable = 0;
someVariable = 6;  
std::cout << someVariable << std::endl;

And the fun part is when you have to have compiler errors refer to lines of code according to the original mess, not the stripped version.
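One way to deal with the trailing backslash, sketched here under simplifying assumptions, is to join continued lines before looking for comments at all. The helper name spliceLines is made up for illustration, and the sketch ignores trigraphs and "\r\n" line endings.

```cpp
#include <string>

// spliceLines is a hypothetical helper. It removes every backslash that is
// immediately followed by a newline, together with that newline, so that a
// continued line becomes one logical line. Trigraphs and "\r\n" are ignored.
std::string spliceLines(const std::string& source)
{
    std::string result;
    result.reserve(source.size());
    for (std::string::size_type i = 0; i < source.size(); ++i)
    {
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == '\n')
        {
            ++i; // skip the newline too; two physical lines become one
            continue;
        }
        result += source[i];
    }
    return result;
}
```

After splicing, a `*\` at the end of one line followed by `/` at the start of the next reads as `*/`, and a // comment ending in a backslash swallows the following line, matching what the compiler sees. The cost is exactly the problem mentioned above: the spliced text has fewer lines than the original, so line numbers have to be tracked separately if they matter.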

Ape-inago