views:

340

answers:

7

Kind of looking for affirmation here. I have some hand-written code, which I'm not shy to say I'm proud of, which reads a file, removes leading whitespace, processes newline escapes '\' and removes comments starting with #. It also removes all empty lines (also whitespace-only ones). Any thoughts/recommendations? I could probably replace some std::cout's with std::runtime_errors... but that's not a priority here :)

const int RecipeReader::readRecipe()
{
    ifstream is_recipe(s_buffer.c_str());
    if (!is_recipe)
        cout << "unable to open file" << endl;
    while (getline(is_recipe, s_buffer))
    {
        // whitespace+comment
        removeLeadingWhitespace(s_buffer);
        processComment(s_buffer);
        // newline escapes + append all subsequent lines with '\'
        processNewlineEscapes(s_buffer, is_recipe);
        // store the real text line
        if (!s_buffer.empty())
            v_s_recipe.push_back(s_buffer);
        s_buffer.clear();
    }
    is_recipe.close();
    return 0;
}

void RecipeReader::processNewlineEscapes(string &s_string, ifstream &is_stream)
{
    string s_temp;
    size_t sz_index = s_string.find_first_of("\\");
    while (sz_index <= s_string.length())
    {
        if (getline(is_stream,s_temp))
        {
            removeLeadingWhitespace(s_temp);
            processComment(s_temp);
            s_string = s_string.substr(0,sz_index-1) + " " + s_temp;
        }
        else
            cout << "Error: newline escape '\' found at EOF" << endl;
        sz_index = s_string.find_first_of("\\");
    }
}

void RecipeReader::processComment(string &s_string)
{
    size_t sz_index = s_string.find_first_of("#");
    s_string = s_string.substr(0,sz_index);
}

void RecipeReader::removeLeadingWhitespace(string &s_string)
{
    const size_t sz_length = s_string.size();
    size_t sz_index = s_string.find_first_not_of(" \t");
    if (sz_index <= sz_length)
    s_string = s_string.substr(sz_index);
    else if ((sz_index > sz_length) && (sz_length != 0)) // "empty" lines with only whitespace
        s_string.clear();
}

Some extra info: the first s_buffer passed to the ifstream contains the filename, std::string s_buffer is a class data member, so is std::vector v_s_recipe. Any comment is welcome :)

UPDATE: for the sake of not being ungrateful, here is my replacement, all-in-one function that does what I want for now (future holds: parenthesis, maybe quotes...):

void readRecipe(const std::string &filename)
{
    string buffer;
    string line;
    size_t index;
    ifstream file(filename.c_str());
    if (!file)
        throw runtime_error("Unable to open file.");

    while (getline(file, line))
    {
        // whitespace removal
        line.erase(0, line.find_first_not_of(" \t\r\n\v\f"));
        // comment removal TODO: store these for later output
        index = line.find_first_of("#");
        if (index != string::npos)
            line.erase(index, string::npos);
        // ignore empty buffer
        if (line.empty())
            continue;
        // process newline escapes
        index = line.find_first_of("\\");
        if (index != string::npos)
        {
            line.erase(index,string::npos); // ignore everything after '\'
            buffer += line;
            continue; // read next line
        }
        else // no newline escapes found
        {
            buffer += line;
            recipe.push_back(buffer);
            buffer.clear();
        }
    }
}
+8  A: 

Definitely ditch the hungarian notation.

SoapBox
this is so true
Nikko
The "is_" prefix is particularly unfortunate.
Stephen
or `s_string` …
knittl
I did not take the time to comment on this but Joel Spolsky had made a very great article about Hungarian Notation and how it had been perverted. The form you are using here is the perverted one: denoting the type of the variable is redundant and does not bring anything to the table. If you wish to read more, I can only recommend Joel's article http://www.joelonsoftware.com/articles/Wrong.html. As usual with Joel it's a bit lengthy, you can use the search functionality of your browser to get to the point ;)
Matthieu M.
Thanks for the read, I had already ditched it due to all the negative comments on StackOverflow :).
rubenvb
+5  A: 

It's not bad, but I think you're thinking of std::basic_string<T> too much as a string and not enough as an STL container. For example:

void RecipeReader::removeLeadingWhitespace(string &s_string)
{
    s_string.erase(s_string.begin(), 
        std::find_if(s_string.begin(), s_string.end(), std::not1(isspace)));
}
Billy ONeal
I'm reading up on isspace and there seem to be problems with locales and all. Should I be worried or try using something else closer to my current implementation?
rubenvb
@rubenvb: It has locale problems, yes, but no more than using `" \t"` does ;) You could use `boost::is_any_of(" \t")` if you're concerned.
Billy ONeal
More easily done by `istringstream::skipws`. And technically, `string` is not a container.
Potatoswatter
@Potatoswatter: For what reason is `string` not a container? It seems to meet the requirements listed here: http://www.sgi.com/tech/stl/Container.html
Billy ONeal
@Billy: It requires a `char_traits` specialization for the contained type. On second thought, I guess it depends on interpretation of 23.1/3 as a necessary or sufficient condition. And 21.3/2 does assert that `basic_string` is a Reversible Sequence. So I guess technically you're right :v) . Anyway, the right tool for the job here is a stringstream, since the whole line will eventually get parsed. For a multipass parser, which looks like the goal here but could be overkill, OP could store an array of stringstreams instead of an array of strings.
Potatoswatter
@Potatoswatter: what would be the advantage of stringstreams of a plain vector of strings? (please don't forget my update of the question). I was planning on multi-pass parser, just because it's easier (still learning), and allows a more step-wise approach. I was planning on using the operator>> to extract token per token and handle the lines like that.
rubenvb
@ruben: because `operator>>` requires a stream as an operand. I don't really recommend an array of `stringstream` (you would have to subclass `stringstream` to put in a `vector` anyway). I recommend a single-pass parser first, as it will actually be easier, and rewriting the input as a string with newlines into a single second-pass `stringstream` as the next best alternative.
Potatoswatter
The multipass parser was a WIP way of thinking (it allows easy checking of the different steps performed), I'll think about making the whole thing processed in one go.
rubenvb
+4  A: 

I'm not big on methods that modify the parameters. Why not return strings rather than modifying the input arguments? For example:

string RecipeReader::processComment(const string &s)
{
    size_t index = s.find_first_of("#");
    return s_string.substr(0, index);
}

I personally feel this clarifies intent and makes it more obvious what the method is doing.

Samir Talwar
+1  A: 

I'd consider replacing all your processing code (almost everything you've written) with boost::regex code.

Brent Arias
Yuck! (Even if it wasn't yuck, who wants to pull in a several hundred kb library for a few simple string processes like this?)
Billy ONeal
My thought exactly when I was thinking of how to tackle this problem. And heck, I'd use tr1/regex then, not boost :)
rubenvb
+4  A: 

A few comments:

  • As another answer (+1 from me) said - ditch the hungarian notation. It really doesn't do anything but add unimportant trash to every line. In addition, ifstream yielding an is_ prefix is ugly. is_ usually indicates a boolean.
  • Naming a function with processXXX gives very very little information on what it is actually doing. Use removeXXX, like you did with the RemoveLeadingWhitespace function.
  • The processComment function does an unnecessary copy and assignment. Use s.erase(index, string::npos); (npos is default, but this is more obvious).
  • It's not clear what your program does, but you might consider storing a different file format (like html or xml) if you need to post-process your files like this. That would depend on the goal.
  • using find_first_of('#') may give you some false positives. If it's present in quotes, it's not necessarily indicating a comment. (But again, this depends on your file format)
  • using find_first_of(c) with one character can be simplified to find(c).
  • The processNewlineEscapes function duplicates some functionality from the readRecipe function. You may consider refactoring to something like this:

-

string s_buffer;
string s_line;
while (getline(is_recipe, s_line)) {
  // Sanitize the raw line.
  removeLeadingWhitespace(s_line);
  removeComments(s_line);
  // Skip empty lines.
  if (s_line.empty()) continue;
  // Add the raw line to the buffer.
  s_buffer += s_line;
  // Collect buffer across all escaped lines.
  if (*s_line.rbegin() == '\\') continue;
  // This line is not escaped, now I can process the buffer.
  v_s_recipe.push_back(s_buffer);
  s_buffer.clear();
}
Stephen
Thanks. This is actually a preprocessor from a "project file" for a "build system" (personal project, academic, don't take me seriously on the "build system"). The newline escapes are used here like Qt4's qmake reads them, for readability in the file itself. I'll also take a look at the duplicate functionality, and perhaps it will eventually all fit in one function <crosses fingers>. Will also get to fixing the mess and try again with more STL functionality and better function names.
rubenvb
@rubenvb: Ah, I see. I realized what the "escapes" were after rereading :) Augmented the answer with an idea for how you can bring the two functions together.
Stephen
Thanks for the continue idea, never crossed my mind :s
rubenvb
+1  A: 

A few comments:

  • If s_buffer contains the file name to be opened, it should have a better name like s_filename.
  • The s_buffer member variable should not be reused to store temporary data from reading the file. A local variable in the function would do as well, no need for the buffer to be a member variable.
  • If there is not need to have the filename stored as a member variable it could just be passed as a parameter to readRecipe()

  • processNewlineEscapes() should check that the found backslash is at the end of the line before appending the next line. At the moment any backslash at any position triggers adding of the next line at the position of the backslash. Also, if there are several backslashes, find_last_of() would probably easier to use than find_first_of().

  • When checking the result of find_first_of() in processNewlineEscapes() and removeLeadingWhitespace() it would be cleaner to compare against string::npos to check if anything was found.

  • The logic at the end of removeLeadingWhitespace() could be simplified:

    size_t sz_index = s_string.find_first_not_of(" \t");
    if (sz_index != s_string.npos)
       s_string = s_string.substr(sz_index);
    else // "empty" lines with only whitespace
       s_string.clear();
    
sth
I'm changing names and rearranging my class right now. The backslash thing is what I want, everything after a backslash is ignored. I'm also using npos (didn't know how to use it before. You guys are awesome :D
rubenvb
A: 

You might wish to have a look at Boost.String. It's a simple collection of algorithms to work with streams, and notably features trim methods :)

Now, on to the review itself:

Don't bother to remove the hungarian notation, if it's your style then use it, however you should try and improve the names of methods and variables. processXXX is definitely not indicating anything useful...

Functionally, I am worried about your assumptions: the main issue here is that you do not care for espace sequences (\n uses a backslash for example) and you do not worry for the presence of strings of charachters: std::cout << "Process #" << pid << std::endl; would yield an invalid line because of your "comment" preprocessing

Furthermore, since you remove the comments before processing the newline escapes:

i = 3; # comment \
         running comment

will be parsed as

i = 3; running comment

which is syntactically incorrect.

From an interface point of view: there is not benefit in having the methods being class members here, you don't need an instance of RecipeReader really...

And finally, I find it awkward that two methods would read from the stream.

Little peeve of mine: returning by const value does not serve any purpose.

Here is my own version, as I believe than showing is easier than discussing:

// header file

std::vector<std::string> readRecipe(const std::string& fileName);

std::string extractLine(std::ifstream& file);

std::pair<std:string,bool> removeNewlineEscape(const std::string& line);
std::string removeComment(const std::string& line);

// source file

#include <boost/algorithm/string.hpp>

std::vector<std::string> readRecipe(const std::string& fileName)
{
  std::vector<std::string> result;

  ifstream file(fileName.c_str());
  if (!file) std::cout << "Could not open: " << fileName << std::endl;

  std::string line = extractLine(file);
  while(!line.empty())
  {
    result.push_back(line);
    line = extractLine(file);
  } // looping on the lines

  return result;
} // readRecipe


std::string extractLine(std::ifstream& file)
{
  std::string line, buffer;
  while(getline(file, buffer))
  {
    std::pair<std::string,bool> r = removeNewlineEscape(buffer);
    line += boost::trim_left_copy(r.first); // remove leading whitespace
                                            // based on the current locale
    if (!r.second) break;
    line += " "; // as we append, we insert a whitespace
                 // in order unintended token concatenation
  }

  return removeComment(line);
} // extractLine

//< Returns the line, minus the '\' character
//<         if it was the last significant one
//< Returns a boolean indicating whether or not the line continue
//<         (true if it's necessary to concatenate with the next line)
std::pair<std:string,bool> removeNewlineEscape(const std::string& line)
{
  std::pair<std::string,bool> result;
  result.second = false;

  size_t pos = line.find_last_not_of(" \t");
  if (std::string::npos != pos && line[pos] == '\')
  {
    result.second = true;
    --pos; // we don't want to have this '\' character in the string
  }

  result.first = line.substr(0, pos);
  return result;
} // checkNewlineEscape

//< The main difficulty here is NOT to confuse a # inside a string
//< with a # signalling a comment
//< assuming strings are contained within "", let's roll
std::string removeComment(const std::string& line)
{
  size_t pos = line.find_first_of("\"#");
  while(std::string::npos != pos)
  {
    if (line[pos] == '"')
    {
      // We have detected the beginning of a string, we move pos to its end
      // beware of the tricky presence of a '\' right before '"'...
      pos = line.find_first_of("\"", pos+1);
      while (std::string::npos != pos && line[pos-1] == '\')
        pos = line.find_first_of("\"", pos+1);
    }
    else // line[pos] == '#'
    {
      // We have found the comment marker in a significant position
      break;
    }
    pos = line.find_first_of("\"#", pos+1);
  } // looking for comment marker

  return line.substr(0, pos);
} // removeComment

It is fairly inefficient (but I trust the compiler for optmizations), but I believe it behaves correctly though it's untested so take it with a grain of salt. I have focused mainly on solving the functional issues, the naming convention I follow is different from yours but I don't think it should matter.

Matthieu M.
Thanks for the remarks, I've already solved most of them. This things is part of a bigger picture, and I'd like this functionality in a class, it is C++, not plain C after all. I'm trying not to use boost, and in amount of code, the end result (see UPDATE) is shorter than your boost version. I'll give the newline/comment order some thought, but I was planning on every comment-line having a '#', much like C++ comments '//', so you're running comment is indeed a syntax error (and will be handled later, when I have the courage to write a syntax checker. Quotes and parenthesis are WIP :) Thanks
rubenvb