ansaurus

Question

[C++] std::string manipulation: whitespace, "newline escapes '\'" and comments #

Answer 1

+8 A:

Definitely ditch the hungarian notation.

SoapBox 2010-05-22 14:57:36

this is so true

Nikko 2010-05-22 14:59:06

The "is_" prefix is particularly unfortunate.

Stephen 2010-05-22 15:01:48

or `s_string` …

knittl 2010-05-22 15:25:20

I did not take the time to comment on this but Joel Spolsky had made a very great article about Hungarian Notation and how it had been perverted. The form you are using here is the perverted one: denoting the type of the variable is redundant and does not bring anything to the table. If you wish to read more, I can only recommend Joel's article http://www.joelonsoftware.com/articles/Wrong.html. As usual with Joel it's a bit lengthy, you can use the search functionality of your browser to get to the point ;)

Matthieu M. 2010-05-23 12:29:27

Thanks for the read, I had already ditched it due to all the negative comments on StackOverflow :).

rubenvb 2010-05-24 16:07:27

Answer 2

+5 A:

It's not bad, but I think you're thinking of std::basic_string<T> too much as a string and not enough as an STL container. For example:

void RecipeReader::removeLeadingWhitespace(string &s_string)
{
    s_string.erase(s_string.begin(), 
        std::find_if(s_string.begin(), s_string.end(), std::not1(isspace)));
}

Billy ONeal 2010-05-22 15:01:51

I'm reading up on isspace and there seem to be problems with locales and all. Should I be worried or try using something else closer to my current implementation?

rubenvb 2010-05-22 15:46:10

@rubenvb: It has locale problems, yes, but no more than using `" \t"` does ;) You could use `boost::is_any_of(" \t")` if you're concerned.

Billy ONeal 2010-05-22 16:06:58

More easily done by `istringstream::skipws`. And technically, `string` is not a container.

Potatoswatter 2010-05-22 18:41:32

@Potatoswatter: For what reason is `string` not a container? It seems to meet the requirements listed here: http://www.sgi.com/tech/stl/Container.html

Billy ONeal 2010-05-23 06:40:02

@Billy: It requires a `char_traits` specialization for the contained type. On second thought, I guess it depends on interpretation of 23.1/3 as a necessary or sufficient condition. And 21.3/2 does assert that `basic_string` is a Reversible Sequence. So I guess technically you're right :v) . Anyway, the right tool for the job here is a stringstream, since the whole line will eventually get parsed. For a multipass parser, which looks like the goal here but could be overkill, OP could store an array of stringstreams instead of an array of strings.

Potatoswatter 2010-05-23 07:29:03

@Potatoswatter: what would be the advantage of stringstreams of a plain vector of strings? (please don't forget my update of the question). I was planning on multi-pass parser, just because it's easier (still learning), and allows a more step-wise approach. I was planning on using the operator>> to extract token per token and handle the lines like that.

rubenvb 2010-05-23 12:30:45

@ruben: because `operator>>` requires a stream as an operand. I don't really recommend an array of `stringstream` (you would have to subclass `stringstream` to put in a `vector` anyway). I recommend a single-pass parser first, as it will actually be easier, and rewriting the input as a string with newlines into a single second-pass `stringstream` as the next best alternative.

Potatoswatter 2010-05-23 18:27:59

The multipass parser was a WIP way of thinking (it allows easy checking of the different steps performed), I'll think about making the whole thing processed in one go.

rubenvb 2010-05-24 16:09:20

Answer 3

+4 A:

I'm not big on methods that modify the parameters. Why not return strings rather than modifying the input arguments? For example:

string RecipeReader::processComment(const string &s)
{
    size_t index = s.find_first_of("#");
    return s_string.substr(0, index);
}

I personally feel this clarifies intent and makes it more obvious what the method is doing.

Samir Talwar 2010-05-22 15:03:57

Answer 4

+1 A:

I'd consider replacing all your processing code (almost everything you've written) with boost::regex code.

Brent Arias 2010-05-22 15:06:34

Yuck! (Even if it wasn't yuck, who wants to pull in a several hundred kb library for a few simple string processes like this?)

Billy ONeal 2010-05-22 15:27:34

My thought exactly when I was thinking of how to tackle this problem. And heck, I'd use tr1/regex then, not boost :)

rubenvb 2010-05-22 15:28:42

Answer 5

+4 A:

A few comments:

As another answer (+1 from me) said - ditch the hungarian notation. It really doesn't do anything but add unimportant trash to every line. In addition, ifstream yielding an is_ prefix is ugly. is_ usually indicates a boolean.
Naming a function with processXXX gives very very little information on what it is actually doing. Use removeXXX, like you did with the RemoveLeadingWhitespace function.
The processComment function does an unnecessary copy and assignment. Use s.erase(index, string::npos); (npos is default, but this is more obvious).
It's not clear what your program does, but you might consider storing a different file format (like html or xml) if you need to post-process your files like this. That would depend on the goal.
using find_first_of('#') may give you some false positives. If it's present in quotes, it's not necessarily indicating a comment. (But again, this depends on your file format)
using find_first_of(c) with one character can be simplified to find(c).
The processNewlineEscapes function duplicates some functionality from the readRecipe function. You may consider refactoring to something like this:

-

string s_buffer;
string s_line;
while (getline(is_recipe, s_line)) {
  // Sanitize the raw line.
  removeLeadingWhitespace(s_line);
  removeComments(s_line);
  // Skip empty lines.
  if (s_line.empty()) continue;
  // Add the raw line to the buffer.
  s_buffer += s_line;
  // Collect buffer across all escaped lines.
  if (*s_line.rbegin() == '\\') continue;
  // This line is not escaped, now I can process the buffer.
  v_s_recipe.push_back(s_buffer);
  s_buffer.clear();
}

Stephen 2010-05-22 15:18:28

Thanks. This is actually a preprocessor from a "project file" for a "build system" (personal project, academic, don't take me seriously on the "build system"). The newline escapes are used here like Qt4's qmake reads them, for readability in the file itself. I'll also take a look at the duplicate functionality, and perhaps it will eventually all fit in one function <crosses fingers>. Will also get to fixing the mess and try again with more STL functionality and better function names.

rubenvb 2010-05-22 15:25:57

@rubenvb: Ah, I see. I realized what the "escapes" were after rereading :) Augmented the answer with an idea for how you can bring the two functions together.

Stephen 2010-05-22 15:39:53

Thanks for the continue idea, never crossed my mind :s

rubenvb 2010-05-22 16:34:36

Answer 6

+1 A:

A few comments:

If s_buffer contains the file name to be opened, it should have a better name like s_filename.
The s_buffer member variable should not be reused to store temporary data from reading the file. A local variable in the function would do as well, no need for the buffer to be a member variable.
If there is not need to have the filename stored as a member variable it could just be passed as a parameter to readRecipe()
processNewlineEscapes() should check that the found backslash is at the end of the line before appending the next line. At the moment any backslash at any position triggers adding of the next line at the position of the backslash. Also, if there are several backslashes, find_last_of() would probably easier to use than find_first_of().
When checking the result of find_first_of() in processNewlineEscapes() and removeLeadingWhitespace() it would be cleaner to compare against string::npos to check if anything was found.

The logic at the end of removeLeadingWhitespace() could be simplified:

size_t sz_index = s_string.find_first_not_of(" \t");
if (sz_index != s_string.npos)
   s_string = s_string.substr(sz_index);
else // "empty" lines with only whitespace
   s_string.clear();

sth 2010-05-22 15:34:13

I'm changing names and rearranging my class right now. The backslash thing is what I want, everything after a backslash is ignored. I'm also using npos (didn't know how to use it before. You guys are awesome :D

rubenvb 2010-05-22 15:50:39

Answer 7

A:

You might wish to have a look at Boost.String. It's a simple collection of algorithms to work with streams, and notably features trim methods :)

Now, on to the review itself:

Don't bother to remove the hungarian notation, if it's your style then use it, however you should try and improve the names of methods and variables. processXXX is definitely not indicating anything useful...

Functionally, I am worried about your assumptions: the main issue here is that you do not care for espace sequences (\n uses a backslash for example) and you do not worry for the presence of strings of charachters: std::cout << "Process #" << pid << std::endl; would yield an invalid line because of your "comment" preprocessing

Furthermore, since you remove the comments before processing the newline escapes:

i = 3; # comment \
         running comment

will be parsed as

i = 3; running comment

which is syntactically incorrect.

From an interface point of view: there is not benefit in having the methods being class members here, you don't need an instance of RecipeReader really...

And finally, I find it awkward that two methods would read from the stream.

Little peeve of mine: returning by const value does not serve any purpose.

Here is my own version, as I believe than showing is easier than discussing:

// header file

std::vector<std::string> readRecipe(const std::string& fileName);

std::string extractLine(std::ifstream& file);

std::pair<std:string,bool> removeNewlineEscape(const std::string& line);
std::string removeComment(const std::string& line);

// source file

#include <boost/algorithm/string.hpp>

std::vector<std::string> readRecipe(const std::string& fileName)
{
  std::vector<std::string> result;

  ifstream file(fileName.c_str());
  if (!file) std::cout << "Could not open: " << fileName << std::endl;

  std::string line = extractLine(file);
  while(!line.empty())
  {
    result.push_back(line);
    line = extractLine(file);
  } // looping on the lines

  return result;
} // readRecipe


std::string extractLine(std::ifstream& file)
{
  std::string line, buffer;
  while(getline(file, buffer))
  {
    std::pair<std::string,bool> r = removeNewlineEscape(buffer);
    line += boost::trim_left_copy(r.first); // remove leading whitespace
                                            // based on the current locale
    if (!r.second) break;
    line += " "; // as we append, we insert a whitespace
                 // in order unintended token concatenation
  }

  return removeComment(line);
} // extractLine

//< Returns the line, minus the '\' character
//<         if it was the last significant one
//< Returns a boolean indicating whether or not the line continue
//<         (true if it's necessary to concatenate with the next line)
std::pair<std:string,bool> removeNewlineEscape(const std::string& line)
{
  std::pair<std::string,bool> result;
  result.second = false;

  size_t pos = line.find_last_not_of(" \t");
  if (std::string::npos != pos && line[pos] == '\')
  {
    result.second = true;
    --pos; // we don't want to have this '\' character in the string
  }

  result.first = line.substr(0, pos);
  return result;
} // checkNewlineEscape

//< The main difficulty here is NOT to confuse a # inside a string
//< with a # signalling a comment
//< assuming strings are contained within "", let's roll
std::string removeComment(const std::string& line)
{
  size_t pos = line.find_first_of("\"#");
  while(std::string::npos != pos)
  {
    if (line[pos] == '"')
    {
      // We have detected the beginning of a string, we move pos to its end
      // beware of the tricky presence of a '\' right before '"'...
      pos = line.find_first_of("\"", pos+1);
      while (std::string::npos != pos && line[pos-1] == '\')
        pos = line.find_first_of("\"", pos+1);
    }
    else // line[pos] == '#'
    {
      // We have found the comment marker in a significant position
      break;
    }
    pos = line.find_first_of("\"#", pos+1);
  } // looking for comment marker

  return line.substr(0, pos);
} // removeComment

It is fairly inefficient (but I trust the compiler for optmizations), but I believe it behaves correctly though it's untested so take it with a grain of salt. I have focused mainly on solving the functional issues, the naming convention I follow is different from yours but I don't think it should matter.

Matthieu M. 2010-05-22 17:17:03

Thanks for the remarks, I've already solved most of them. This things is part of a bigger picture, and I'd like this functionality in a class, it is C++, not plain C after all. I'm trying not to use boost, and in amount of code, the end result (see UPDATE) is shorter than your boost version. I'll give the newline/comment order some thought, but I was planning on every comment-line having a '#', much like C++ comments '//', so you're running comment is indeed a syntax error (and will be handled later, when I have the courage to write a syntax checker. Quotes and parenthesis are WIP :) Thanks

rubenvb 2010-05-22 18:40:21

ansaurus

tags:

views:

answers:

[C++] std::string manipulation: whitespace, "newline escapes '\'" and comments #

related questions