ansaurus

Question

Sed to remove underscores and promote character

Answer 1

+3 A:

Consider what happens if you use sed to search and replace all text like this. Without a C++ tokenizer to recognize identifiers (and specifically your identifiers, not those in the standard library, for example), you are screwed. push_back gets renamed to pushBack_, map::insert to map::insert_, map to map_, basic_string to basicString_, printf to printf_ (if you use C libraries), and so on. You're going to be in a world of hurt if you do it indiscriminately.

I don't know of any existing tool to automagically rename some_var_name to someVarName_ without the problems described above. People voted this post down probably because they didn't understand what I meant here. I'm not saying sed can't do it, I'm just saying it won't give you what you want to just use it as is. The parser needs contextual information to do this right, else it'll replace a lot more things it shouldn't than it should.

It would be possible to write a parser that would do this (ex: using sed) if it could recognize which tokens were identifiers (specifically your identifiers), but I doubt there's a tool specifically for what you want to do that does it off the bat without some manual elbow grease (though I could be wrong). Doing a simple search and replace on all text this way would be inherently problematic.
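To make the risk concrete, here is a small sketch of what a blind regex rename does to a line that mixes your own identifiers with standard-library ones. It assumes GNU sed; the function name naive_rename and the sample line are made up for illustration.

```shell
# naive_rename: hypothetical helper that blindly converts every
# lower_snake_case token on stdin to lowerCamelCase plus a trailing underscore.
# Assumes GNU sed (for -E and the \u case-conversion escape).
naive_rename() {
    sed -E -e 's/[a-z]+(_[a-z]+)+/&_/g' -e 's/_([a-z])/\u\1/g'
}

# Both the member variable and the std::vector method get renamed.
printf '%s\n' 'items.push_back(some_var_name);' | naive_rename
```

The regex has no way of knowing that push_back belongs to the standard library, so the output no longer compiles.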

However, Visual Assist X (which can optionally replace instances in documentation), or any other refactoring tool capable of smartly renaming every instance of an identifier, at least eases the burden of refactoring code this way quite considerably. If you have a symbol named some_var_name and it's referenced in a thousand different places in your system, with Visual Assist X you can use one rename function to rename all references smartly (this is not a mere text search and replace). Check out the refactoring features of Visual Assist X.

It might take 15 minutes to half an hour to refactor a hundred variables this way with VAX (faster if you use the hotkeys), but it certainly beats using a text search and replace with sed as described in the other answer and having all kinds of code replaced that shouldn't be replaced.

*[subjective]BTW: underscores still don't belong in camel case if you ask me. A lowerCamelCase naming convention should use lowerCamelCase. There are plenty of interesting papers on this, but at least your convention is consistent. If it's consistent, then that's a huge plus as opposed to something like fooBar_Baz which some goofy coders write who think it somehow makes things easier to make special exceptions to the rule.[/subjective]*

2010-06-29 01:11:03

To clarify, the naming convention shown is for member variables; the underscore at the end identifies them as such. I prefer this to m_varName or _varName. Also, I already have refactoring ability using Qt Creator, but I still don't fancy hand-changing 100 or so variables.

radman 2010-06-29 01:28:34

Unfortunately this is about the only reliable way I know of with existing tools to do this. You can't simply search and replace source files indiscriminately with sed or any other general regex tool without replacing many things you don't want replaced, and cleaning that up is generally going to be more time-consuming than using a refactoring tool like VAX to selectively rename everything.

2010-06-29 05:16:39

+1 I agree with you that _sed_ is dangerous. And that elbow grease is required.

Joseph Quinsey 2010-06-29 07:07:47

@Joseph yay thanks, I'm back at 0. :-D

2010-06-29 07:12:42

I take your point about the use of refactoring tools in this process. I just finished fixing the code with Vineet's answer and I did get a number of false positives. However, I got around these quite easily with a secondary find/replace to fix the commonly borked names (like the ones you mention: sharedPtr_, pushBack_, etc.) and a diff and merge before applying the changes. Overall the process was relatively painless. I think that the simple sed technique becomes more economical the more that needs to be changed, as in Joseph's case. To be fair to your answer, I really was looking for a sed solution.

radman 2010-06-29 07:21:57

@radman Ah, cheers, I'm glad you found a working solution! I suppose it depends on the code base. In my particular case, I've refactored code before based on changing coding standards, but our system consists of several thousand source files and a whole lot of external libraries used: platform-specific libraries, OpenGL libraries including glew, FBX, OpenImageIO, boost, C++ standard library, C standard library, etc. The number of false positives for our case would have been gigantic so selectively refactoring with VAX turned out to be a lot safer and less tedious.

2010-06-29 07:44:00

We've also had bad experiences previously from other developers trying to replace symbols in our system by brute search and replace with regular expressions so I'm kind of biased in this regard against such solutions normally.

2010-06-29 07:46:10

@stinky472: I see the OP is talking about only '100 or so variables'. If so, then _hand-changing_ these using a refactoring tool would take only _two hours_, at the rate of one change per minute. Or alternatively, using **sed** and then fixing up mistakes would be 'relatively painless'. In any case, my 'two day' solution is useless.

Joseph Quinsey 2010-06-29 08:31:55

Answer 2

+3 A:

sed -re 's,[a-z]+(_[a-z]+)+,&_,g' -e 's,_([a-z]),\u\1,g'

Explanation:

This is a sed command with 2 expressions (each in quotes after a -e.) s,,,g is a global substitution. You usually see it with slashes instead of commas, but I think this is easier to read when you're using backslashes in the patterns (and no commas). The trailing g (for "global") means to apply this substitution to all matches on each line, rather than just the first.

The first expression will append an underscore to every token made up of a lowercase word ([a-z]+) followed by a nonzero number of lowercase words separated by underscores ((_[a-z]+)+). We replace this with &_, where & means "everything that matched", and _ is just a literal underscore. So in total, this expression is saying to add an underscore to the end of every underscore_separated_lowercase_token.

The second expression matches the pattern _([a-z]), where everything between ( and ) is a capturing group. This means we can refer back to it later as \1 (because it's the first capturing group; if there were more, they would be \2, \3, and so on). So we're saying to match a lowercase letter following an underscore, and remember the letter.

We replace it with \u\1, which is the letter we just remembered, but made uppercase by that \u.

This code doesn't do anything clever to avoid munging #include lines or the like; it will replace every instance of a lowercase letter following an underscore with its uppercase equivalent.
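As a rough illustration of the two stages, the sketch below splits the command above into its two expressions so the intermediate result is visible. It assumes GNU sed; the function names and sample lines are invented for the example.

```shell
# Assumes GNU sed. The two expressions from the command above, split apart.
# Function names are illustrative only.
append_underscore() { sed -E 's,[a-z]+(_[a-z]+)+,&_,g'; }
promote_letters()   { sed -E 's,_([a-z]),\u\1,g'; }

line='int some_var_name = other_value;'
# Stage 1 only: each snake_case token gains a trailing underscore.
printf '%s\n' "$line" | append_underscore
# Stage 1 then stage 2: each _x becomes X, the trailing underscore survives.
printf '%s\n' "$line" | append_underscore | promote_letters

# An #include of a snake_case header is rewritten too.
printf '%s\n' '#include "my_header.h"' | append_underscore | promote_letters
```

The last line shows the munging mentioned above: the header name becomes myHeader_.h even though the file on disk has not been renamed.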

Vineet 2010-06-29 02:07:15

BTW, sed -i $filename is how you would invoke sed to edit $filename in place. So you could do, for example: "sed -i -r -e ... *.c"
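If you edit in place, it may be worth keeping backups so the result can be diffed. A minimal sketch, assuming GNU sed (where -i.bak saves the original as FILE.bak); the wrapper name rename_in_place is made up:

```shell
# rename_in_place: illustrative wrapper that applies the two substitutions
# to each file given, keeping FILE.bak as a backup of the original.
# Assumes GNU sed.
rename_in_place() {
    sed -i.bak -E -e 's,[a-z]+(_[a-z]+)+,&_,g' -e 's,_([a-z]),\u\1,g' "$@"
}
# Example: rename_in_place src/*.c
# Then review each change, e.g.: diff src/foo.c.bak src/foo.c
```

The paths in the example comments are placeholders.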

Vineet 2010-06-29 02:15:46

Thanks for the answer Vineet, you were first in with a workable solution and it worked exactly as requested. Also props for the clear explanation of the functioning of the Sed command.

radman 2010-06-29 07:24:01

Answer 3

+3 A:

A few years ago I successfully converted a legacy 300,000 LOC 23-year-old code base to camelCase. It took only two days. But there were a few lingering effects that took a couple of months to sort out. And it is a very good way to annoy your fellow coders.

I believe that a simple, dumb, sed-like approach has advantages. IDE-based tools, and the like, cannot, as far as I know:

change code not compiled via #ifdef's
change code in comments

And the legacy code had to be maintained on several different compiler/OS platforms (= lots of #ifdefs).

The main disadvantage of a dumb, sed-like approach is that strings (such as keywords) can inadvertently be changed. And I've only done this for C; C++ might be another kettle of fish.

There are about five stages:

1) Generate a list of tokens that you wish to change, and manually edit.
2) For each token in that list, determine the new token.
3) Apply these changes to your code base.
4) Compile.
5) Double-check via a manual diff, and do a final clean-up.

For step 1, to generate a list of tokens that you wish to change, the command:

cat *.[ch] | sed 's/\([_A-Za-z0-9][_A-Za-z0-9]*\)/\nzzz \1\n/g' | grep -w zzz | sed 's/^zzz //' | grep '_[a-z]' | sort -u > list1

will produce in list1:

st_atime
time_t
...

In this sample, you really don't want to change these two tokens, so manually edit the list to delete them. But you'll probably miss some, so for the sake of this example, suppose you keep these.
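A roughly equivalent way to build the candidate list, assuming a grep that supports -o and -h (GNU and BSD grep do), is sketched below. This is a different pipeline from the one above, not a transcription of it, and the function name list_candidates is made up.

```shell
# list_candidates: illustrative helper. Prints every distinct identifier-like
# token in the given files that contains an underscore followed by a
# lowercase letter. Assumes grep supports -o (print only matches) and -h
# (suppress file names).
list_candidates() {
    grep -ohE '[_A-Za-z0-9]+' "$@" | grep '_[a-z]' | sort -u
}
# Example: list_candidates *.[ch] > list1
```

Like the original pipeline, this also picks up tokens inside comments and strings, so the list still needs manual review.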

The next step, 2, is to generate a script to do the changes. For example, the command:

cat list1 | sed 's/\(.*\)/glob_sub "\\<\1\\>" xxxx_\1/;s/\(xxxx_.*\)_a/\1A/g;s/\(xxxx_.*\)_b/\1B/g;s/\(xxxx_.*\)_c/\1C/g;s/\(xxxx_.*\)_t/\1T/g' | sed 's/zzz //' > list2

will change _a, _b, _c, and _t to A, B, C, and T, to produce:

glob_sub "\<st_atime\>" xxxx_stAtime
glob_sub "\<time_t\>" xxxx_timeT

You just have to extend it to cover d, e, f, ..., x, y, z.

I'm presuming you have already written something like 'glob_sub' for your development environment. (If not, give up now.) My version (csh, Cygwin) looks like:

#!/bin/csh
foreach file (`grep -l "$1" */*.[ch] *.[ch]`)
  /bin/mv -f $file $file.bak
  /bin/sed "s/$1/$2/g" $file.bak > $file
end

(Some of my sed's don't support the --in-place option, so I have to use a mv.)
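For readers without csh, a rough POSIX sh equivalent might look like the following. It is a sketch, not a drop-in replacement: the optional third argument (a directory to search) is an addition for illustration, and the pattern and replacement are assumed to contain no '/' characters.

```shell
# glob_sub PATTERN REPLACEMENT [DIR]: sketch of the csh script above in
# POSIX sh. DIR defaults to the current directory (an added convenience).
# Assumes GNU grep/sed if PATTERN uses the \< and \> word boundaries.
glob_sub() {
    pattern=$1
    replacement=$2
    dir=${3:-.}
    # grep -l lists only the names of files that contain a match.
    grep -l "$pattern" "$dir"/*.[ch] "$dir"/*/*.[ch] 2>/dev/null |
    while IFS= read -r file; do
        # Keep the original as FILE.bak, then write the edited copy.
        mv -f "$file" "$file.bak"
        sed "s/$pattern/$replacement/g" "$file.bak" > "$file"
    done
}
# Example: glob_sub '\<st_atime\>' xxxx_stAtime
```

File names containing newlines are not handled, which should not matter for a typical source tree.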

The third step is to apply this script in list2 to your code base. For example, in csh use source list2.

The fourth step is to compile. The compiler will (hopefully!) object to xxxx_timeT. Indeed, it should likely object to just timeT, but the extra xxxx_ adds insurance. So for time_t you've made a mistake. Undo it with, e.g.:

glob_sub "\<xxxx_timeT\>" time_t

The fifth and final step is to do a manual inspection of your changes using your favorite diff utility, and then clean up by removing all the unwanted xxxx_ prefixes. Grepping for "xxxx_ will also help check for tokens in strings. (Indeed, adding a _xxxx suffix is probably a good idea.)

Joseph Quinsey 2010-06-29 05:05:25

+1 for showing how to use sed to actually build a proper solution. Note that manually filtering this list to opt out all the identifiers you don't want to replace may be more time-consuming than opting in to all the identifiers you do want to replace.

2010-06-29 05:19:30

@stinky472: Thank you for your comments. I was recollecting from five years ago. And I realize I omitted a key point. The issues with things such as time_t were _negligible_ --this was C, not BOOST. Rather, it was third-party header files used for messaging, and which were changed every few months. So we couldn't touch them. But we ran the first script over these header files to identify tokens which should _not_ be changed, and then used `uniq -u` to get the set difference: `cat a b b | sort | uniq -u` gives `a - b`. You could also apply this to /usr/include/ to get rid of time_t.
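The set-difference trick can be wrapped up as a small helper. This is only a sketch: the function name and the file names in the example are made up, and it assumes the first list has no duplicate lines (true for anything produced with sort -u).

```shell
# set_difference A B: prints the lines of file A that do not appear in file B.
# B is fed in twice so that every line of B occurs at least twice and is
# dropped by uniq -u; lines found only in A occur exactly once and survive.
set_difference() {
    cat "$1" "$2" "$2" | sort | uniq -u
}
# Example: set_difference list1 protected_tokens > list1.filtered
```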

Joseph Quinsey 2010-06-29 06:17:18

**Edit:** If you have a recent GNU sed, in the second step, rather than having 26 conversions of _a to A, _b to B, etc., you can use `s/\(xxxx_.*\)_\([a-z]\)/\1\u\2/g` to change _x to X, where x is from a to z.
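Because the .* in that expression is greedy, a single pass only converts the last _x in a token, so names with several underscores need the substitution repeated. One way to do that, sketched here with a sed loop (GNU sed assumed; the function name make_glob_sub_script is made up), is:

```shell
# make_glob_sub_script: illustrative helper. Reads tokens on stdin, one per
# line, and writes one glob_sub command per token with the new name prefixed
# by xxxx_. Assumes GNU sed for the \u escape.
make_glob_sub_script() {
    sed -e 's/.*/glob_sub "\\<&\\>" xxxx_&/' \
        -e ':again' \
        -e 's/\(xxxx_.*\)_\([a-z]\)/\1\u\2/' \
        -e 't again'
}
# Example: make_glob_sub_script < list1 > list2
```

The `t again` branch jumps back to the label whenever the substitution succeeded, so the loop stops once no _x remains after the xxxx_ prefix.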

Joseph Quinsey 2010-06-29 13:50:20
