views:

31

answers:

2

Hello all,

I have a file that uses the pipe character as a separator since I assumed that most of my clients wouldn't use the pipe character in their data. Apparently I was wrong, but I compensated for this by specifying that any fields using the pipe character need to be wrapped in double quotes. Unfortunately, they haven't done this, but I can't have them re-export their file just yet because I need their old data, so I need to manually update their old data file to properly quote the field that has pipe characters in it.

Some of the lines are correct with 16 separators and thus 16 fields whilst some of the lines have 18 separators for only 16 fields. I need to wrap the 10th field with quotation marks only if it has 18 separators instead of 16. I have a Linux box at my disposal with awk, sed, grep, etc. and would love to have a way to script this so I don't have to do it manually.

A simplified example input/output with only 5 and 7 separators (quoting the 4th field) would be:

# Input

Field 1|Field 2|Field 3|Field 4|Field 5|Field 6
Field 1|Field 2|Field 3|Field | with | pipes|Field 5|Field 6

# Output

Field 1|Field 2|Field 3|Field 4|Field 5|Field 6
Field 1|Field 2|Field 3|"Field | with | pipes"|Field 5|Field 6

# Optional output if it is easier

Field 1|Field 2|Field 3|"Field 4"|Field 5|Field 6
Field 1|Field 2|Field 3|"Field | with | pipes"|Field 5|Field 6

Any help would be enormously appreciated!

+1  A: 

For your sample data:

sed -i '/\([^|]*|\)\{7\}/{s/\([^|]*|\)/"\1/4;s/\(|[^|]*\)/"\1/6}' inputfile

For your real data:

sed -i '/\([^|]*|\)\{17\}/{s/\([^|]*|\)/"\1/14;s/\(|[^|]*\)/"\1/16}' inputfile

Edit:

(I added a pair of missing braces in each example so the second s command (actually both) only operate when the address matches. I also removed the -n and the p. Removing the p eliminates the duplication. Sorry for the errors.)

The part before the s command is called an "address". It selects only the lines that have 7 (or 17) pipe characters, thus excluding the s command from operating on lines with a different number of pipe characters.

  • // - The delimiters for the address
  • \(\) - grouping parentheses (escaped)
  • [^|]* - zero or more (*) non-pipe (^|) characters ([] - character list delimiters)
  • | - and the pipe character we're interested in
  • \{7\} - repeat the group seven times
  • {command; command} - these braces delimit a block of commands that will be executed when the address is matched - together the address and the braces act like an if statement and its associated block

So that address matches lines that have seven groups of zero or more non-pipe characters that are each followed by a pipe character.

Then the first s command says to replace the 4th (or 14th) pipe character and its preceding non-pipe characters with a quotation mark followed by the matched characters.

The semicolon is a command separator. Some versions of sed require that, instead of semicolons, you use the `sed -e 'command' -e 'command' form for multicommand one-liner scripts.

By the way, the s is a command and not part of the regex. The part between the initial pair of slashes in the commands above and the part between the initial pair of slashes in an s/// command are regexes.

Please let me know if you have additional questions. The second s command looks for the 6th (or 16th) pipe character and the zero or more non-pipes that follow it and replaces that with itself (the matched characters) and a quotation mark.

Dennis Williamson
Thanks very much! Can you explain it a tad bit? I've never seen a RegExp that has anything before the s/ . Also, does the semicolon allow for more than one sed expression to be executed or is that just part of the RegExp? And lastly, what is the /14 and /16 for? Thanks again!
Topher Fangio
Sorry, one last question: this doesn't actually change the file, it just outputs the correct lines. When I replaced `sed -n` with `sed -i` (for inline) it worked, but it duplicated the fixed lines. How would you suggest I actually run this to fix the lines in the existing file?
Topher Fangio
@Topher: See my edit. Briefly, the `14`, `16`, `4` or `6` are qualifiers for the `s///` command that select the `nth` occurrence of the regex (between the first pair of slashes in the `s` command) to perform the substitution on. The semicolon is a command separator. Remove the `p` at the end to eliminate the duplication and change the `-n` to a `-i`. Be sure to note the addition of a pair of braces.
Dennis Williamson
+1  A: 

KISS. When you are working with distinct fields and field separators, use a tool like awk which is totally meant for the job.

$ cat file
Field 1|Field 2|Field 3|Field 4|Field 5|Field 6
Field 1|Field 2|Field 3|Field | with | pipes|Field 5|Field 6

$ awk -F"|" 'NF>6{$4="\042"$4 ; $(NF-2)=$(NF-2)"\042";}1' OFS="|"  file
Field 1|Field 2|Field 3|Field 4|Field 5|Field 6
Field 1|Field 2|Field 3|"Field | with | pipes"|Field 5|Field 6

the above says, if the number of fields are more than 6 (ie NF>6), then add a double quote (\042) to the 4th field, as well as the last 2nd field. (or change it accordingly depending on your data. )

No need to use complicated regular expression.

ghostdog74
+1 - Thanks for this answer. This is what I was hoping someone would post. It's quite elegant and makes a tad bit more sense to me than the sed solution above. I decided to go ahead and accept this as the solution since it is the most elegant and most correct.
Topher Fangio