Hello, I'm working on huge text files (from 100 MB to 1 GB) and I have to parse them to extract some particular data. The annoying thing is that the files don't have a clearly defined separator.

For example:

"element" 123124 16758 "12.4" "element" "element with white spaces inside" "element"

I have to delete the white space inside strings delimited by " (double quotes). The problem is that I must not delete the white space outside the quotes (otherwise some numbers would merge). I can't find a decent sed solution; can someone help me with this?

+1  A: 

I can't come up with a sed solution; you might be better off just writing a small application to do this.

#include <iostream>
#include <string>
using namespace std;

int main() {
    string line;
    while(getline(cin,line)) {
        bool inquot = false;
        for(string::iterator i = line.begin(); i != line.end(); i++) {
            char c = *i;
            if (c == '"') inquot = !inquot;

            if (c != ' ' || !inquot) cout << c;
        }
        cout << '\n';  // '\n' instead of endl avoids a flush per line on huge files
    }
    return 0;
}

Then run it:

./a.out < test.log > new.out

DISCLAIMER

This will break if a line contains escaped quotes or if a quoted string spans multiple lines.

For instance, "The word \"word\" is weird" and things to that effect will cause problems.
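
One way to cope with the escaped-quote case is to remember whether the previous character was a backslash. Below is a minimal Python sketch of that idea. It assumes backslash is the escape character (the question doesn't say what the files use, so that is a guess), and the function name `strip_spaces_in_quotes` is made up for illustration. Quoted strings that span several lines are still not handled.

```python
def strip_spaces_in_quotes(line, quote='"', escape='\\'):
    """Remove spaces inside quoted strings, skipping backslash-escaped quotes.

    Assumption: a backslash escapes the character that follows it.
    """
    out = []
    inside = False   # currently between an opening and a closing quote?
    escaped = False  # was the previous character an unconsumed backslash?
    for ch in line:
        if escaped:
            # The character after a backslash never toggles the quote state.
            escaped = False
        elif ch == escape:
            escaped = True
        elif ch == quote:
            inside = not inside
        if ch == ' ' and inside:
            continue  # drop spaces only while inside quotes
        out.append(ch)
    return ''.join(out)
```

Calling it on the problem line from the disclaimer, `strip_spaces_in_quotes(r'"The word \"word\" is weird" 12 34')`, keeps the space between 12 and 34 and leaves the escaped quotes alone.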

Jamie Wong
+1  A: 

Like Jamie, I don't think sed is good for the job, though it could be that my sed skills are not good enough. Here is a solution that is essentially the same as Jamie's, but in Python:

#!/usr/bin/env python

# Script to delete spaces within the double quotes, but not outside.

QUOTE = '"'
SPACE = ' '

with open('data', 'r') as infile:
    for line in infile:
        line = line.rstrip('\r\n')
        kept = []
        inside_quote = False
        for char in line:
            if char == QUOTE:
                inside_quote = not inside_quote
            if not (char == SPACE and inside_quote):
                kept.append(char)
        # Join once per line instead of concatenating strings repeatedly.
        print(''.join(kept))

Save this script to a file, say rmspaces.py. You can then invoke the script from the command line:

python rmspaces.py

Note that the script assumes the data is in a file called data. You can modify the script to taste.

Hai Vu
+3  A: 

Use awk, not sed. There's certainly no need to write your own C program, since awk is already an excellent C program for file processing, even on gigabyte-sized files. Here's a one-liner to do the job:

$ more file
"element" 123124 16758 "12.4" "element" "element with white spaces inside" "element"

$ awk -F'"' '{for(i=2;i<=NF;i+=2) {gsub(/ +/,"",$i)}}1' OFS='"' file
"element" 123124 16758 "12.4" "element" "elementwithwhitespacesinside" "element"
ghostdog74
That solved my problem. One last request: can you please explain the code to me? Thank you very much (I'm not that familiar with awk).
Abaco
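With the double quote set as the field separator, the text inside quotes lands in the even-numbered fields, hence the `i` counter starts at 2 and increases by 2. `gsub()` replaces every run of spaces in those fields with the empty string. The trailing `1` is an always-true pattern that prints the line, and `OFS='"'` puts the quotes back when awk rebuilds the modified line. Please read the gawk manual (search for GNU awk) for more info.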
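
For readers who don't know awk, the same field-splitting idea can be sketched in Python. This is only an illustration of the technique, not a replacement for the one-liner; the function name `strip_spaces_by_fields` is made up. Note that awk numbers fields from 1 while Python lists start at 0, so awk's even fields become odd indexes here.

```python
def strip_spaces_by_fields(line):
    """Split on double quotes and remove spaces from the quoted fields."""
    fields = line.split('"')
    # awk's $2, $4, ... (text between quotes) are indexes 1, 3, ... in Python.
    for i in range(1, len(fields), 2):
        fields[i] = fields[i].replace(' ', '')
    # Joining with '"' plays the role of OFS='"' in the awk command.
    return '"'.join(fields)


sample = '"element" 123124 16758 "12.4" "element" "element with white spaces inside" "element"'
print(strip_spaces_by_fields(sample))
```

Like the awk version, this assumes the quotes on each line are balanced and never escaped.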
ghostdog74
+1 very clever solution.
Hai Vu