I have an XML file whose contents look like this:

Input:
<abc a="1">
   <val>0.25</val>
</abc> 
<abc a="2">
    <val>0.25</val>
</abc> 
<abc a="3">
   <val>0.35</val>
</abc> 
 ...

Output:
<abc a="1"><val>0.25</val></abc> 
<abc a="2"><val>0.25</val></abc>
<abc a="3"><val>0.35</val></abc>

I have around 200K lines in a file in the Input format. How can I quickly convert it into the Output format?

A: 

An inelegant Perl one-liner that should do the trick, though not particularly quickly.

cat file | perl -e '
    $x=0;
    while(<>){
        s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/g;
        print;
        $x++;
        if($x==3){
            print"\n";
            $x=0;
        }
    }' > output
Mimisbrunnr
Instead of `cat file`, just use `<file`.
Arkku
@Arkku - would work just as well. It's an old habit of mine, and I'm more comfortable with cat $FILE |
Mimisbrunnr
@Mimisbrunnr: It fires up a useless `cat`, though. On some highly restricted systems there's a low limit on the number of simultaneous processes, which it counts toward. Also, it can be a significant slowdown if the process itself is a fast reader, e.g. try `cat /dev/zero | dd bs=1k count=1000` vs `dd bs=1k count=1000 </dev/zero`. I get 7.5MB/s with `cat` and 32.7MB/s without.
Arkku
(As a real-world example I've encountered, the parsing of a multi-gigabyte file cat-ed to the parser by a person also habitually using `cat |` proved to be a major slowdown in the process… =)
Arkku
@Arkku - Noted, thanks, I will discontinue the use of cat except where actually needed.
Mimisbrunnr
A: 

You can do this:

perl -e '$i=1; while(<>){chomp;$s.=$_;if($i%3==0){$s=~s{>\s+<}{><};print "$s\n";$s="";}$i++;}' file
codaddict
chomp is no good because it leaves behind too much whitespace, unless our asker is okay with that.
Mimisbrunnr
@Mimisbrunnr: if you look carefully I use a regex to get rid of the extra spaces.
codaddict
@codaddict - I apologize, I spoke before fully reading your code.
Mimisbrunnr
That also depends on whether the closing `</abc>` is always two lines after `<abc>`. Why not match on the actual pattern?
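A minimal sketch of keying on the closing tag rather than counting lines, written in Python for illustration. The function name `join_records` and its `close_tag` parameter are hypothetical, and it assumes records are not nested and that each closing tag ends its own line.

```python
def join_records(lines, close_tag="</abc>"):
    """Yield one joined line per record, ending each record at close_tag."""
    parts = []
    for line in lines:
        piece = line.strip()          # drop indentation and the newline
        if not piece:
            continue                  # ignore blank lines between records
        parts.append(piece)
        if piece.endswith(close_tag):
            yield "".join(parts)      # record complete: emit it as one line
            parts = []
    if parts:
        yield "".join(parts)          # unterminated trailing record, if any
```

Something like `for rec in join_records(open("file")): print(rec)` would then handle records with any number of inner lines.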
+1  A: 
$ awk '
    /<abc/ && NR > 1 {print ""}
    {gsub(" +"," "); printf "%s",$0}
' file
<abc a="1"> <val>0.25</val></abc>
<abc a="2"> <val>0.25</val></abc>
<abc a="3"> <val>0.35</val></abc>
ghostdog74
+1 You'll also want: `END {print ""}` to ensure the file ends with a newline.
glenn jackman
A: 
sed '/<abc/,/<\/abc>/{:a;N;s/\n//g;s|<\/abc>|<\/abc>\n|g;H;ta}'  file
A: 
tr "\n" " "<myfile|sed 's|<\/abc>|<\/abc>\n|g;s/[ \t]*<abc/<abc/g;s/>[ \t]*</></g'
+2  A: 

In vim you could do this with

:g/<abc/ .,/<\/abc/ join

This would leave a space between some of the elements, which you could then remove with

:%s/> *</></g

In general I would recommend using a proper XML parsing library in a language like Python, Ruby or Perl for manipulating XML files (I recommend Python+ElementTree), but in this case it is simple enough to get away with using a regex solution.
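For comparison, here is a minimal sketch of the parser-based route using Python's ElementTree. It rests on some assumptions: the file is a bare sequence of `<abc>` elements with no single root and no XML declaration (as in the question), so the text is wrapped in a made-up `<root>` element before parsing. The function name `flatten_records` is hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten_records(xml_text):
    """Return one compact line per top-level element in xml_text."""
    # The input has several top-level elements, so wrap it in a synthetic
    # root to make it well-formed for the parser (an assumption about the file).
    root = ET.fromstring("<root>" + xml_text + "</root>")
    lines = []
    for record in root:
        for node in record.iter():
            # Drop whitespace-only text inside a tag; keep real values like 0.25.
            if node.text is not None and not node.text.strip():
                node.text = None
            # tail is the text that follows a closing tag.
            if node.tail is not None and not node.tail.strip():
                node.tail = None
        lines.append(ET.tostring(record, encoding="unicode"))
    return lines
```

Calling `print("\n".join(flatten_records(open("file.xml").read())))` would write the records one per line. This reads the whole file into memory, which should be acceptable for roughly 200K lines.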

Dave Kirby
+1  A: 

Bash:

while read -r s; do echo -n "$s"; read -r s; echo -n "$s"; read -r s; echo "$s"; done < file.xml
pazhitnov
+1  A: 

You can record a macro. Start with the cursor at the beginning of the first line and press 'qa' (records a macro into the a register). Then press shift-V to begin line-wise visual mode, search for the closing tag with '/\/abc', and press shift-J to join the lines. Next, move the cursor to the following tag, probably with 'j^', and press 'q' to stop recording. You can then rerun the recording with '@a', or give a count such as 10000@a if you like. If the tags are different or do not follow each other directly, you just need to change how you find the opening and closing tags, for example by using searches.

Neg_EV
Obviously this is a vim based solution...
Neg_EV
+1  A: 

In Vim:

  • position on first line
  • qq: start recording macro
  • gJgJ: joins next two lines without adding spaces
  • j: go down
  • q: stop recording
  • N@q: N = number of lines (actually around 1/3rd of all lines as they get condensed on the go)
kemp
A: 

This should work in ex mode:

:%s/\(^<abc.*>\)^M^\(.*\)^M^\(^<\/abc>\).*^M/\1\2\3^M/g

It should leave extra spaces (or a tab) in front of the value, but you could remove them depending on what they are (\t or \ \ \ \ ).

What you are searching for here is (pattern1)[enter](pattern2)[enter](pattern3)[enter], and you are replacing it with (pattern1)(pattern2)(pattern3)[enter].

The ^M is done with ctrl+v CTRL+m
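The same three-group idea can be sketched outside the editor with Python's `re` module. This is only an illustration under stated assumptions: every record is exactly three lines, the last line ends with a newline, and the names `RECORD` and `collapse` are hypothetical. Unlike the command above, it also trims the indentation around the value.

```python
import re

# Three capture groups, one per input line, separated by newlines.
RECORD = re.compile(
    r"^(<abc[^\n]*>)[ \t]*\n"      # pattern1: opening tag line
    r"[ \t]*([^\n]*?)[ \t]*\n"     # pattern2: the indented value line
    r"(</abc>)[^\n]*\n",           # pattern3: closing tag line
    re.MULTILINE,
)


def collapse(text):
    """Rewrite each three-line record as pattern1 + pattern2 + pattern3 + newline."""
    return RECORD.sub(r"\1\2\3\n", text)
```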

+1  A: 
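sed '/^<abc/{N;N;s/\n *//g;s/ *$//}'

# remove each \n plus the indentation after it, then any trailing spaces
# (removing every space would also delete the one inside <abc a="1">)
# Result

<abc a="1"><val>0.25</val></abc>
<abc a="2"><val>0.25</val></abc>
<abc a="3"><val>0.35</val></abc>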