I'm using one-off perl -pi -e commands to do simple search and replace from within a bash script. Most of my regexes work fine, until I get to these:

perl -pi -e 's#\<\?mso-application.*\<Table.*Rows="1"\>#\<Table\>#s' 1.xml
perl -pi -e 's#\</Table.*#\</Table\>#s' 1.xml

Please don't mind the # marks instead of slashes; I didn't want to escape even more characters. These regexes are supposed to delete chunks of an XML file exported from Excel, but they aren't working. I suspect this is because I'm using logic that applies to strings and trying to apply it to a file (though I admit I have only a basic understanding of Perl's in-place editing).

Is there an alternative way to do this (whether in perl, awk, or sed) that can be issued from within a shell script?

+4  A: 

You have perl set up in line-by-line processing mode, but chances are the patterns you are trying to match span multiple lines. You will need to expand your Perl scripts to read in the entire file and then run the regexes against the whole text.

Eric Strom
+4  A: 

I would recommend that you give up the notion of editing XML files on the command line using regexes and use a proper XML parser instead.

Sinan Ünür
Understood, and generally I would, but the circumstances of this situation make it necessary, unfortunately.
Interwebs
@Interwebs: how so?
Ether
@Ether: In that I have to take the output of a tool that doesn't actually generate valid XML (that I had nothing to do with, btw) and make it valid. It's pretty much been decided that nobody's going to bother fixing the tool. And seeing as it's not valid, I can hardly parse it.
Interwebs
@Interwebs: ugh, I feel your pain. Any chance that you can stop using this broken tool and switch to something functional going forward?
Ether
+2  A: 

A couple of things:

  • Avoid using regexes to manipulate XML files because there are better tools for the job. Consider the XML::Simple or XML::Twig modules for the same task.
  • Seeing that you have multiple search-and-replace operations, replace the one-liners with a proper Perl script and call that from your Bash script instead.
Zaid
+2  A: 

From the command line, add the -0777 flag to make perl read the entire file (and make sure you have the /s regex flag to make . match newlines, which you do). So:

perl -pi -0777 -e 's#\<\?mso-application.*\<Table.*Rows="1"\>#\<Table\>#s' 1.xml
perl -pi -0777 -e 's#\</Table.*#\</Table\>#s' 1.xml
ysth
You will also need to add `/g` if the pattern can appear more than once in a file.
Ven'Tatsu