ansaurus

Question

How can I validate large numbers of files with search and replace?

Answer 1

+3 A:

Try this. It goes through your files (skipping anything under .svn), makes a .orig backup of each one (perl's -i.orig switch), and rewrites <img> and <input> tags, with or without a trailing slash, as <img /> and <input />.

find . \! -path '*.svn*' -type f -exec perl -pi.orig -e 's{ ( <(?:img|input)\b ([^>]*?) ) \ ?/?> }{$1\ />}sgxi' {} \;

Given input:

<img>  <img/>  <img src="..">  <img src="" >
<input>  <input/>  <input id="..">  <input id="" >

It changes the file to:

<img />  <img />  <img src=".." />  <img src="" />
<input />  <input />  <input id=".." />  <input id="" />

Here's what the regexp is doing:

s{(<(?:img|input)\b ([^>]*?)) # capture "<img" or "<input" followed by non-">" chars
  \ ?/?>}                     # optional space, optional slash, followed by ">"
{$1\ />}sgxi                  # replace with: captured text, plus " />"

Anirvan 2008-10-28 06:15:29

Answer 2

A:

See questions I asked in comment at top.

Assuming you're using GNU sed, and that you're trying to add the trailing / to your tags to make XML-compliant <img /> and <input />, replace the sed expression in your command with this one, and it should do the trick: '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}'

Here it is on a simple test file:

$ cat test.html
This is an <img tag> without closing slash.
Here is an <img tag /> with closing slash.
This is an <input tag > without closing slash.
And here one <input attrib="1" 
    > that spans multiple lines.
Finally one <input
  attrib="1" /> with closing slash.

$ sed -n '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}' test.html
This is an <img tag/> without closing slash.
Here is an <img tag /> with closing slash.
This is an <input tag /> without closing slash.
And here one <input attrib="1" 
    /> that spans multiple lines.
Finally one <input
  attrib="1" /> with closing slash.

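The GNU sed manual documents the regex syntax. The 1h;1!H;${;g;...;p;} wrapper is what makes multiline search/replace work: it collects the whole file in the hold space, then runs the substitution once, on the last line, against the full buffered text.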
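
For anyone puzzled by the buffering, here is the same script spread over several lines with a comment per command. This is only a sketch: slurp_fix is a made-up wrapper name, and it assumes GNU sed, as the rest of this answer does.

```shell
# slurp_fix is a hypothetical wrapper name; the sed script is the one-liner
# from this answer, reformatted. Assumes GNU sed (needed for \| in the pattern).
slurp_fix() {
  sed -n '
    # Line 1: copy the pattern space into the hold space.
    1h
    # Every other line: append it to the hold space after a newline.
    1!H
    # Last line: the hold space now contains the whole input.
    ${
      # Bring the buffered text back into the pattern space.
      g
      # [^>]* also matches newlines, so a tag may span several lines.
      s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g
      # -n suppressed automatic printing, so print the result explicitly.
      p
    }
  ' "$@"
}
```

Called as slurp_fix test.html, it should print the same output as the one-liner. With no file argument it reads standard input.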

Alternatively, you could use something like Tidy, which is designed for sanitizing bad HTML -- that's what I'd do if I were doing anything more complicated than a couple of simple search/replaces. Tidy's options get complicated fast, so it's usually better to write a script in your scripting language of choice (Python, Perl) that calls libtidy and sets whatever options you need.

joelhardi 2008-10-28 06:16:18
