Hi.

I have 2 files that are generated elsewhere. The first one is "what to search for", and the second one is the replacement. Both files are huge, about 2-3 MB each.

I need to write a bash script that takes an even bigger file (about 200-300 MB) and replaces all occurrences of file1's contents with file2's contents.

The problem is that file1 and file2 can contain any possible characters, including regexp special symbols.

How can I solve this problem using sed?

Thanks in advance.

A: 

I don't know about sed but in Perl you could do (off the top of my head, untested):

perl -0777 -pe 'BEGIN{local $/ = undef; open FROM, "<", shift @ARGV; $from = <FROM>; open TO, "<", shift @ARGV; $to = <TO>} s/\Q$from\E/$to/sog' file1 file2 bigger-file > new-bigger-file

If you're interested in trying Perl, I could try testing it for you tomorrow.

But it sucks the entire bigger-file into memory because it ignores line breaks so that your search text can span multiple lines. This means that it uses quite a lot of memory!

This answer assumes that the search file is one long search string over multiple lines which must be matched in its entirety rather than a number of separate search strings, any of which can be matched.
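
For comparison, here is the same whole-file idea as a minimal Python sketch (Python is my substitution here, not something the one-liner above uses). The function name and its parameters are hypothetical. It treats all three files as raw bytes, so no regex escaping is needed, but it has the same drawback: the whole bigger-file sits in memory, plus a second copy for the result.

```python
from pathlib import Path


def replace_in_memory(search_path, replacement_path, big_path, out_path):
    """Replace every occurrence of one file's contents with another's.

    Everything is handled as bytes, so regex metacharacters in the
    search or replacement files have no special meaning.
    """
    search = Path(search_path).read_bytes()
    replacement = Path(replacement_path).read_bytes()
    if not search:
        raise ValueError("search file must not be empty")

    # The whole input is loaded at once; bytes.replace builds a second
    # full-size copy, so peak memory is roughly twice the input size.
    data = Path(big_path).read_bytes()
    Path(out_path).write_bytes(data.replace(search, replacement))
```

It would be called with the same file order as the Perl command, e.g. replace_in_memory("file1", "file2", "bigger-file", "new-bigger-file").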

Adrian Pronk
Yes, that's why I thought sed was the optimal solution as it does not need to load everything into memory, it operates on streams.
Max
Well, Perl can operate line by line just like sed, but that isn't useful if you're replacing 2-3 MB chunks at a time, which is presumably more than 1 line.
Adrian Pronk
+1  A: 

Since you don't actually need regular expressions, just direct string matching, sed is overkill. What you're really looking for is a fixed-string (maybe even binary) stream editor. Unfortunately, I don't know of one... I hate to suggest possibly reinventing a wheel, but you could write something fairly quickly in C that'd do what you want. A rough draft outline:

  • read search-file into memory
  • create a buffer of the same size as search-file
  • read from stdin (or input-file) into buffer.
    • For each character, if it does not match the parallel character from search-file, shift the buffer. To find out how much to shift it by, read until you find a match to the first character of search-file, then check to see if the rest matches, repeating until you've found a partial match to search-file (or gotten to the end of the buffer). When you shift, print all the non-matching characters to stdout (or output-file).
    • If the buffer ever fills up, i.e. totally matches search-file, print replacement-file to stdout (or output-file). Depending on memory vs. speed, you can keep replacement-file in memory or read it from disk each time.
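
The outline above can be sketched roughly as follows. This is in Python rather than C to keep it short, and it is a simplified variant, not a literal transcription: instead of shifting the buffer character by character, it reads fixed-size chunks, lets bytes.find do the matching, and carries over the last len(search) - 1 bytes so that a match straddling two chunks is still found. The name stream_replace and the chunk_size parameter are made up for illustration.

```python
def stream_replace(src, dst, search, replacement, chunk_size=1 << 20):
    """Copy src to dst, replacing each non-overlapping occurrence of search.

    src and dst are binary file objects; search and replacement are bytes.
    Only about chunk_size + len(search) bytes of input are held at once.
    Returns the number of replacements made.
    """
    if not search:
        raise ValueError("search string must not be empty")

    keep = len(search) - 1  # longest tail that could still begin a match
    buf = b""
    count = 0

    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk

        pos = 0
        while True:
            hit = buf.find(search, pos)
            if hit < 0:
                break
            dst.write(buf[pos:hit])   # non-matching bytes before the hit
            dst.write(replacement)
            pos = hit + len(search)
            count += 1

        # Flush everything that cannot be the start of a later match and
        # keep only the tail for the next round.
        safe_end = max(pos, len(buf) - keep)
        dst.write(buf[pos:safe_end])
        buf = buf[safe_end:]

    dst.write(buf)  # leftover tail contains no complete match
    return count
```

To use it on real files, open everything in binary mode, read file1 and file2 fully into bytes (they are only a few MB), and pass the opened bigger-file and output file as src and dst. Because the matching is plain byte comparison, regex special characters need no escaping at all.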

You could also attempt to automatically escape all regex characters from your input file. This could be done with a horribly ugly list of sed substitutions, like

sed -e 's/\\/\\\\/g' -e 's@/@\\/@g' -e 's/\[/\\[/g' ...

(make sure you do the \ one first!)
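
If the chain of substitutions gets too ugly, the escaping can be done in one pass by a small helper. Below is a minimal Python sketch; the function names are hypothetical. It assumes sed's basic regular expressions with / as the delimiter and single-line strings only: embedded newlines are not handled, which matters here because sed works line by line and the search file presumably spans many lines.

```python
import re


def sed_escape_pattern(text):
    """Backslash-escape the characters that are special in a sed BRE
    pattern (. [ \\ * ^ $) plus the / delimiter. Input and output are bytes."""
    return re.sub(rb"[.\[\\*^$/]", lambda m: b"\\" + m.group(0), text)


def sed_escape_replacement(text):
    """Backslash-escape the characters that are special on the replacement
    side of s/// (\\ and &) plus the / delimiter."""
    return re.sub(rb"[\\/&]", lambda m: b"\\" + m.group(0), text)
```

Since each character is visited exactly once, the ordering problem with the backslash substitution does not come up.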

Jefromi
+1  A: 

Maybe have a look at chgrep:

http://www.bmk-it.com/projects/chgrep/

Cheers,

gregx