I have a bunch of files that contain a semi-standard header. That is, the look of it is very similar but the text changes somewhat.

I want to remove this header from all of the files.

From looking at the files, I know that what I want to remove is encapsulated between similar words.

So, for instance, I have:

Foo bar...some text here...
more text
Foo bar...I want to keep everything after this point

I tried this command in perl:

perl -pi -e "s/\A.*?Foo.bar*?Foo.bar//simxg" 00ws110.txt

But it doesn't work. I'm not a regex expert, but I'm hoping someone knows how to remove a chunk of text from the beginning of a file based on a text match rather than a number of characters...

G-Man

A: 

Here you go! This applies a substitution to the first line of the file:


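use strict;
use warnings;
use Tie::File;

# Tie the file to an array: each element is one line of the file.
tie my @array, 'Tie::File', 'path_to_file' or die "can't tie the file: $!";
$array[0] =~ s/text_i_want_to_replace/replacement_text/gi;   # edit line 1 in place
untie @array;   # release the file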

You can operate on the array, and the modifications are written back to the file. Deleting an element from the array removes that line from the file, and applying a substitution to an element changes the text of that line.

If you want to delete the first two lines, and keep something from the third, you can do something like this:


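# tie the @array before this
shift @array;                         # drop the first header line
shift @array;                         # drop the second header line
$array[0] =~ s/foo bar\.\.\.//gi;     # strip the marker from the line you keep
# untie the @array

This does what you need, provided the header is exactly two lines long.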

Geo
+6  A: 

By default, ARGV (aka <> which is used behind-the-scenes by -p) only reads a single line at a time.

Workarounds:

  1. Unset $/, which tells Perl to read a whole file at a time.

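    perl -pi -e "BEGIN{undef$/}s/\A.*?Foo.bar.*?Foo.bar//simxg" 00ws110.txt

    BEGIN is necessary to have that code run before the first read is done. Note the added `.` before the second `*?`: as written in the question, `bar*?` applies the quantifier to the `r` alone instead of skipping the text between the two markers.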

  2. Use -0, which sets $/ = "\0".

    perl -pi -0 -e "s/\A.*?Foo.bar.*?Foo.bar//simxg" 00ws110.txt
    
  3. Take advantage of the flip-flop operator.

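    perl -ni -e "print unless 1 ... /^Foo.bar/" 00ws110.txt

    This skips printing from line 1 through the next line that matches /^Foo.bar/. The three-dot form does not test the right-hand pattern on the line where the range starts, so a header whose first line also matches does not end the range immediately.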

ephemient
-0 isn't as safe as -0777 which is guaranteed to put perl into slurp mode.
Chas. Owens
It's only unsafe on binary data. One would hope that *.txt are actually text files.
ephemient
Tried all three. Last one: perl -ni -e "print unless 1 ... /^Project.Gutenberg/" 00ws110.txt - still doesn't work tho. It prints nothing.
GeoffreyF67
Yes, but you never know when a stray null may wind up in a supposed text file. Why take the chance when you can hit 7 three times and be safe?
Chas. Owens
Last one works for me, on three different Perl installations. Are you sure the `//` matches on the last line of the header?
ephemient
Whew. Finally got it. Thanks!
GeoffreyF67
+3  A: 

If your header stretches across more than one line you must tell perl how much to read. If the files are small in comparison to memory you may want to just slurp the whole file into memory:

perl -0777pi.orig -e 's/your regex/your replace/s' file1 file2 file3

The -0777 option sets perl to slurp mode, so $_ will hold each whole file each time through the loop. Also, always remember to set the backup extension. If you don't, you may find that you have wiped out your data accidentally and have no way to get it back. See perldoc perlrun for more information.

Given information from the comments, it looks like you are trying to strip all of the annoying stuff from the front of a Project Gutenberg ebook. If you understand all of the copyright issues involved, you should be able to get rid of the front matter like this:

perl -ni.orig -e 'print unless 1 .. /^\*END/' 00ws110.txt

The Project Gutenberg header ends with

*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*

A safer regex would take into account the *END* at the end of the line as well, but I am lazy.

Chas. Owens
+2  A: 

I might be misinterpreting what you're asking for, but it looks that simple to me:

perl -ni -e 'print unless 1..($. > 1 && /^Foo bar/)'
depesz
Or just use `1.../^Foo bar/` (notice: triple dot, not double) instead of testing `$.`.
ephemient
A: 

I'm looking for the opposite. I want to pipe the output of a Linux command into a perl command line that will only display information from the beginning of the output until the first occurrence of a regular expression. Any ideas?

Pete