A: 

Are the stubs identical to the end of the previous file? Or different line endings/OCR mistakes?

Is there a way to discern an article's beginning? Maybe an indented abstract? Then you could go through each file and discard everything before the first and after (including) the second title.
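
For illustration only (this sketch is mine, not Tobias's), here is how that "keep only what lies between the first and second title" idea might look in Perl; the title regex is a guess at what could mark an article's beginning, not something established in the thread:

#!/usr/bin/perl
# Hedged sketch: keep only the text from the first title marker up to (but not
# including) the second one. $title_re is hypothetical -- substitute whatever
# reliably marks an article's beginning (indented abstract, all-caps line, etc.).
use strict;
use warnings;

my $title_re = qr/^[A-Z][A-Z .,'-]+$/;   # assumption: a title is an all-caps line

my @kept;
my $titles_seen = 0;
while ( my $line = <> ) {
    $titles_seen++ if $line =~ $title_re;
    last if $titles_seen == 2;              # stop at the second title
    push @kept, $line if $titles_seen;      # discard everything before the first
}
print @kept;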

Tobias
The OCR is good, so they are for all practical purposes identical. See the comment on the OP about what marks the beginning of each article proper; I used all 300 characters there. :)
fdsayre
A: 

Post a sample file, or at least the top and bottom of a file so we can see what we're dealing with. An example is worth 1000^2 words.

David Plumpton
A: 

Are the title & author always on a single line? And does that line always contain the word "BY" in uppercase? If so, you can probably do a fair job with awk, using those criteria as the begin/end marker.

Edit: I really don't think that using diff is going to work as it is a tool for comparing broadly similar files. Your files are (from diff's point of view) actually completely different - I think it will get out of sync immediately. But then, I'm not a diff guru :-)
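
Purely as an illustration (my sketch, not part of the original answer), the same begin-marker idea in Perl rather than awk; an awk script would be structured the same way, and the article_NNN.txt output names are made up:

#!/usr/bin/perl
# Hedged sketch: start a new output file whenever a line looks like an
# all-caps title/byline containing "BY".
use strict;
use warnings;

my ( $n, $out ) = ( 0, undef );
while ( my $line = <> ) {
    if ( $line =~ /\bBY\b/ and $line eq uc $line ) {   # assumed begin marker
        $n++;
        open $out, '>', sprintf( 'article_%03d.txt', $n )
            or die "cannot write article_$n: $!";
    }
    print {$out} $line if $out;   # anything before the first marker is dropped
}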

anon
The title and author's name are usually on separate, following lines in all caps. Unfortunately these do not always mark the beginning of the article; for example, some articles (reviews) do not have title/author names, so something with DIFF would probably work best.
fdsayre
+2  A: 
Norman Ramsey
Well I'm glad I didn't miss some obvious answer but "nontrivial" doesn't sound good. This script looks good. Unfortunately it exits with "no overlap" right now. I've uploaded a sample of files to: http://dl.getdropbox.com/u/239647/bul_9_5_181.txt and http://dl.getdropbox.com/u/239647/bul_9_5_186.txt
fdsayre
Two issues: your files are inconsistent in how they use ^L, and my overlap detector needs to be improved. How long do the files get?
Norman Ramsey
The largest file will be around 250KB, but that is abnormal. The vast majority are under 100KB. The mean is probably 30KB. These are all academic articles, so while a few are large reports, most are a couple pages. Thanks.
fdsayre
OK, I have improved things to the point where it works on your two test files. It finds at most 40 lines of overlap. Let me know how it goes....
Norman Ramsey
This looks really good. Just tested on a few files, but will give it a workout tonight/tomorrow morning. THANKS.
fdsayre
Strange. Works fine on a directory with a couple dozen files; when scaling up it gives:

lua: ./s1.lua:71: attempt to index field '?' (a nil value)
stack traceback:
    ./s1.lua:71: in function 'split_overlap'
    ./s1.lua:88: in function 'strip_overlaps'
    ./s1.lua:102: in main chunk
    [C]: ?
fdsayre
This looks _way_ more complicated than it needs to be.
MarkusQ
@fdsayre: my bad, line 68 should loop to #l-1, not to #l
Norman Ramsey
@MarkusQ: Strangers solve problems for free writing code late at night, and all you can do is complain? :-) Test your code, then we'll talk :-)
Norman Ramsey
@Norman Ramsey -- I wasn't complaining, just kibitzing. When I'm writing code for free late at night I always like to go with the simplest solution possible.
MarkusQ
I just like people who write code late at night! @MarkusQ: I'm testing this now.
fdsayre
Oops... @Ramsey: I am testing this now.
fdsayre
@MarkusQ: You're a better man. When it's late at night, I can't make anything simple. The 'have I seen it before' test is ingenious. Wrong, but it will never be wrong on a real input.
Norman Ramsey
@Norman Ramsey -- My program is now tested (and an embarrassing typo fixed) as per your guilt trip above. Thanks for goading me.
MarkusQ
@MarkusQ Yeah, the slashes confused me at first too.
fdsayre
I'm executing this as ./s1.lua *.txt -- is that correct?
fdsayre
Weird... the script doesn't seem to be overwriting/changing the files anymore...
fdsayre
Re:slashes. I'm multitasking and evidently don't have good enough wetware trapping of cross project memory contamination.
MarkusQ
@Ramsey: it's returning "lua: ./s1.lua:49: '=' expected near 'for'". I would love to get this to work, as so far this script returns the best results.
fdsayre
@Sayre: maybe we have a transcription error -- what's on line 49? I've put a current version in http://www.cs.tufts.edu/~nr/drop/rm-overlaps. Maybe you can post a zip file containing your texts? Or are they proprietary?
Norman Ramsey
Added two volumes of data at http://drop.io/fdsayre . The data in vol1 seems to run fine (although the script doesn't seem to consistently change the original files; maybe I'm executing it wrong). The data in vol27 seems to exit with the same error as above.
fdsayre
@Ramsey Thanks for all your help.
fdsayre
+3  A: 

You should probably try something like this (I've now tested it on the sample data you provided):

#!/usr/bin/ruby

class A_splitter
    # Splits a run of OCR'd text files into one output per article, dropping
    # the leading lines of each file that were already seen in the previous
    # file (the duplicated stub).
    Title   = /^[A-Z]+[^a-z]*$/    # an all-caps line (article or journal title)
    Byline  = /^BY /               # author line, e.g. "BY JOHN SMITH"
    Number = /^\d*$/               # a bare page number (possibly blank if the OCR missed it)
    Blank_line = /^ *$/
    attr_accessor :recent_lines,:in_references,:source_glob,:destination_path,:seen_in_last_file
    def initialize(src_glob,dst_path=nil)
        @recent_lines = []
        @seen_in_last_file = {}
        @in_references = false
        @source_glob = src_glob
        @destination_path = dst_path
        @destination = STDOUT
        @buffer = []
        split_em
        end
    def split_here
        # Close the current article; when writing to STDOUT, emit a marker
        # line that csplit can cut on instead.
        if destination_path
            @destination.close if @destination
            @destination = nil
          else
            print "------------SPLIT HERE------------\n"
          end
        print recent_lines.shift
        @in_references = false
        end
    def at_page_break
        # A page break looks like title / blank / page number, or the reverse.
        ((recent_lines[0] =~ Title  and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
         (recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
        end
    def print(*args)
        # Send output to the open destination, or buffer it until the next
        # file opens one.
        (@destination || @buffer) << args
        end
    def split_em
        # Walk the files in order; within each file, walk the lines in order.
        Dir.glob(source_glob).sort.each { |filename|
            if destination_path
                @destination.close if @destination
                @destination = File.open(File.join(@destination_path,filename),'w')
                print @buffer
                @buffer.clear
              end
            in_header = true
            File.foreach(filename) { |line|
                line.gsub!(/\f/,'')
                if in_header and seen_in_last_file[line]
                    # skip it: a leading line duplicated from the previous file
                  else
                    seen_in_last_file.clear if in_header
                    in_header = false
                    recent_lines << line
                    seen_in_last_file[line] = true
                  end
                3.times {recent_lines.shift} if at_page_break   # drop the page-break header
                if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
                    split_here
                  elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
                    split_here
                  elsif recent_lines.length > 4
                    @in_references ||= recent_lines[0] =~ /^REFERENCES *$/
                    print recent_lines.shift
                  end
                }
            }
        print recent_lines
        @destination.close if @destination
        end
    end

A_splitter.new('bul_*_*_*.txt','test_dir')

Basically, run through the files in order, and within each file run through the lines in order, omitting from each file the lines that were present in the preceding file and printing the rest to STDOUT (from which it can be piped), unless a destination directory is specified ('test_dir' in the example; see the last line), in which case files are created in that directory with the same name as the file which contained the bulk of their contents.

It also removes the page-break sections (journal title, author, and page number).

It does two split tests:

  • a test on the title/byline pair
  • a test on the first title-line after a reference section

(it should be obvious how to add tests for additional split-points).

Retained for posterity:

If you don't specify a destination directory it simply puts a split-here line in the output stream at the split point. This should make it easier for testing (you can just less the output) and when you want them in individual files just pipe it to csplit (e.g. with

csplit -f abstracts - '/SPLIT HERE/' '{*}'

or something) to cut it up.

MarkusQ
This looks interesting. If by "split" you mean keeping the files separate then I do need them split. I did a quick test and it seems to work, but without keeping each article intact it's difficult to compare. Thanks.
fdsayre
I'll update it to do the split too.
MarkusQ
So this uses title and byline to determine the proper start (and thus end) of each article? If so, unfortunately it won't work, as there is no specific field that uniquely identifies the starting point of every article. Some use title/byline but others (reviews, etc.) do not have an author field.
fdsayre
Thus I believe that the script needs to compare the beginnings/ends of each file with the next file and remove one set of duplicates. The stubs should only be in the approx. 1/2 page on either side and only shared with the previous/next file. Sorry this is so complex; it may not be solvable.
fdsayre
Clever to ignore order on the duplicate lines. Works with high probability.
Norman Ramsey
@fdsayre -- It could be any number of tests; the title and byline are just examples. I'll add another pattern I've noticed, as an example.
MarkusQ
Okay, so just so I understand: title/byline/ref are only used to determine splits, not to determine what to remove, right? The actual removal is done via a test for duplicate lines between files. I have no experience with Ruby (or Lua) but it looks easy to add split points.
fdsayre
Ahhh, sorry man. I need to keep the original filenames intact, and as far as I can tell that's impossible with the split... damn. Thanks for all your help.
fdsayre
@fdsayre -- Impossible? No. Not even difficult. Edit coming...
MarkusQ
Are the split tests required if the script outputs to the original filename? I ask because the splits happen in many different ways (no coherent pattern) and right now the outputted files are not working with high probability...
fdsayre
Yes, the split tests are needed. If there is no coherent pattern (which I doubt, having spotted two patterns in the data you provided) you are out of luck, since there's no way to automate such a task. Post an example of something that doesn't split right and I'll see if I can spot a pattern.
MarkusQ
@MarkusQ Can I send you some data via email?
fdsayre
@fdsayre -- It would be better if you could just post them somewhere--that way others could see them too.
MarkusQ
Added two volumes of data at http://drop.io/fdsayre
fdsayre
@fdsayre -- I see them, but it doesn't help much. The problem is you are wanting it to split at some points that apparently aren't obvious, but what those points are _isn't_ obvious.
MarkusQ
@fdsayre -- Maybe if you could find some places where you think it should be split and post several of those (as an edit to the question) someone could spot the additional pattern(s). But the raw data doesn't help much without knowing what you are wanting it to do.
MarkusQ
@fdsayre -- Also, at least some of those appear (to me at least) to be one file per page, not one file per article, so it may be that you are wanting to _join_ files as well as split them.
MarkusQ
The split thing is a problem. I don't think there are any coherent split patterns between all files. The "stubs" are because many articles end and/or start mid-page, and when that happens that page is duplicated in both articles' original files.
fdsayre
Unfortunately I need to keep the original file names/content intact (this is a more important requirement than 100% accuracy on the removal of duplicate data, or for that matter, which copy of the duplicate data is removed).
fdsayre
It seems to be a complex problem but I originally thought group programming tools (diff, etc.) would help. I may have to deal with this problem statistically by estimating the amount of duplicate data and correcting my results appropriately, but obviously I would rather proceed empirically.
fdsayre
@MarkusQ The only thing I can think of - if removing the duplicate data and the split process are separate - is adding a split point to the top of each file before running the script, which would allow putting the files back together again with perfect accuracy once the dup. data is removed.
fdsayre
@MarkusQ Thanks for all your help
fdsayre
+2  A: 

Here is the beginning of another possible solution in Perl (it works as is but could probably be made more sophisticated if needed). It sounds as if all you are concerned about is removing duplicates across the corpus and don't really care if the last part of one article is in the file for the next one, as long as it isn't duplicated anywhere. If so, this solution will strip out the duplicate lines, leaving only one copy of any given line in the set of files as a whole.

You can either just run the script in the directory containing the text files with no argument, or alternately specify a file name containing the list of files you want to process in the order you want them processed. I recommend the latter, as your file names (at least in the sample files you provided) do not naturally list out in order when using simple commands like ls on the command line or glob in the Perl script. Thus it won't necessarily compare the correct files to one another as it just runs down the list (entered or generated by the glob command). If you specify the list, you can guarantee that they will be processed in the correct order, and it doesn't take that long to set it up properly.
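
As an aside (my addition, not part of this answer), one hedged way to build that ordered list in Perl itself, assuming the bul_<volume>_<issue>_<page>.txt naming used by the sample files:

#!/usr/bin/perl
# Hypothetical helper: print the bul_*.txt names in true numeric order,
# suitable for saving as the file list the script below expects.
use strict;
use warnings;

my @ordered = sort {
    my @x = $a =~ /bul_(\d+)_(\d+)_(\d+)\.txt$/;
    my @y = $b =~ /bul_(\d+)_(\d+)_(\d+)\.txt$/;
    $x[0] <=> $y[0] or $x[1] <=> $y[1] or $x[2] <=> $y[2];
} glob "bul_*.txt";

print "$_\n" for @ordered;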

The script simply opens two files and makes note of the first three lines of the second file. It then opens a new output file (original file name + '.new') for the first file and writes out all the lines from the first file into the new output file until it finds the first three lines of the second file. There is an off chance that the first three lines of the second file do not appear in the first one, but in all the files I spot-checked they did, because of the journal-name header and page numbers. One line definitely wasn't enough, as the journal title was often the first line and that would cut things off early.

I should also note that the last file in your list of files entered will not be processed (i.e. have a new file created based off of it) as it will not be changed by this process.

Here's the script:

#!/usr/bin/perl
use strict;

# Build the list of files to process: either read it from the file named on
# the command line (one name per line, in processing order) or glob for
# bul_*.txt in the current directory.
my @files;
my $count = @ARGV;
if ($count>0){
    open (IN, "$ARGV[0]");
    @files = <IN>;
    close (IN);
    chomp @files;
} else {
    @files = glob "bul_*.txt";
}
$count = @files;
print "Processing $count files.\n";

my $lastFile="";
foreach(@files){
    if ($lastFile ne ""){
        print "Processing $_\n";
        open (FILEB,"$_");
        my @fileBLines = <FILEB>;
        close (FILEB);
        # The first three lines of the current file mark where the previous
        # file's duplicated stub begins.
        my $line0 = $fileBLines[0];
        if ($line0 =~ /\(/ || $line0 =~ /\)/){
            # escape parentheses so they don't break the regex match below
            $line0 =~ s/\(/\\\(/g;
            $line0 =~ s/\)/\\\)/g;
        }
        my $line1 = $fileBLines[1];
        my $line2 = $fileBLines[2];
        open (FILEA,"$lastFile");
        my @fileALines = <FILEA>;
        close (FILEA);
        # Copy the previous file into <name>.new, stopping at the stub.
        my $newName = "$lastFile.new";
        open (OUT, ">$newName");
        my $i=0;
        my $done = 0;
        while ($done != 1 and $i < @fileALines){
            if ($fileALines[$i] =~ /$line0/
                && $fileALines[$i+1] eq $line1
                && $fileALines[$i+2] eq $line2) {
                $done=1;
            } else {
                print OUT $fileALines[$i];
                $i++;
            }
        }
        close (OUT);
    }
    $lastFile = $_;
}

EDIT: Added a check for parentheses in the first line (which goes into the regex duplicate check later on); if found, they are escaped so that they don't break that check.

dagorym
This looks really good and worked when tested on a small sample. Any advice for quickly generating the list? I'm playing with sort and find but they really don't seem to handle the fields well, especially the 11 and 1 in the second field.
fdsayre
got it: sort -t "_" -n -k2,2 -k3,3 -k4,4
fdsayre
error: Processing bul_2_6_200.txt
Unmatched ) in regex; marked by <-- HERE in m/PROCEEDINGS OF THE MEETING OF THE NORTH CENTRAL SECTK) <-- HERE N OF THE AMERICAN PSYCHOLOGICAL ASSOCIATION./ at ./x line 34.
fdsayre
line 34 = "if ($fileALines[$i] =~ /$line0/"
fdsayre
I'll take a look at it. Is that file (bul_2_6_200.txt) and its preceding file in the sets of files you provided?
dagorym
I just checked and it doesn't look like they are. It would be useful to have the files that caused the error to try to diagnose the problem.
dagorym
Don't need the files. Closer inspection revealed that the problem is coming from the first line of the second file having an unmatched parenthesis. Added a bit of code after the assignment to the $line0 variable to escape the () to prevent the problem. Try it now.
dagorym
Added file (along with some neighbours) to http://drop.io/fdsayre as more-test-material.tar.gz. Thanks... I've looked at the file but cannot see anything that should be causing this problem.
fdsayre
Would it help to just remove all punctuation before running the script? I don't really need punctuation, all I am interested in is the word frequency.
fdsayre
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE 4 8/ at ./x line 38.
fdsayre
Never mind. Some weird characters in some of these files, but only a couple, and the script's output makes it easy to fix these by hand. Am testing how well it works now. Thanks.
fdsayre
Fantastic. Thank you so much. I'll cite back to this page in the bibliography unless you would rather I cite another page/person. Mind if I contact you just to clarify what this script does? (I should probably understand how it works and not just that it works.)
fdsayre
Yes you can contact me at [email protected]. If you want, I can send you a heavily commented version describing what is happening and why.
dagorym
A: 

A quick stab at it, assuming that the stub is strictly identical in both files:

#!/usr/bin/perl

use strict;

use List::MoreUtils qw/ indexes all pairwise /;

my @files = @ARGV;

my @previous_text;

for my $filename ( @files ) {
    open my $in_fh,  '<', $filename          or die;
    open my $out_fh, '>', $filename.'.clean' or die;

    my @lines = <$in_fh>;
    print $out_fh destub( \@previous_text, @lines );
    @previous_text = @lines;
}


sub destub {
    my @previous = @{ shift() };
    my @lines = @_;

    my @potential_stubs = indexes { $_ eq $lines[0] } @previous;

    for my $i ( @potential_stubs ) {
        # check if the two documents overlap for that index
        my @p = @previous[ $i.. $#previous ];
        my @l = @lines[ 0..$#previous-$i ];

        return @lines[ $#previous-$i + 1 .. $#lines ]
                if all { $_ } pairwise { $a eq $b } @p, @l;

    }

    # no stub detected
    return @lines;
}
Yanick
+4  A: 

It looks like a much simpler solution would actually work.

No one seems to be using the information provided by the filenames. If you do make use of this information, you may not have to do any comparisons between files to identify the area of overlap. Whoever wrote the OCR probably put some thought into this problem.

The last number in the file name tells you what the starting page number for that file is. This page number appears on a line by itself in the file as well. It also looks like this line is preceded and followed by blank lines. Therefore, for a given file you should be able to look at the name of the next file in the sequence and determine the page number at which you should start removing text. Since this page number appears in your file, just look for a line that contains only this number (preceded and followed by blank lines) and delete that line and everything after it. The last file in the sequence can be left alone.

Here's an outline for an algorithm:

  1. choose a file; call it: file1
  2. look at the filename of the next file; call it: file2
  3. extract the page number from the filename of file2; call it: pageNumber
  4. scan the contents of file1 until you find a line that contains only pageNumber
  5. make sure this line is preceded and followed by a blank line.
  6. remove this line and everything after
  7. move on to the next file in the sequence
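
A hedged Perl sketch of that outline (my illustration, not the answerer's code); it assumes the bul_<volume>_<issue>_<page>.txt naming seen in the samples and writes trimmed copies with a .trimmed suffix rather than touching the originals:

#!/usr/bin/perl
# Sketch: for each file, find the line holding only the next file's starting
# page number (with blank lines around it) and drop it plus everything after.
# The last file in the sequence is left alone.
use strict;
use warnings;

my @files = @ARGV;   # assumed to already be in reading order

for my $i ( 0 .. $#files - 1 ) {
    my ($page) = $files[ $i + 1 ] =~ /_(\d+)\.txt$/ or next;

    open my $in, '<', $files[$i] or die "cannot read $files[$i]: $!";
    my @lines = <$in>;
    close $in;

    open my $out, '>', "$files[$i].trimmed"
        or die "cannot write $files[$i].trimmed: $!";
    for my $j ( 0 .. $#lines ) {
        last if $lines[$j] =~ /^\s*$page\s*$/                  # bare page number...
            and $j > 0       and $lines[ $j - 1 ] =~ /^\s*$/   # ...preceded by a blank
            and $j < $#lines and $lines[ $j + 1 ] =~ /^\s*$/;  # ...and followed by one
        print {$out} $lines[$j];
    }
    close $out;
}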
Waylon Flinn
Huh. Last night I realized the filenames were useful in this context, but I hadn't thought about the page numbers located WITHIN the file. That's rather clever.
fdsayre
Re: comment on OP: "Yeah... I just realized that. The ironic thing is that I already extract all the page numbers to use as metadata anyway, and they are just sitting in text files and a database linked with the file names and content."
fdsayre