You should probably try something like this (I've now tested it on the sample data you provided):
#!/usr/bin/ruby
class A_splitter
Title = /^[A-Z]+[^a-z]*$/
Byline = /^BY /
Number = /^\d*$/
Blank_line = /^ *$/
attr_accessor :recent_lines,:in_references,:source_glob,:destination_path,:seen_in_last_file
def initialize(src_glob,dst_path=nil)
@recent_lines = []
@seen_in_last_file = {}
@in_references = false
@source_glob = src_glob
@destination_path = dst_path
@destination = STDOUT
@buffer = []
split_em
end
def split_here
if destination_path
@destination.close if @destination
@destination = nil
else
print "------------SPLIT HERE------------\n"
end
print recent_lines.shift
@in_references = false
end
def at_page_break
((recent_lines[0] =~ Title and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
(recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
end
def print(*args)
(@destination || @buffer) << args
end
def split_em
Dir.glob(source_glob).sort.each { |filename|
if destination_path
@destination.close if @destination
@destination = File.open(File.join(@destination_path,filename),'w')
print @buffer
@buffer.clear
end
in_header = true
File.foreach(filename) { |line|
line.gsub!(/\f/,'')
if in_header and seen_in_last_file[line]
#skip it
else
seen_in_last_file.clear if in_header
in_header = false
recent_lines << line
seen_in_last_file[line] = true
end
3.times {recent_lines.shift} if at_page_break
if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
split_here
elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
split_here
elsif recent_lines.length > 4
@in_references ||= recent_lines[0] =~ /^REFERENCES *$/
print recent_lines.shift
end
}
}
print recent_lines
@destination.close if @destination
end
end
A_splitter.new('bul_*_*_*.txt','test_dir')
Basically, run through the files in order, and within each file run through the lines in order, omitting from each file the lines that were present in the preceding file and printing the rest to STDOUT (from which it can be piped) unless a destination director is specified (called 'test_dir' in the example see the last line) in which case files are created in the specified directory with the same name as the file which contained the bulk of their contents.
It also removes the page-break sections (journal title, author, and page number).
It does two split tests:
- a test on the title/byline pair
- a test on the first title-line after a reference section
(it should be obvious how to add tests for additional split-points).
Retained for posterity:
If you don't specify a destination directory it simply puts a split-here line in the output stream at the split point. This should make it easier for testing (you can just less
the output) and when you want them in individual files just pipe it to csplit
(e.g. with
csplit -f abstracts - '---SPLIT HERE---' '{*}'
or something) to cut it up.