I have a huge number of files to sort all named in some terrible convention.
Here are some examples:

(4)_mr__mcloughlin____.txt
12__sir_john_farr____.txt
(b)mr__chope____.txt
dame_elaine_kellett-bowman____.txt
dr__blackburn______.txt

These names are supposed to be a different person (speaker) each. Someone in another IT department produced these from a ton of XML files using some script but the naming is unfathomably stupid as you can see.

I need to sort literally tens of thousands of these files with multiple files of text for each person; each with something stupid making the filename different, be it more underscores or some random number. They need to be sorted by speaker.

This would be easier with a script to do most of the work then I could just go back and merge folders that should be under the same name or whatever.

There are a number of ways I was thinking about doing this.

  • parse the names from each file and sort them into folders for each unique name.
  • get a list of all the unique names from the filenames, then look through this simplified list of unique names for similar ones and ask me whether they are the same, and once it has determined this it will sort them all accordingly.

I plan on using Perl, but I can try a new language if it's worth it. I'm not sure how to read each filename in a directory, one at a time, into a string so I can parse it into an actual name. I'm not completely sure how to parse with regexes in Perl either, but I can probably google that.

For the sorting, I was just gonna use the shell command:

`cp filename.txt /example/destination/filename.txt`

but only because that's all I know, so it's easiest.

I don't even have a pseudocode idea of what I'm going to do, so if someone knows the best sequence of actions, I'm all ears. I guess I am looking for a lot of help, and I am open to any suggestions. Many, many thanks to anyone who can help.

B.

+2  A: 

I've not used Perl in a while so I'm going to write this in Ruby. I will comment it to establish some pseudocode.

require 'fileutils'

DESTINATION = '/some/faraway/place/must/exist/and/ideally/be/empty'

# get a list of all .txt files in current directory
Dir["*.txt"].each do |filename|
  # strategy:
  # - chop off the extension
  # - switch to all lowercase
  # - drop a leading bracketed marker such as (4) or (b)
  # - get rid of everything but spaces, dashes, letters, underscores
  # - then swap any run of spaces, dashes, and underscores for a single space
  # - then strip whitespace off front and back
  name = File.basename(filename, '.txt').downcase.sub(/\A\(\w+\)/, '').
         gsub(/[^a-z_\s-]+/, '').gsub(/[_\s-]+/, ' ').strip
  target_folder = File.join(DESTINATION, name)

  # make sure we don't overwrite a file
  if File.exist?(target_folder) && !File.directory?(target_folder)
    raise "Destination folder is a file"
  # if directory doesn't exist then create it
  elsif !File.exist?(target_folder)
    Dir.mkdir(target_folder)
  end
  # now copy the file into the speaker's folder
  FileUtils.cp(filename, target_folder)
end

That's the idea, anyway - I've made sure all the API calls are correct, but this isn't tested code. Does this look like what you're trying to accomplish? Might this help you write the code in Perl?

wuputah
+1  A: 

You can split the filenames using something like

my @tokens = split /_+/, $filename;

The last entry of @tokens should be ".txt" for all of these filenames, and the second-to-last should be the last name, which will be similar for the same person even when the name has been misspelled in places (or "Dr. Jones" changed to "Brian Jones", for instance). You may want to use some sort of edit distance as a similarity metric to compare $tokens[-2] across filenames; when two entries have similar enough last names, the script should prompt you with them as a candidate for merging.
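
To make the edit-distance idea concrete, here is a rough sketch in Ruby (the language of the first answer; the same logic ports to Perl). It uses plain Levenshtein distance, which is only one possible metric. The helper name `merge_candidates` and the threshold of 2 are assumptions for illustration, not anything from the original setup.

```ruby
# Classic dynamic-programming Levenshtein distance between two strings.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: every pair of distinct names that differ by at most
# max_distance edits. The threshold is an assumption; tune it on real data.
def merge_candidates(names, max_distance = 2)
  names.uniq.combination(2).select { |x, y| levenshtein(x, y) <= max_distance }
end
```

Run it over the list of unique last names rather than over every filename, since the comparison is pairwise and the cost grows quadratically. Each pair it returns is only a suggestion to confirm by hand.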

ruds
+2  A: 

Are all the current files in the same directory? If that is the case then you could use 'opendir' and 'readdir' to read through all the files one by one. Build a hash using the file name as the key (remove all '_' as well as any information inside the brackets) so that you get something like this -

(4)_mr__mcloughlin____.txt -> 'mr mcloughlin'
12__sir_john_farr____.txt -> 'sir john farr'
(b)mr__chope____.txt -> 'mr chope'
dame_elaine_kellett-bowman____.txt -> 'dame elaine kellett-bowman'
dr__blackburn______.txt -> 'dr blackburn'

Set the value for each key to the number of times that name has occurred so far. So after these entries you should have a hash that looks like this -

'mr mcloughlin' => 1
'sir john farr' => 1
'mr chope' => 1
'dame elaine kellett-bowman' => 1
'dr blackburn' => 1

Whenever you come across a new entry in your hash, simply create a new directory using the key name. Now all you have to do is copy the file with the changed name (use the corresponding hash value as a suffix) into the new directory. So, for example, if you were to stumble upon another entry which reads as 'mr mcloughlin', then you could copy it as

./mr mcloughlin/mr mcloughlin_2.txt
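
A minimal sketch of that counting scheme, written in Ruby to match the first answer. The function names `speaker_key` and `plan_copies` are made up for illustration, and the sketch only works out destination paths without touching the disk, so you can inspect the plan before copying anything. Numbering the first file `_1` is also an assumption.

```ruby
# Normalise a raw filename into a speaker key, e.g.
# "(4)_mr__mcloughlin____.txt" becomes "mr mcloughlin".
def speaker_key(filename)
  File.basename(filename, '.txt')
      .sub(/\A\(\w+\)/, '')   # drop a leading bracketed marker like (4) or (b)
      .sub(/\A\d+/, '')       # drop a leading number like 12
      .tr('_', ' ')           # underscores become spaces
      .squeeze(' ')           # collapse runs of spaces
      .strip
      .downcase
end

# Return [source, destination] pairs; the per-speaker counter supplies the suffix.
def plan_copies(filenames)
  seen = Hash.new(0)
  filenames.map do |f|
    key = speaker_key(f)
    seen[key] += 1
    [f, File.join(key, "#{key}_#{seen[key]}.txt")]
  end
end
```

Once the plan looks right, each pair can be handed to a directory-creation and copy step.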
muteW
+4  A: 

I hope I understand your question right, it's a bit ambiguous IMHO. This code is untested, but should do what I think you want.

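use strict;
use warnings;
use File::Copy;

sub sanitize {
    local $_ = lc shift;
    s/\.txt$//;                    # drop the extension
    s/\(\w+\)|\d+//g;              # drop "(4)", "(b)" and stray numbers
    s/_+/ /g;                      # underscores become spaces
    s/\b(?:dame|dr|mr|sir)\b//g;   # drop titles, whole words only
    s/ +/ /g;                      # collapse runs of spaces
    s/^ | $//g;                    # trim both ends
    return $_;
}

sub sort_files_to_dirs {
    my @files = @_;
    for my $filename (@files) {
        my $dirname = sanitize($filename);
        next if $dirname eq '';    # nothing left after stripping
        mkdir $dirname if not -e $dirname;
        copy($filename, "$dirname/$filename")
            or warn "Could not copy $filename: $!";
    }
}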
Leon Timmermans
Not sure if it's needed, but this will not preserve titles and dashes in names. You could use this instead: m/([a-zA-Z-_]+).txt$/ Also, I think you have to create directories before you copy into them. Either use mkpath or fcopy.
drby
Added the directory creation. Stripping out titles is deliberate. It will preserve the dashes in the name. Your character class is not correct, because '-' has special meaning inside one. You should either escape it, or put it as first/last in the character class, else it won't work as advertised.
Leon Timmermans
I actually wrote the script and tested it. The regex worked (though my script looked a little different). Just try it yourself: my $string = "dame_elaine_kellett-bowman____.txt";if($string =~ m/([a-zA-Z-_]+).txt$/) {print $1;}
drby
This basically worked just right. I obviously still have to go through a couple thousand folders for similar names to merge, but it's saved me a ton of work, which was the point, I guess. Too bad I don't have time to sit down and make it really effective with some "intelligence".
gnomed
+2  A: 

I would:

  1. define what's significant in the name:

    • is dr__blackburn different than dr_blackburn?
    • is dr__blackburn different than mr__blackburn?
    • are leading numbers meaningful?
    • are leading/trailing underscores meaningful?
    • etc.
  2. come up with rules and an algorithm to convert a name to a directory (Leon's is a very good start)

  3. read in the names and process them one at a time

    • I would use some combination of opendir and recursion
    • I would copy them as you process them; again Leon's post is a great example
  4. if this script will need to be maintained and used in the future, I would definitely create tests (e.g. using http://search.cpan.org/dist/Test-More/) for each regexp path; when you find a new wrinkle, add a new test and make sure it fails, then fix the regex, then re-run the tests to make sure nothing broke
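
As a rough illustration of step 4, here is a table-driven check sketched in Ruby (following the Ruby answer above) instead of Test::More; the same shape works with Test::More's `is()`. The rule `dir_name_for`, the `CASES` table and the `failing_cases` helper are all hypothetical: they assume titles, numbers and bracketed markers are noise, which is exactly the kind of decision step 1 asks you to make first.

```ruby
# Hypothetical naming rule under test: filename -> directory name.
def dir_name_for(filename)
  File.basename(filename.downcase, '.txt')
      .gsub(/\(\w+\)|\d+/, '')               # bracketed markers and numbers
      .tr('_', ' ')                          # underscores become spaces
      .gsub(/\b(?:dame|dr|mr|sir)\b/, '')    # titles, whole words only
      .squeeze(' ')
      .strip
end

# One entry per "wrinkle": raw filename => expected directory name.
CASES = {
  '(4)_mr__mcloughlin____.txt'         => 'mcloughlin',
  '12__sir_john_farr____.txt'          => 'john farr',
  'dame_elaine_kellett-bowman____.txt' => 'elaine kellett-bowman',
  # made-up regression case: "dr" inside a name must not be stripped
  'mr__andrew____.txt'                 => 'andrew'
}.freeze

# Returns the subset of cases whose actual output differs from the expectation.
def failing_cases(cases)
  cases.reject { |input, expected| dir_name_for(input) == expected }
end
```

When a new oddity turns up, add it to the table, confirm that `failing_cases(CASES)` reports it, then change the rule until the result is empty again.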

Joe Casadonte
+1  A: 

As you are asking a very general question, any language could do this as long as we have a better codification of rules. We don't even have the specifics, only a "sample".

So, working blind, it looks like human monitoring will be needed. So the idea is a sieve. Something you can repeatedly run and check and run again and check again and again until you've got everything sorted to a few small manual tasks.

The code below makes a lot of assumptions, because you pretty much left it to us to handle it. One of which is that the sample is a list of all the possible last names; if there are any other last names, add 'em and run it again.

use strict;
use warnings;
use File::Copy;
use File::Find::Rule;
use File::Spec;
use Readonly;

Readonly my $SOURCE_ROOT    => '/mess/they/left';
Readonly my $DEST_DIRECTORY => '/where/i/want/all/this';

my @lname_list = qw<mcloughlin farr chope kellett-bowman blackburn>;
my $lname_regex 
    = join( '|'
          , sort {  ( $b =~ /\P{Alpha}/ ) <=> ( $a =~ /\P{Alpha}/ )
                 || ( length $b ) <=> ( length $a ) 
                 || $a cmp $b 
                 } @lname_list 
          )
    ;
my %dest_dir_for;

sub get_dest_directory { 
    my $case = shift;
    my $dest_dir = $dest_dir_for{$case};
    return $dest_dir if $dest_dir;

    $dest_dir = $dest_dir_for{$case}
        = File::Spec->catfile( $DEST_DIRECTORY, $case )
        ;
    unless ( -e $dest_dir ) { 
        mkdir $dest_dir;
    }
    return $dest_dir;
}

foreach my $file_path ( 
    File::Find::Rule->file
        ->name( '*.txt' )->in( $SOURCE_ROOT )
) {
    my $file_name =  [ File::Spec->splitpath( $file_path ) ]->[2];
    $file_name    =~ s/[^\p{Alpha}.-]+/_/g;
    $file_name    =~ s/^_//;
    $file_name    =~ s/_[.]/./;

    # non-capturing prefix group, so $case receives the last name itself
    my ( $case )  =  $file_name =~ m/(?:^|_)($lname_regex)[._]/i;

    next unless $case;
    # as we next-ed, we're dealing with only the cases we want here. 

    move( $file_path
        , File::Spec->catfile( get_dest_directory( lc $case )
                             , $file_name 
                             )
        );
}
Axeman