ansaurus

Question

How would I sort files to directories based on filenames?

Answer 1

+2 A:

I've not used Perl in a while so I'm going to write this in Ruby. I will comment it to establish some pseudocode.

require 'fileutils'

DESTINATION = '/some/faraway/place/must/exist/and/ideally/be/empty'

# get a list of all .txt files in current directory
Dir["*.txt"].each do |filename|
  # strategy:
  # - chop off the extension
  # - switch to all lowercase
  # - get rid of everything but spaces, dashes, letters, underscores
  # - then swap any run of spaces, dashes, and underscores for a single space
  # - then strip whitespace off front and back
  name = File.basename(filename, '.txt').downcase.
         gsub(/[^a-z_\s-]+/, '').gsub(/[_\s-]+/, ' ').strip
  target_folder = DESTINATION + '/' + name

  # make sure we don't overwrite a file
  if File.exist?(target_folder) && !File.directory?(target_folder)
    raise "Destination folder is a file"
  # if the directory doesn't exist then create it
  elsif !File.exist?(target_folder)
    Dir.mkdir(target_folder)
  end
  # now copy the file into the folder (FileUtils.cp, since File.copy
  # only existed via the old ftools library)
  FileUtils.cp(filename, target_folder)
end

That's the idea, anyway - I've made sure all the API calls are correct, but this isn't tested code. Does this look like what you're trying to accomplish? Might this help you write the code in Perl?

wuputah 2009-02-16 07:58:22

Answer 2

+1 A:

You can split the filenames using something like

my @tokens = split /_+/, $filename;

The last entry of @tokens should be ".txt" for all of these filenames, but the second-to-last should be similar for the same person whose name has been misspelled in places (or "Dr. Jones" changed to "Brian Jones" for instance). You may want to use some sort of edit distance as a similarity metric to compare $tokens[-2] across filenames; when two entries have similar enough last names, the script should flag them to you as candidates for merging.
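
A rough sketch of that idea, written in Ruby like the first answer rather than Perl. Everything named here is an assumption for illustration: the helpers `levenshtein`, `last_name` and `merge_candidates`, the threshold of 2 edits, and the misspelled sample file 'mr_blackburne___.txt'.

```ruby
# Hypothetical helper: classic dynamic-programming Levenshtein distance.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: the token just before ".txt" after splitting on
# runs of underscores (the $tokens[-2] of the Perl snippet).
def last_name(filename)
  filename.split(/_+/)[-2].to_s.downcase
end

# Pairs of distinct last names that are within max_distance edits.
def merge_candidates(filenames, max_distance = 2)
  names = filenames.map { |f| last_name(f) }.uniq
  names.combination(2).select { |x, y| levenshtein(x, y) <= max_distance }
end

files = ['dr__blackburn______.txt', 'mr_blackburne___.txt',
         '12__sir_john_farr____.txt']
merge_candidates(files).each { |x, y| puts "merge candidate: #{x} / #{y}" }
```

Comparing every pair is quadratic in the number of distinct names, which is probably fine for a few thousand files, but the threshold would need tuning by hand against the real data.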

ruds 2009-02-16 08:12:13

Answer 3

+2 A:

Are all the current files in the same directory? If so, you could use 'opendir' and 'readdir' to read through the files one by one. Build a hash using the cleaned-up file name as the key (replace each run of '_' with a single space, and drop the extension, any leading number, and anything inside brackets) so that you get something like this -

(4)_mr__mcloughlin____.txt -> 'mr mcloughlin'
12__sir_john_farr____.txt -> 'sir john farr'
(b)mr__chope____.txt -> 'mr chope'
dame_elaine_kellett-bowman____.txt -> 'dame elaine kellett-bowman'
dr__blackburn______.txt -> 'dr blackburn'

Set each hash value to the number of times that name has occurred so far. After these entries you should have a hash that looks like this -

'mr mcloughlin' => 1
'sir john farr' => 1
'mr chope' => 1
'dame elaine kellett-bowman' => 1
'dr blackburn' => 1

Whenever you come across a new key in your hash, simply create a new directory named after the key. Then all you have to do is copy the file, renamed with the corresponding hash value as a suffix, into that directory. For example, if you were to stumble upon another file that reduces to 'mr mcloughlin', you could copy it as

./mr mcloughlin/mr mcloughlin_2.txt
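
The steps above might look roughly like this in Ruby (the language of the first answer). This is only a sketch under assumptions: `name_key` and `sort_into_dirs` are made-up helper names, the destination root is passed in by the caller, and the cleanup rules are my reading of the examples, not a tested specification.

```ruby
require 'fileutils'

# Hypothetical helper: reduce a filename to its hash key, e.g.
# "(4)_mr__mcloughlin____.txt" becomes "mr mcloughlin".
def name_key(filename)
  File.basename(filename, '.txt').downcase
      .gsub(/\([^)]*\)/, '')   # drop anything inside brackets
      .sub(/\A\d+/, '')        # drop a leading number
      .gsub(/_+/, ' ')         # runs of underscores become one space
      .strip
end

# Copies each file to dest_root/<key>/<key>_<n>.txt, where n is how many
# times that key has been seen so far. Returns the hash of counts.
def sort_into_dirs(filenames, dest_root)
  counts = Hash.new(0)
  filenames.each do |path|
    key = name_key(path)
    next if key.empty?       # nothing usable left in the name
    counts[key] += 1
    dir = File.join(dest_root, key)
    FileUtils.mkdir_p(dir)   # does nothing if the directory already exists
    FileUtils.cp(path, File.join(dir, "#{key}_#{counts[key]}.txt"))
  end
  counts
end
```

Calling something like sort_into_dirs(Dir['*.txt'], '/some/destination') would then process the current directory. Note that in this sketch the first file for a name also gets a suffix (_1), which keeps the naming uniform.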

muteW 2009-02-16 08:15:25

Answer 4

+4 A:

I hope I understand your question right, it's a bit ambiguous IMHO. This code is untested, but should do what I think you want.

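use strict;
use warnings;
use File::Copy;

sub sanitize {
    local $_ = lc shift;
    # strip the extension, then titles, numbers and bracketed prefixes;
    # the lookarounds stop "dr"/"mr" from matching inside a name
    s/\.txt$//;
    s/(?<![a-z])(?:dame|dr|mr|sir)(?![a-z])|\d+|\(\w+\)//g;
    s/[ _]+/ /g;
    s/^ | $//g;
    return $_;
}

sub sort_files_to_dirs {
    my @files = @_;
    for my $filename (@files) {
        my $dirname = sanitize($filename);
        mkdir $dirname if not -e $dirname;
        copy($filename, "$dirname/$filename");
    }
}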

Leon Timmermans 2009-02-16 09:05:44

Not sure if it's needed, but this will not preserve titles and dashes in names. You could use this instead: m/([a-zA-Z-_]+).txt$/ Also, I think you have to create directories before you copy into them. Either use mkpath (File::Path) or fcopy (File::Copy::Recursive).

drby 2009-02-16 10:42:01

Added the directory creation. Stripping out titles is deliberate. It will preserve the dashes in the name. Your character class is not correct, because '-' has special meaning inside one. You should either escape it, or put it as first/last in the character class, else it won't work as advertised.

Leon Timmermans 2009-02-16 10:59:31

I actually wrote the script and tested it. The regex worked (though my script looked a little different). Just try it yourself: my $string = "dame_elaine_kellett-bowman____.txt";if($string =~ m/([a-zA-Z-_]+).txt$/) {print $1;}

drby 2009-02-16 11:13:26

This basically worked just right. I obviously still have to go through a couple thousand folders for similar names to merge, but it's saved me a ton of work, which was the point I guess. Too bad I don't have time to sit down and make it really effective with some "intelligence".

gnomed 2009-02-17 20:56:50

Answer 5

+2 A:

I would:

1. define what's significant in the name:
   - is dr__blackburn different from dr_blackburn?
   - is dr__blackburn different from mr__blackburn?
   - are leading numbers meaningful?
   - are leading/trailing underscores meaningful?
   - etc.
2. come up with rules and an algorithm to convert a name to a directory (Leon's is a very good start)
3. read in the names and process them one at a time
   - I would use some combination of opendir and recursion
   - I would copy them as you process them; again Leon's post is a great example
4. if this script will need to be maintained and used in the future, I would definitely create tests (e.g. using http://search.cpan.org/dist/Test-More/) for each regexp path; when you find a new wrinkle, add a new test and make sure it fails, then fix the regex, then re-run the tests to make sure nothing broke
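
To make the testing step concrete, here is a small table-driven check, sketched in Ruby (as in the first answer) instead of Test::More. The rule set inside `dir_name_for` is purely an assumption (lowercase, drop bracketed prefixes and digits, collapse underscores, keep titles and dashes); swap in whatever rules you settle on. `dir_name_for` and `cases` are made-up names.

```ruby
# Hypothetical rule set for turning a filename into a directory name.
def dir_name_for(filename)
  filename.downcase
          .sub(/\.txt\z/, '')        # drop the extension
          .gsub(/\(\w+\)|\d+/, '')   # drop "(4)", "(b)" and bare numbers
          .gsub(/[_ ]+/, ' ')        # collapse underscores and spaces
          .strip
end

# One row per "wrinkle": input filename => expected directory name.
cases = {
  'dr__blackburn______.txt'            => 'dr blackburn',
  'dr_blackburn.txt'                   => 'dr blackburn',
  '(4)_mr__mcloughlin____.txt'         => 'mr mcloughlin',
  '12__sir_john_farr____.txt'          => 'sir john farr',
  'dame_elaine_kellett-bowman____.txt' => 'dame elaine kellett-bowman'
}

# Collect every row whose actual result differs from the expectation.
failures = cases.reject { |input, expected| dir_name_for(input) == expected }
failures.each do |input, expected|
  puts "FAIL #{input}: expected #{expected.inspect}, " \
       "got #{dir_name_for(input).inspect}"
end
puts "#{cases.size - failures.size}/#{cases.size} cases passed"
```

When a new oddity turns up in the filenames, add a row for it first, watch it fail, then adjust the rules until every row passes again.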

Joe Casadonte 2009-02-16 18:54:48

Answer 6

+1 A:

As you are asking a very general question, any language could do this as long as we have a better codification of rules. We don't even have the specifics, only a "sample".

So, working blind, it looks like human monitoring will be needed. So the idea is a sieve. Something you can repeatedly run and check and run again and check again and again until you've got everything sorted to a few small manual tasks.

The code below makes a lot of assumptions, because you pretty much left it to us to handle it. One of which is that the sample is a list of all the possible last names; if there are any other last names, add 'em and run it again.

use strict;
use warnings;
use File::Copy;
use File::Find::Rule;
use File::Spec;
use Readonly;

Readonly my $SOURCE_ROOT    => '/mess/they/left';
Readonly my $DEST_DIRECTORY => '/where/i/want/all/this';

my @lname_list = qw<mcloughlin farr chope kellett-bowman blackburn>;
my $lname_regex 
    = join( '|'
          , sort {  ( $b =~ /\P{Alpha}/ ) <=> ( $a =~ /\P{Alpha}/ )
                 || ( length $b ) <=> ( length $a ) 
                 || $a cmp $b 
                 } @lname_list 
          )
    ;
my %dest_dir_for;

sub get_dest_directory { 
    my $case = shift;
    my $dest_dir = $dest_dir_for{$case};
    return $dest_dir if $dest_dir;

    $dest_dir = $dest_dir_for{$case}
        = File::Spec->catfile( $DEST_DIRECTORY, $case )
        ;
    unless ( -e $dest_dir ) { 
        mkdir $dest_dir;
    }
    return $dest_dir;
}

foreach my $file_path ( 
    File::Find::Rule->file
        ->name( '*.txt' )->in( $SOURCE_ROOT )
) {
    my $file_name =  [ File::Spec->splitpath( $file_path ) ]->[2];
    $file_name    =~ s/[^\p{Alpha}.-]+/_/g;
    $file_name    =~ s/^_//;
    $file_name    =~ s/_[.]/./;

    # non-capturing prefix, so $case receives the last name, not the "^|_" match
    my ( $case )  =  $file_name =~ m/(?:^|_)($lname_regex)[._]/i;

    next unless $case;
    # as we next-ed, we're dealing with only the cases we want here. 

    move( $file_path
        , File::Spec->catfile( get_dest_directory( lc $case )
                             , $file_name 
                             )
        );
}

Axeman 2009-02-16 21:44:52
