ansaurus

Question

Automatically finding numbering patterns in filenames

Answer 1

+2 A:

There is a Perl module called String::Diff which has the ability to generate a regular expression for two different strings. The example it gives is

use String::Diff;

my $diff = String::Diff::diff_regexp('this is Perl', 'this is Ruby');
print "$diff\n";

outputs:

 this\ is\ (?:Perl|Ruby)

Maybe you could feed pairs of filenames into this kind of thing to get an initial regex. However, this wouldn't capture the numbers, so it wouldn't be completely automatic. After getting the diff you would have to hand-edit it, or apply some kind of substitution, to get a working final regex.
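One possible form of that substitution step is sketched here in Python rather than Perl. It skips the diff entirely and does not use String::Diff: it takes a single example filename, escapes the literal parts, and replaces every run of digits with a capturing group. The function name and the filenames are made up for illustration.

```python
import re


def generalize_digits(filename):
    """Build a regex from one filename, capturing each run of digits."""
    # With a capturing group in the pattern, re.split keeps the digit
    # runs in the result, at the odd indices.
    parts = re.split(r"(\d+)", filename)
    return "".join(
        r"(\d+)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )


# Hypothetical filenames, for illustration only.
pattern = generalize_digits("img_Z003_T012.tif")
match = re.fullmatch(pattern, "img_Z104_T7.tif")
print(pattern)
print(match.groups())
```

This treats every digit run as a variable field, so a constant number in the name (a date, say) is captured as well and would still need hand-editing.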

Kinopiko 2009-11-03 14:04:15

Answer 2

+1 A:

First of all, you are trying to do this the hard way. I suspect it is not impossible, but you would have to apply some artificial intelligence techniques, and it would be far more complicated than it is worth. Either a neural network or a genetic algorithm could be trained to recognize the Z numbers and T numbers, assuming that the formats Z[0-9]+ and T[0-9]+ always appear somewhere in the filename.

What I would do with this problem is to write a Python script to process all of the filenames. In this script, I would match twice against the filename, one time looking for Z[0-9]+ and one time looking for T[0-9]+. Each time I would count the matches for Z-numbers and T-numbers.

I would keep four running totals, two for Z-numbers and two for T-numbers. In each pair, one counts the filenames with exactly one match and the other counts the filenames with multiple matches. I would also count the total number of filenames processed.

At the end, I would report as follows:

nnnnnnnnnn filenames processed

Z-numbers matched only once in nnnnnnnnnn filenames.
Z-numbers matched multiple times in nnnnnn filenames.

T-numbers matched only once in nnnnnnnnnn filenames.
T-numbers matched multiple times in nnnnnn filenames.
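A minimal sketch of such a counting script follows. The function names and the sample filenames are assumptions made for illustration; only the two regexes and the report layout come from the description above.

```python
import re
from collections import Counter

Z_RE = re.compile(r"Z[0-9]+")
T_RE = re.compile(r"T[0-9]+")


def tally(filenames):
    """Count filenames with exactly one, or more than one, Z/T match."""
    counts = Counter()
    for name in filenames:
        counts["total"] += 1
        for label, regex in (("Z", Z_RE), ("T", T_RE)):
            hits = len(regex.findall(name))
            if hits == 1:
                counts[label, "once"] += 1
            elif hits > 1:
                counts[label, "multiple"] += 1
    return counts


def report(counts):
    """Format the tallies in the report layout described above."""
    lines = [f"{counts['total']} filenames processed"]
    for label in ("Z", "T"):
        lines.append("")
        lines.append(f"{label}-numbers matched only once in {counts[label, 'once']} filenames.")
        lines.append(f"{label}-numbers matched multiple times in {counts[label, 'multiple']} filenames.")
    return "\n".join(lines)


# Hypothetical filenames, for illustration only.
names = ["exp_Z01_T05.tif", "exp_Z02_T05_T06.tif", "notes.txt"]
print(report(tally(names)))
```

Filenames with no match at all are counted only in the total, so a gap between the total and the two counters for a letter shows how many names lack that number.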

If you are lucky, there will be no multiple matches at all, and you can use the regexes above to extract your numbers. However, if there is a significant number of multiple matches, you can run the script again with some print statements to show you example filenames that provoke a multiple match. This would tell you whether a simple adjustment to the regex might work.

For instance, if you have 23,768 multiple matches on T-numbers, then make the script print every 500th filename with multiple matches, which would give you 47 samples to examine.
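The sampling step could look roughly like this. The helper name and the synthetic filenames are invented for the sketch; the generated list simply reproduces the 23,768 multiple matches from the example.

```python
import re


def sample_multiples(filenames, regex, every=500):
    """Return every `every`-th filename in which `regex` matches more than once."""
    samples = []
    seen = 0
    for name in filenames:
        if len(regex.findall(name)) > 1:
            seen += 1
            if seen % every == 0:
                samples.append(name)
    return samples


# Synthetic filenames that each contain two T-numbers, for illustration only.
names = [f"run_T{i}_T{i + 1}.tif" for i in range(23768)]
samples = sample_multiples(names, re.compile(r"T[0-9]+"))
print(len(samples))  # 23768 // 500 == 47
```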

Probably something like [-/.= ]T[0-9]+[-/.= ] would be enough to get the multiple matches down to zero, while still giving exactly one match for every filename. Or at worst, [0-9][-/.= ]T[0-9]+[-/.= ]
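To see what the delimiters buy you, here is a small comparison on a made-up filename. The character class is written with the hyphen first so that it is a literal hyphen and not a range; the filename and variable names are assumptions for illustration.

```python
import re

# Hyphen placed first so it is a literal character, not a range.
DELIMS = r"[-/.= ]"
T_LOOSE = re.compile(r"T[0-9]+")
T_DELIMITED = re.compile(DELIMS + r"T([0-9]+)" + DELIMS)

# Hypothetical filename where "ROUT3" produces a spurious T-number.
name = "scan-T12-ROUT3.tif"

loose = T_LOOSE.findall(name)       # ['T12', 'T3']
strict = T_DELIMITED.findall(name)  # ['12']
print(loose, strict)
```

Because the delimiters are consumed by the match, a T-number at the very start or end of the name, or one delimited by a character outside the class such as an underscore, will not match; the sampled filenames should show whether that matters for your data.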

Michael Dillon 2009-11-03 14:07:31

Answer 3

+1 A:

For Python, see this question about TemplateMaker.

Ikke 2009-11-03 14:08:12

I just remembered that the LOCI guys did something similar, which is aimed at this kind of image numbering (among other things, they develop a library to import/export microscopy images). Look here: https://skyking.microscopy.wisc.edu/svn/java/trunk/components/bio-formats/src/loci/formats/FilePattern.java

Jolow 2009-11-03 16:18:46
