views:

108

answers:

3

Hi all,

Intro

I work in a facility where we have microscopes. These guys can be asked to generate 4D movies of a sample: they take e.g. 10 pictures at different Z position, then wait a certain amount of time (next timepoint) and take 10 slices again. They can be asked to save a file for each slice, and they use an explicit naming pattern, something like 2009-11-03-experiment1-Z07-T42.tif. The file names are numbered to reflect the Z position and the time point

Question

Once you have all these file names, you can use a regex pattern to extract the Z and T value, if you know the backbone pattern of the file name. This I know how to do.

The question I have is: do you know a way to automatically generate regex pattern from the file name list? For instance, there is an awesome tool on the net that does similar thing: txt2re.

What algorithm would you use to parse all the file name list and generate a most likely regex pattern?

+2  A: 

There is a Perl module called String::Diff which has the ability to generate a regular expression for two different strings. The example it gives is

my $diff = String::Diff::diff_regexp('this is Perl', 'this is Ruby');
print "$diff\n";

outputs:

 this\ is\ (?:Perl|Ruby)

Maybe you could feed pairs of filenames into this kind of thing to get an initial regex. However, this wouldn't give you capturing of numbers etc. so it wouldn't be completely automatic. After getting the diff you would have to hand-edit or do some kind of substitution to get a working final regex.

Kinopiko
+1  A: 

First of all, you are trying to do this the hard way. I suspect that this may not be impossible but you would have to apply some artificial intelligence techniques and it would be far more complicated than it is worth. Either neural networks or a genetic algorithm system could be trained to recognize the Z numbers and T numbers, assuming that the format of Z[0-9]+ and T[0-9]+ is always used somewhere in the regex.

What I would do with this problem is to write a Python script to process all of the filenames. In this script, I would match twice against the filename, one time looking for Z[0-9]+ and one time looking for T[0-9]+. Each time I would count the matches for Z-numbers and T-numbers.

I would keep four other counters with running totals, two for Z-numbers and two for T-numbers. Each pair would represent the count of filenames with 1 match, and the ones with multiple matches. And I would count the total number of filenames processed.

At the end, I would report as follows:

nnnnnnnnnn filenames processed

Z-numbers matched only once in nnnnnnnnnn filenames.
Z-numbers matched multiple times in nnnnnn filenames.

T-numbers matched only once in nnnnnnnnnn filenames.
T-numbers matched multiple times in nnnnnn filenames.

If you are lucky, there will be no multiple matches at all, and you could use the regexes above to extract your numbers. However, if there are any significant number of multiple matches, you can run the script again with some print statements to show you example filenames that provoke a multiple match. This would tell you whether or not a simple adjustment to the regex might work.

For instance, if you have 23,768 multiple matches on T-numbers, then make the script print every 500th filename with multiple matches, which would give you 47 samples to examine.

Probably something like [ -/.=]T[0-9]+[ -/.=] would be enough to get the multiple matches down to zero, while also giving a one-time match for every filename. Or at worst, [0-9][ -/.=]T[0-9]+[ -/.=]

Michael Dillon
+1  A: 

For Python, see this question about TemplateMaker.

Ikke
I just remembered that the LOCI guys did something similar, that is aimed at this kind of image numbering (amongst other things, they develop a library to import/export microscopy images). Look here: <a href https://skyking.microscopy.wisc.edu/svn/java/trunk/components/bio-formats/src/loci/formats/FilePattern.java />
Jolow