views:

130

answers:

3

I have been trying to figure out a way to detect series of files. For instance:

If a given directory has the following files:

  • Birthday001.jpg
  • Birthday002.jpg
  • Birthday003.jpg
  • Picknic1.jpg
  • Picknic2.jpg
  • Afternoon.jpg

I would like to condense the listing to something like

  • Birthday ( 3 pictures )
  • Picknic ( 2 pictures )
  • Afternoon ( 1 picture )

How should I go about detecting the groups?

+5  A: 

Here's one way you can solve this, which is more efficient than a brute force method.

  • load all the names into an associative array with key equal to the name and value equal to the name but with digits stripped (preg_replace('/\d/', '', $key)).

You will have something like $arr1 = [Birthday001 => Birthday, Birthday002 => Birthday ...]

  • now make another associative array with keys that are values from the first array and value which is a count. Increment the count when you've already seen the key.
  • in the end you will have a second array that contains the names and counts, just like you wanted. Something like $arr2 = [Birthday => 3, ...]
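
The steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the input list is hard-coded, using pathinfo() to drop the extension before stripping digits is an added assumption, and array_count_values() stands in for the manual counting loop.

```php
<?php

// Hypothetical input; in practice the names might come from scandir().
$filenames = ["Birthday001.jpg", "Birthday002.jpg", "Birthday003.jpg",
              "Picknic1.jpg", "Picknic2.jpg", "Afternoon.jpg"];

// Step 1: map each file name to its name with extension and digits stripped.
$arr1 = [];
foreach ($filenames as $filename) {
    $base = pathinfo($filename, PATHINFO_FILENAME);      // e.g. "Birthday001"
    $arr1[$filename] = preg_replace('/\d/', '', $base);  // e.g. "Birthday"
}

// Step 2: count how often each stripped name occurs.
$arr2 = array_count_values($arr1);

print_r($arr2);
```

With the six files from the question, $arr2 holds Birthday => 3, Picknic => 2 and Afternoon => 1.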
Artem Russakovskii
This would work if you assume that all semantic tokens are equal once the digits are stripped. It wouldn't handle items like "My Birthday001.jpg" and "MyBirthday002.jpg", but it is a good starting point.
Kitson
I absolutely agree. However, the question was not posed that way and whoever edited it to include My Birthday and group it with Birthday001, Birthday002 has changed the question considerably. The OP may actually want to group that into 2 different groups.
Artem Russakovskii
Yes, this is pretty much exactly what I am looking for. My main concern was matching the prefix string. This is a great starting point. Thank you.
Ambirex
I rolled back that edit adding the "My Birthday" entry--that was out of line.
Alan Moore
To deal with things like "My Birthday", you could try using the `levenshtein` function to calculate distance between tokens, and automatically group tokens with a distance less than a pre-set threshold.
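
A minimal sketch of that idea, assuming the tokens have already had digits and extensions stripped. The function name groupBySimilarity and the threshold of 3 are made up for illustration; the sketch does not answer how to pick a good threshold.

```php
<?php

// Put each name into the first group whose representative (its first member)
// is within $threshold edits; otherwise start a new group.
function groupBySimilarity(array $names, int $threshold): array
{
    $groups = [];
    foreach ($names as $name) {
        foreach ($groups as $representative => $members) {
            $distance = levenshtein(strtolower($representative), strtolower($name));
            if ($distance <= $threshold) {
                $groups[$representative][] = $name;
                continue 2; // placed; move on to the next name
            }
        }
        $groups[$name] = [$name]; // no close match: start a new group
    }
    return $groups;
}

// Hypothetical tokens; the threshold of 3 is an arbitrary choice.
$names = ["Birthday", "My Birthday", "MyBirthday", "Picknic", "Afternoon"];
print_r(groupBySimilarity($names, 3));
```

Note that the result depends on the order of the input, because each name is only compared against the first member of every existing group.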
Tobias Cohen
I had started going down that path, but the question became how do I automagically determine that threshold.
Ambirex
+2  A: 

Simply build a histogram whose keys are modified by a regex:

<?php

# input
$filenames = array("Birthday001.jpg", "Birthday002.jpg", "Birthday003.jpg", "Picknic1.jpg", "Picknic2.jpg", "Afternoon.jpg");

# create histogram
$histogram = array();
foreach ($filenames as $filename) {
    # strip optional trailing digits plus the extension (\d* so "Afternoon.jpg" also loses ".jpg")
    $name = preg_replace('/\d*\.[^.]*$/', '', $filename);
    if (isset($histogram[$name])) {
        $histogram[$name]++;
    } else {
        $histogram[$name] = 1;
    }
}

# output
foreach ($histogram as $name => $count) {
    if ($count == 1) {
        echo "$name ($count picture)\n";
    } else {
        echo "$name ($count pictures)\n";
    }
}

?>
vog
This is almost exactly the same as my version, except in code.
Artem Russakovskii
A: 

Generate an array of filler words like "my" (developing this array will be very important; "my" is the only one in your example) and strip these out of all the file names. Also strip out all numbers and punctuation; the extensions should be long gone by this point. Once this is done, put all of the unique results into an array. You can then use this as a fairly reliable source of keywords to search for any stragglers that the other processing didn't catch.
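
One possible sketch of the stripping step. The helper name keywordFor, the one-entry word list, and splitting on camel-case boundaries (so that "MyBirthday" yields "My" and "Birthday") are all assumptions made for illustration.

```php
<?php

// Reduce a file name to a keyword: drop the extension, digits, punctuation
// and any word found in the (lowercase) $stopWords list.
function keywordFor(string $filename, array $stopWords): string
{
    $name = pathinfo($filename, PATHINFO_FILENAME);        // drop the extension
    $name = preg_replace('/[\d[:punct:]]+/', ' ', $name);  // digits/punctuation -> space
    // Split on whitespace and on lower-to-upper case boundaries.
    $words = preg_split('/\s+|(?<=[a-z])(?=[A-Z])/', $name, -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($words, function ($word) use ($stopWords) {
        return !in_array(strtolower($word), $stopWords, true);
    });
    return implode(' ', $kept);
}

$stopWords = ["my"]; // hypothetical list; it would need to grow in practice
$filenames = ["Birthday001.jpg", "My Birthday.jpg", "MyBirthday002.jpg", "Picknic1.jpg"];

// Collect the unique keywords across all file names.
$keywords = array_values(array_unique(array_map(
    function ($filename) use ($stopWords) {
        return keywordFor($filename, $stopWords);
    },
    $filenames
)));

print_r($keywords);
```

For these four names the keyword list comes out as Birthday and Picknic, which could then be used to search the remaining file names.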

Shadow
Note: this answer is based on a revised version of the question which has since been rolled back. That version included a file named "My Birthday.jpg" which was supposed to be grouped with the other "Birthday" files.
Alan Moore