You have multiple options.
For each option you should probably massage the album names before performing the comparisons. You can do this by stripping punctuation, sorting the words in the album name alphabetically (in certain cases), etc.
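As a rough illustration of that massaging step, here is a minimal sketch. The helper name normalizeAlbumName() and its $sortWords flag are made up for this example, and it joins the words with a space, whereas the snippets further down join them with nothing.

```php
<?php
// Hypothetical helper: the name and the exact rules are illustrative assumptions.
function normalizeAlbumName(string $name, bool $sortWords = true): string
{
    // Keep only letters, digits and spaces, then lowercase everything.
    $clean = strtolower(preg_replace('/[^a-zA-Z0-9 ]/', '', $name));
    // Split on runs of spaces and drop the empty pieces.
    $words = preg_split('/ +/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    if ($sortWords) {
        // Alphabetical order makes "A - B" and "B - A" normalize to the same string.
        sort($words, SORT_STRING);
    }
    return implode(' ', $words);
}

echo normalizeAlbumName('Our Swords - Band of Horses'), "\n";
// band horses of our swords
echo normalizeAlbumName("'Laredo' by Band of Horses on Q TV", false), "\n";
// laredo by band of horses on q tv
```

Sorting the words helps when the same title appears with the artist before or after the song, but it can also make unrelated titles look more alike, which is why it is only useful "in certain cases."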
In each case, when you do a comparison, if you remove one of the album names from the array, then your comparison is order sensitive, unless you make a rule as to which album name to remove. So, it probably makes sense to always remove the longer album name if two album names are compared and found to be "similar."
The main comparison options are:
Simple substring comparisons. Check whether one album name is inside another. Strip punctuation first and compare case-insensitively (see my second code snippet below).
Check album name similarity using levenshtein(). This string comparison is more efficient than similar_text(). You should strip punctuation and order words alphabetically.
Check album name similarity using similar_text(). I had the most luck with this method. In fact, I got it to pick the exact album names you wanted (see first code snippet below).
There are various other string comparison functions you can play around with, including soundex() and metaphone().
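To get a feel for what these functions return, here is a small sketch using generic words rather than your album names (the inputs are just illustrative):

```php
<?php
// Edit distance: how many single-character edits turn one string into the other.
echo levenshtein('kitten', 'sitting'), "\n"; // 3

// similar_text() returns the number of matching characters and writes
// a percentage into the optional third argument.
$matching = similar_text('World', 'Word', $percent);
echo $matching, ' (', round($percent, 1), "%)\n"; // 4 (88.9%)

// Phonetic keys: names that sound alike can map to the same key.
var_dump(soundex('Robert') === soundex('Rupert')); // bool(true)
echo metaphone('Thompson'), "\n";
```

Note that levenshtein() gives a distance (lower means more similar), while similar_text() gives a similarity (higher means more similar), so the thresholds go in opposite directions.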
Anyway, here are 2 solutions.
The first uses similar_text(), but it calculates the similarity only after all punctuation has been stripped and the words have been lowercased and put into alphabetical order. The downside is that you have to play around with the similarity threshold. The second uses a simple case-insensitive substring test after all punctuation and white space has been stripped.
The way both code snippets work is that they use array_walk() to run the compare() function on each album in the array. Then, inside the compare() function, I use foreach to compare the current album to all the other albums. There's ample room to make things more efficient.
Note that I should be passing the array by reference as the 3rd argument of array_walk(). Can someone help me do this? The current workaround is a global variable:
Live example (69% similarity threshold)
function compare($value, $key)
{
    global $array; // Should use 3rd argument of compare instead

    // Strip punctuation, lowercase, and sort the words alphabetically
    $value = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value));
    $value = explode(" ", $value);
    sort($value);
    $value = implode($value);
    $value = preg_replace("/[\s]/", "", $value); // Remove any leftover \s

    foreach ($array as $key2 => $value2)
    {
        if ($key != $key2)
        {
            // collapse, and lower case the string
            $value2 = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value2));
            $value2 = explode(" ", $value2);
            sort($value2);
            $value2 = implode($value2);
            $value2 = preg_replace("/[\s]/", "", $value2);

            // Set up the similarity
            similar_text($value, $value2, $sim);
            if ($sim > 69)
            {
                // Remove the longer album name
                unset($array[ ((strlen($value) > strlen($value2)) ? $key : $key2) ]);
            }
        }
    }
}

array_walk($array, 'compare');
$array = array_values($array);
print_r($array);
The output of the above is:
Array
(
[0] => Band of Horses - Is There a Ghost
[1] => Band Of Horses - No One's Gonna Love You
[2] => Band of Horses - The Funeral
[3] => Band of Horses - Laredo
[4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
[5] => Band of Horses perform Marry Song at Tromso Wedding
[6] => Band of Horses, On My Way Back Home
[7] => Band of Horses - cigarettes wedding bands
[8] => Band Of Horses - I Go To The Barn Because I Like The
[9] => Our Swords - Band of Horses
[10] => Band of Horses - Monsters
)
Note that the short version of "Marry song" is missing, so it must have been a false positive against something else, since the long version is still in the list. Even so, these are precisely the album names you wanted.
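One way to sidestep the global variable altogether is to skip array_walk() and use two plain loops inside a function that returns the filtered list. The following is only a sketch of that idea: the names sortedKey() and removeSimilarAlbums() are made up here, and because it skips albums that were already removed, it will not necessarily keep exactly the same albums as the snippet above.

```php
<?php
// Hypothetical helpers; a sketch under the assumptions stated above.
function sortedKey(string $name): string
{
    // Strip punctuation, lowercase, sort the words, and join them without spaces.
    $clean = strtolower(preg_replace('/[^a-zA-Z0-9 ]/', '', $name));
    $words = preg_split('/ +/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    sort($words, SORT_STRING);
    return implode('', $words);
}

function removeSimilarAlbums(array $albums, float $threshold = 69.0): array
{
    $albums  = array_values($albums);
    $keys    = array_map('sortedKey', $albums); // normalize each name only once
    $count   = count($albums);
    $removed = array_fill(0, $count, false);

    for ($i = 0; $i < $count; $i++) {
        // Stop comparing album $i as soon as it has been removed itself.
        for ($j = $i + 1; $j < $count && !$removed[$i]; $j++) {
            if ($removed[$j]) {
                continue;
            }
            similar_text($keys[$i], $keys[$j], $percent);
            if ($percent > $threshold) {
                // Drop the longer normalized name; on a tie, drop the later one.
                $drop = strlen($keys[$i]) > strlen($keys[$j]) ? $i : $j;
                $removed[$drop] = true;
            }
        }
    }

    $kept = [];
    foreach ($albums as $i => $album) {
        if (!$removed[$i]) {
            $kept[] = $album;
        }
    }
    return $kept;
}

// Illustrative input, not the array from the question.
print_r(removeSimilarAlbums([
    'Band of Horses - Laredo',
    'Laredo, Band of Horses (live)',
    'xyz',
]));
```

Because each pair is compared only once and the "remove the longer name" rule decides which one goes, the result does not depend on which album of a pair is visited first.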
The substring method:
function compare($value, $key)
{
    // I should be using &$array as a 3rd variable.
    // For some reason couldn't get that to work, so I do this instead.
    global $array;

    // Take the current album name and remove all punctuation and white space
    $value = preg_replace("/[^a-zA-Z0-9]/", "", $value);

    // Compare current album to all others
    foreach ($array as $key2 => $value2)
    {
        if ($key != $key2)
        {
            // collapse the album being compared to
            $value2 = preg_replace("/[^a-zA-Z0-9]/", "", $value2);

            $subject = $value2;
            $pattern = '/' . $value . '/i';

            // If there's a match, remove the album being compared to
            if (preg_match($pattern, $subject))
            {
                unset($array[$key2]);
            }
        }
    }
}

array_walk($array, 'compare');
$array = array_values($array);
echo "<pre>";
print_r($array);
echo "</pre>";
For your example string the above outputs (it shows 2 that you don't want shown):
Array
(
[0] => Band of Horses - Is There a Ghost
[1] => Band Of Horses - No One's Gonna Love You
[2] => Band of Horses - The Funeral
[3] => Band of Horses - Laredo
[4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
[5] => Band of Horses perform Marry Song at Tromso Wedding // <== Oops
[6] => 'Laredo' by Band of Horses on Q TV // <== Oops
[7] => Band of Horses, On My Way Back Home
[8] => Band of Horses - cigarettes wedding bands
[9] => Band Of Horses - I Go To The Barn Because I Like The
[10] => Our Swords - Band of Horses
[11] => Band Of Horses - "Marry song"
[12] => Band of Horses - Monsters
)
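The two "Oops" entries most likely survive because a substring test is sensitive to word order: once collapsed, neither name contains the other. A small sketch of the check on its own shows this. The helper names collapse() and containsAlbum() are made up for the example, the "(Live)" title is hypothetical, and strpos() on lowercased strings stands in for the regex.

```php
<?php
// Hypothetical helpers illustrating the substring test in isolation.
function collapse(string $name): string
{
    // Remove everything except letters and digits, then lowercase.
    return strtolower(preg_replace('/[^a-zA-Z0-9]/', '', $name));
}

function containsAlbum(string $longer, string $shorter): bool
{
    $needle = collapse($shorter);
    // An empty needle would "match" every album, so reject it explicitly.
    return $needle !== '' && strpos(collapse($longer), $needle) !== false;
}

// Same words, different order: the collapsed strings do not contain each other.
var_dump(containsAlbum("'Laredo' by Band of Horses on Q TV", 'Band of Horses - Laredo'));
// bool(false)

// Same order with extra text appended: this one is caught.
var_dump(containsAlbum('Band of Horses - Laredo (Live)', 'Band of Horses - Laredo'));
// bool(true)
```

That is the trade-off against the similar_text() version: no threshold to tune, but reordered or reworded titles slip through.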