tags:

views:

102

answers:

5

Let say I have an array like this:

  • Band of Horses - Is There a Ghost

  • Band Of Horses - No One's Gonna Love You

  • Band of Horses - The Funeral

  • Band of Horses - The Funeral (lyrics in description)

  • Band of Horses - Laredo

  • Band Of Horses - Laredo on Letterman 5.20.10

  • Band of Horses - "The Great Salt Lake" Sub Pop Records

  • Band Of Horses - "No One's Gonna Love You"

  • Band of Horses perform Marry Song at Tromso Wedding

  • Band Of Horses - No One's Gonna Love You

  • 'Laredo' by Band of Horses on Q TV

  • Band of Horses, On My Way Back Home

  • Band of Horses - cigarettes wedding bands

  • Band Of Horses - "Cigarettes Wedding Bands"

  • Band Of Horses - I Go To The Barn Because I Like The

  • Our Swords - Band of Horses

  • Band Of Horses - "Marry song"

  • Band of Horses - Monsters

  • Band Of Horses - No One's Gonna Love You

The New Array would have:

  • Band of Horses - Is There a Ghost

  • Band Of Horses - No One's Gonna Love You

  • Band of Horses - The Funeral

  • Band of Horses - Laredo

  • Band of Horses - "The Great Salt Lake" Sub Pop Records

  • Band of Horses, On My Way Back Home

  • Band of Horses - cigarettes wedding bands

  • Band Of Horses - I Go To The Barn Because I Like The

  • Our Swords - Band of Horses

  • Band Of Horses - "Marry song"

  • Band of Horses - Monsters

How would you go about comparing each string to every other string in the list in PHP and if they are similar, remove them.

I consider these similar:

  • Band of Horses - The Funeral
  • Band of Horses - The Funeral (lyrics in description)

Another example:

  • Band of Horses - Laredo

  • Band Of Horses - Laredo on Letterman 5.20.10

A: 

Tak a look at array_unique() function:

http://php.net/manual/en/function.array-unique.php

Tomasz Kowalczyk
It seems like this function is looking for exact matches.
Cody.Stewart
yes, i thought it will make your day. if not, take a look at another function: http://php.net/manual/en/function.levenshtein.php - it will calculate similarity and depending on that information you can implement your own function comparing and removing "similar duplicates".
Tomasz Kowalczyk
Meh - What about the example - `Band Of Horses - No One's Gonna Love You` and `Band Of Horses - "No One's Gonna Love You"` should be recognized as the same.
Peter Ajtai
Maybe I can count each word in an array item then see how many words match up exactly in the next item. Seems like A LOT of looping.
Cody.Stewart
+2  A: 

You might want to try similar_text, maybe in combination with levenshtein, and through experimentation establish a threshold of what score you consider to be similar enough. Also look at the user discussions for some more hints. You can then loop through the array, compare each element to each other element and remove elements that you deem too similar.

I hope this serves as a start for you. The problem is rather complex, since there are many things that may be considered to have the same content, but have completely different syntax ("Our Swords - Band of Horses" vs. "Band of Horses - Our Swords"). It depends if this rather simplistic solution is good enough for what you're trying to do.

deceze
Nice functions. Didn't know about either.
Peter Ajtai
deceze
+8  A: 

You have multiple options.

For each option you should probably massage the album names before performing the comparisons. You can do this by stripping punctuation, sorting the words in the album name alphabetically (in certain cases), etc.

In each case, when you do a comparison, if you remove one of the album names from the array, then your comparison is order sensitive, unless you make a rule as to which album name to remove. So, it probably makes sense to always remove the longer album name if two album names are compared and found to be "similar."

The main comparison options are

  1. Simple substring comparisons. Check if an album name is inside another. Strip punctuation first and compare case insensitively (see my second code snippet below).

  2. Check album name similarity using levenshtein(). This string comparison is more efficient then similar_text(). You should strip punctuation and order words alphabetically.

  3. Check album name similarity using similar_text(). I had the most luck with this method. In fact I got it to pick the exact album names you wanted (see first code snippet below).

  4. There are various other string comparison functions you can play around with including soundex() and metaphone()

Anyway... here are 2 solutions.

The first uses similar_text()... but it calculates the similarity only after all punctuation has been stripped and words put into alphabetical order and lowercased...... the downside is you have to play around with the threshold similarities... The second uses a simple case insensitive substring test after all punctuation and white space is stripped.

The way both code snippets work is that they use array_walk() to run the compare() function on each album in the array. Then inside the compare() function, I use foreach() to compare the current album to all the other albums. There's ample room to make things more efficient.

Note that I should be using the 3rd argument as a reference in array_walk can someone help me do this? The current work around is a global variable:


Live example (69% similarity threshold)


function compare($value, $key)
{
    global $array; // Should use 3rd argument of compare instead

    $value = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value));
    $value = explode(" ", $value);
    sort($value);
    $value = implode($value);
    $value = preg_replace("/[\s]/", "", $value); // Remove any leftover \s

    foreach($array as $key2 => $value2)
    {
        if ($key != $key2)
        {
            // collapse, and lower case the string            
            $value2 = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value2));
            $value2 = explode(" ", $value2);
            sort($value2);
            $value2 = implode($value2);            
            $value2 = preg_replace("/[\s]/", "", $value2);

              // Set up the similarity
            similar_text($value, $value2, $sim);
            if ($sim > 69)
            {     // Remove the longer album name
                unset($array[ ((strlen($value) > strlen($value2))?$key:$key2) ]);
            }
        }
    }
}
array_walk($array, 'compare');
$array = array_values($array);
print_r($array);

The output of the above is:

Array
(
    [0] => Band of Horses - Is There a Ghost
    [1] => Band Of Horses - No One's Gonna Love You
    [2] => Band of Horses - The Funeral
    [3] => Band of Horses - Laredo
    [4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
    [5] => Band of Horses perform Marry Song at Tromso Wedding
    [6] => Band of Horses, On My Way Back Home
    [7] => Band of Horses - cigarettes wedding bands
    [8] => Band Of Horses - I Go To The Barn Because I Like The
    [9] => Our Swords - Band of Horses
    [10] => Band of Horses - Monsters
)

Note that the short version of Mary's song is missing... so it must have been a false positive against something else, since the long version is still in the list..... but they are precisely the album names you wanted.


The substring method:

Live Example


function compare($value, $key)
{
      // I should be using &$array as a 3rd variable.
      // For some reason couldn't get that to work, so I do this instead.
    global $array;   
      // Take the current album name and remove all punctuation and white space
    $value = preg_replace("/[^a-zA-Z0-9]/", "", $value);        
      // Compare current album to all othes
    foreach($array as $key2 => $value2)
    {
        if ($key != $key2)
        {

              // collapse the album being compared to
            $value2 = preg_replace("/[^a-zA-Z0-9]/", "", $value2);

            $subject = $value2;
            $pattern = '/' . $value . '/i';

              // If there's a much remove the album being compared to
            if (preg_match($pattern, $subject))
            {
                unset($array[$key2]);
            }
        }
    }
}
array_walk($array, 'compare');
$array = array_values($array);
echo "<pre>";
print_r($array);
echo "</pre>";

For your example string the above outputs (it shows 2 that you don't want shown):

Array  
(  
    [0] => Band of Horses - Is There a Ghost  
    [1] => Band Of Horses - No One's Gonna Love You  
    [2] => Band of Horses - The Funeral  
    [3] => Band of Horses - Laredo  
    [4] => Band of Horses - "The Great Salt Lake" Sub Pop Records  
    [5] => Band of Horses perform Marry Song at Tromso Wedding      // <== Oops
    [6] => 'Laredo' by Band of Horses on Q TV                       // <== Oops  
    [7] => Band of Horses, On My Way Back Home  
    [8] => Band of Horses - cigarettes wedding bands  
    [9] => Band Of Horses - I Go To The Barn Because I Like The  
    [10] => Our Swords - Band of Horses  
    [11] => Band Of Horses - "Marry song"  
    [12] => Band of Horses - Monsters  
)
Peter Ajtai
+1: For effort.
shamittomar
A: 

Here's my (somewhat complicated?) solution.

It splits up the input strings into an array of words (getWords). Then, it compares them all with each other, grouping them by "equality" (titlesMatch), which doesn't care for word ordering. It stores arrays for reach group of matches so you can review similar titles.

Here's the script (assuming $array is the input):

function getWords($str) {
    // Remove non-alpha characters and split by spaces
    $normalized = preg_replace('/[^a-z0-9\s]/', '', strtolower($str));
    $words = preg_split('/\s+/', $normalized, -1, PREG_SPLIT_NO_EMPTY);

    return $words;
}

function titlesMatch($words1, $words2) {
    $intersection = array_intersect($words1, $words2);

    sort($words1);
    sort($words2);
    sort($intersection);

    return $intersection === $words1 || $intersection === $words2;
}

$wordedArray = array_map('getWords', $array);

$uniqueItems = array();

foreach ($wordedArray as $words1Index => $words1) {
    $isUnique = true;

    foreach ($uniqueItems as &$words2Indices) {
        foreach ($words2Indices as $words2Index) {
            if (titlesMatch($words1, $wordedArray[$words2Index])) {
                $words2Indices[] = $words1Index;
                $isUnique = false;

                break;
            }
        }
    }

    if ($isUnique) {
        $uniqueItems[] = array($words1Index);
    }
}

// Show the first matches as an example
foreach ($uniqueItems as $indices) {
    echo $array[$indices[0]] . "\n";
}

Output with your input data:

Band of Horses - Is There a Ghost
Band Of Horses - No One's Gonna Love You
Band of Horses - The Funeral
Band of Horses - Laredo
Band of Horses - "The Great Salt Lake" Sub Pop Records
Band of Horses perform Marry Song at Tromso Wedding
Band of Horses, On My Way Back Home
Band of Horses - cigarettes wedding bands
Band Of Horses - I Go To The Barn Because I Like The
Our Swords - Band of Horses
Band of Horses - Monsters

(Note: this looks O(n3) but it's really O(n2).)

strager
This one is order dependent. Sorting by shortest phrase first could help mitigate that. It also would fail on words that are misspelled or slightly different, but, of course, that makes it less prone to false positives as well...
konforce
A: 

The best implementation will depend greatly on your data. The more you know about your data, the better results you can get with the least amount of work. Anyway, here's a sample script I put together:

<?php
    $list = array(); # source data

    $groups = array();

    foreach ($list as $item)
    {
        $words = array_unique(explode(' ', trim(preg_replace('/[^a-z]+/', ' ', strtolower($item)))));

        $matches = array();

        foreach ($groups as $i => $group)
        {
            foreach ($group as $g)
            {
                if (count($words) < count($g['words']))
                {
                    $a = $words;
                    $b = $g['words'];
                }
                else
                {
                    $a = $g['words'];
                    $b = $words;
                }

                $c = 0;
                foreach ($a as $word1)
                {
                    foreach ($b as $word2)
                    {
                        if (levenshtein($word1, $word2) < 2)
                        {
                            ++$c;
                            break;
                        }
                    }
                }

                if ($c / count($a) > 0.85)
                {
                    $matches[] = $i;
                    continue 2;
                }
            }           
        }

        $me = array('item' => $item, 'words' => $words);

        if (!$matches)
            $groups[] = array($me);
        else
        {
            for ($i = 1; $i < count($matches); ++$i)
            {
                $groups[$matches[0]] = array_merge($groups[$matches[0]], $groups[$matches[$i]]);
                unset($groups[$matches[$i]]);
            }

            $groups[$matches[0]][] = $me;
        }
    }

    foreach ($groups as $group)
    {
        echo $group[0]['item']."\n";
        for ($i = 1; $i < count($group); ++$i)
            echo "\t".$group[$i]['item']."\n";
    }
?>

The output with your list:

Band of Horses - Is There a Ghost
Band Of Horses - No One's Gonna Love You
    Band Of Horses - "No One's Gonna Love You"
    Band Of Horses - No One's Gonna Love You
    Band Of Horses - No One's Gonna Love You
Band of Horses - The Funeral
    Band of Horses - The Funeral (lyrics in description)
Band of Horses - Laredo
    Band Of Horses - Laredo on Letterman 5.20.10
    'Laredo' by Band of Horses on Q TV
Band of Horses - "The Great Salt Lake" Sub Pop Records
Band of Horses perform Marry Song at Tromso Wedding
    Band Of Horses - "Marry song"
Band of Horses, On My Way Back Home
Band of Horses - cigarettes wedding bands
    Band Of Horses - "Cigarettes Wedding Bands"
Band Of Horses - I Go To The Barn Because I Like The
Our Swords - Band of Horses
Band of Horses - Monsters

The basic principle here is to group like list items together. Any new item coming in is compared against the existing groups. The shorter item is checked against the larger ones. If enough of the words (85%) are close enough (2 characters different), then it is considered a match, and is added to the list.

This might be good enough for you, if you tweak the parameters. Other things to consider: ignoring small words altogether, similar phrases, etc.

konforce