tags:

views:

131

answers:

2

i have a list

listcdtitles = 

["""    Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka)   """,
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro)  """,
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  """,
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams)   """,
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    """,
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """,
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """,
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)    """,
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """,
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """,
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above)  """,
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine)    """,
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """,
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """,
..............................]

i have about 14,000 elements in this list

i would like to bunch up those strings together which have similar words.

any ideas on how to do this? i dont think there is a right/wrong way

thank you so much for any advice

+1  A: 

First of all, parse all that and associate each token to a frequency. Tokens with high frequency will have to be blacklisted.

Then you'll have to compare strings, iterating over them, and associating the tuple to a distance score. According to this score, you'll concatenate them - or not.

That would be a simple method to do that.

Guillaume Lebourgeois
+2  A: 

I'm a newbie to python language but I've written a sample code that calculates similarity scores between entries in that list.

The code is as follows.

import re
import array

listcdtitles = ["""    Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka)   """,
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro)  """,
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  """,
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams)   """,
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    """,
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """,
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """,
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)    """,
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """,
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """,
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above)  """,
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine)    """,
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """,
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """]

entryDictionary = {}
i=0
for entry in listcdtitles:
    #remove unnecessary characters from the string
    entry=re.sub(r'[^\w ]', '', entry.lower(), flags=re.IGNORECASE)
    #split the entry into words and store it in the 
    entryDictionary[i]=entry.split(" ")
    i=i+1
# print the dictionary
print("Entries")
print(entryDictionary)

# define a score matrix, compare the words in each entry and if
# a word is same in both entries, that is one point
scoreMatrix = []
for k in range(i):
    scoreMatrix.append([])
    for j in range (i):
        if j>k:
            scoreMatrix[k].append(0)
        else:
            scoreMatrix[k].append("-")
k=0
j=0

for k in range(i-1):
    entry1 = entryDictionary[k]
    for j in range(k+1,i):
        entry2 = entryDictionary[j]
        for kk in range(len(entry1)):
            for jj in range(len(entry2)):
                if entry1[kk] != "" and entry1[kk] == entry2[jj]:
                    scoreMatrix[k][j] = scoreMatrix[k][j] + 1

print "Score Matrix (Higher numbers denote heigher similarity between two entries"

print repr("").rjust(10),
for k in range(i-1):
    print repr("Entry " + str(k)).rjust(10),
print repr("Entry " + str(i-1)).rjust(10)

for k in range(i):
    scoreMatrix.append([])
    print repr("Entry " + str(k)).rjust(10),
    for j in range (i-1):
        print repr(scoreMatrix[k][j]).rjust(10),
    print repr(scoreMatrix[k][i-1]).rjust(10)

The result is as follows: Score Matrix (Higher numbers denote higher similarity between two entries

        ''  'Entry 0'  'Entry 1'  'Entry 2'  'Entry 3'  'Entry 4'  'Entry 5'  'Entry 6'  'Entry 7'  'Entry 8'  'Entry 9' 'Entry 10' 'Entry 11' 'Entry 12' 'Entry 13'
 'Entry 0'        '-'          2          3          2          0          1          1          0          1          1          0          0          0          0
 'Entry 1'        '-'        '-'          0          0          0          0         11          0          2          5          0          0          0          0
 'Entry 2'        '-'        '-'        '-'          3          0          1          0          1          0          0          0          0          0          0
 'Entry 3'        '-'        '-'        '-'        '-'          0          4          0          2          0          0          2          0          0          0
 'Entry 4'        '-'        '-'        '-'        '-'        '-'          0          1          0          0          0          0          1          0          0
 'Entry 5'        '-'        '-'        '-'        '-'        '-'        '-'          0          3          1          0          1          1          0          0
 'Entry 6'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          2          5          0          1          0          0
 'Entry 7'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0          0          0          0
 'Entry 8'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          2          0          0          0          0
 'Entry 9'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0          0
'Entry 10'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0
'Entry 11'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0
'Entry 12'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          2
'Entry 13'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'
Zafer
hey zafer this is pretty awesome of you. but i dont really understand what this does?? can u please explain
I__
Hi,I updated the code. There were some bugs in it. Now you can see the matrix. There you'll see the similarity scores between entries.How a sim. score is calculated is as follows:entry 1:'a',B,Dentry 2:A,c,XScore is 1 because of 'a' and A.That's a simple approach. I hope it helps.
Zafer
I've stripped entries with the following line:entry=re.sub(r'[^\w ]', '', entry.lower(), flags=re.IGNORECASE)So that algorithm just uses words for calculation. Exclusion of some words like "and" can be added to improve the quality.
Zafer