tags:
views: 105
answers: 3

I have ugly strings that look like this:

    string1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
    string2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"

I would like some kind of library/algorithm that will give me a percentage of how many words they have in common, but I want to exclude special characters like ',', ':', ''', '{', etc.

I know the Levenshtein algorithm exists, but it compares how many CHARACTERS differ; I would like to compare how many WORDS the strings have in common.

+2  A: 
n = 0
words1 = set(sentence1.split())
for word in sentence2.split():
    # strip some chars here, e.g. as in [1]
    if word in words1:
        n += 1

(1: http://stackoverflow.com/questions/875968/how-to-remove-symbols-from-a-string-with-python)

Edit: Note that this considers a word common to both sentences if it appears anywhere in each. To compare words position by position instead, omit the set conversion (just call split() on both) and use something like:

n = 0
for word_from_1, word_from_2 in zip(sentence1.split(), sentence2.split()):
    # strip some chars here, e.g. as in [1]
    if word_from_1 == word_from_2:
        n += 1
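
Putting the pieces together, a self-contained sketch of this counting approach (the punctuation stripping via `string.punctuation` and the choice of denominator are my assumptions, not part of the original answer):

```python
import string

def common_word_percentage(sentence1, sentence2):
    """Percentage of words from sentence2 that also appear in sentence1."""
    def clean(word):
        # Strip surrounding punctuation and lowercase for comparison
        return word.strip(string.punctuation).lower()

    words1 = set(clean(w) for w in sentence1.split())
    words2 = [clean(w) for w in sentence2.split()]
    n = sum(1 for w in words2 if w in words1)
    return 100.0 * n / len(words2) if words2 else 0.0

print(common_word_percentage("Hello, world: example", "hello WORLD test"))
```

Note that this only strips punctuation at word boundaries; tokens like "C.Straus" would need a regex-based split instead.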
delnan
which library??
I__
Huh? This uses only built-in functions that are available without importing anything.
delnan
+3  A: 

Regex could easily give you all the words:

import re
s1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
s2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"
s1w = re.findall(r'\w+', s1.lower())
s2w = re.findall(r'\w+', s2.lower())

collections.Counter (Python 2.7+) can quickly count up the number of times a word occurs.

from collections import Counter
s1cnt = Counter(s1w)
s2cnt = Counter(s2w)
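
For instance, `&` on two Counters gives the multiset intersection, so the number of shared word occurrences falls out directly (the sample strings below are mine, for illustration):

```python
from collections import Counter

a = Counter("the cat sat on the mat".split())
b = Counter("the dog sat by the door".split())

shared = a & b  # multiset intersection: minimum of the two counts per word
print(shared)                # Counter({'the': 2, 'sat': 1})
print(sum(shared.values()))  # 3 shared word occurrences
```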

A very crude comparison could be done through set.intersection or difflib.SequenceMatcher, but it sounds like you would want to implement a Levenshtein algorithm that deals with words, where you could use those two lists.

common = set(s1w).intersection(s2w) 
# returns set(['c'])

import difflib
common_ratio = difflib.SequenceMatcher(None, s1w, s2w).ratio()
print '%.1f%% of words common.' % (100*common_ratio)

Prints: 3.4% of words common.

Nick T
+1 mainly for collections.Counter - another hidden gem of the stdlib. Sadly it's 2.7, so it may not be applicable.
delnan
+2  A: 

The Levenshtein algorithm itself isn't restricted to comparing characters; it can compare any arbitrary objects. The fact that the classical form uses characters is an implementation detail: the elements can be any symbols or constructs that can be compared for equality.

In Python, convert the strings into lists of words, then apply the algorithm to those lists. Maybe someone else can help you with cleaning up unwanted characters, presumably using some regular expression magic.
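
As a sketch of that idea (the regex tokenization and this particular dynamic-programming implementation are my own, not from the answer):

```python
import re

def word_levenshtein(s1, s2):
    """Levenshtein distance between two strings, counted in words."""
    a = re.findall(r'\w+', s1.lower())
    b = re.findall(r'\w+', s2.lower())
    # Classic edit-distance DP, but over word lists instead of characters
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(word_levenshtein("the quick brown fox", "the slow brown fox"))  # 1
```

Dividing the distance by the length of the longer word list would turn this into the percentage the question asks for.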

Simon Hibbs