tags:
views: 105
answers: 3

I have ugly strings that look like this:

    string1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
    string2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"

I would like some kind of library/algorithm that will give me a percentage of how many words they have in common, but I want to exclude special characters like ',', ':', ''', '{', etc.

I know the Levenshtein algorithm exists, but it compares how many CHARACTERS differ; I would like to compare how many WORDS the strings have in common.

+2  A: 
n = 0
words1 = set(sentence1.split())
for word in sentence2.split():
    # strip some chars here, e.g. as in [1]
    if word in words1:
        n += 1

(1: http://stackoverflow.com/questions/875968/how-to-remove-symbols-from-a-string-with-python)

Edit: Note that this considers a word common to both sentences if it appears anywhere in each. To compare words position by position instead, omit the set conversion (just call split() on both) and use something like:

n = 0
for word_from_1, word_from_2 in zip(sentence1.split(), sentence2.split()):
    # strip some chars here, e.g. as in [1]
    if word_from_1 == word_from_2:
        n += 1
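
Putting the pieces together, a self-contained sketch of this counting approach (the punctuation stripping via `string.punctuation` and the choice of denominator are my assumptions, not part of the original answer):

```python
import string

def common_word_percentage(sentence1, sentence2):
    """Percentage of words from sentence2 that also appear in sentence1."""
    def clean(word):
        # Strip surrounding punctuation and lowercase for comparison
        return word.strip(string.punctuation).lower()

    words1 = set(clean(w) for w in sentence1.split())
    words2 = [clean(w) for w in sentence2.split()]
    n = sum(1 for w in words2 if w in words1)
    return 100.0 * n / len(words2) if words2 else 0.0

print(common_word_percentage("Hello, world: example", "hello WORLD test"))
```

Note that this only strips punctuation at word boundaries; tokens like "C.Straus" would need a regex-based split instead.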
delnan
which library??
I__
Huh? This uses only built-in functions that are available without importing anything.
delnan
+3  A: 

Regex could easily give you all the words:

import re
s1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
s2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"
s1w = re.findall(r'\w+', s1.lower())
s2w = re.findall(r'\w+', s2.lower())

collections.Counter (Python 2.7+) can quickly count up the number of times a word occurs.

from collections import Counter
s1cnt = Counter(s1w)
s2cnt = Counter(s2w)
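
For instance, `&` on two Counters gives the multiset intersection, so the number of shared word occurrences falls out directly (the sample strings below are mine, for illustration):

```python
from collections import Counter

a = Counter("the cat sat on the mat".split())
b = Counter("the dog sat by the door".split())

shared = a & b  # multiset intersection: minimum of the two counts per word
print(shared)                # Counter({'the': 2, 'sat': 1})
print(sum(shared.values()))  # 3 shared word occurrences
```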

A very crude comparison could be done through set.intersection or difflib.SequenceMatcher, but it sounds like you would want to implement a Levenshtein algorithm that deals with words, where you could use those two lists.

common = set(s1w).intersection(s2w) 
# returns set(['c'])

import difflib
common_ratio = difflib.SequenceMatcher(None, s1w, s2w).ratio()
print '%.1f%% of words common.' % (100*common_ratio)

Prints: 3.4% of words common.

Nick T
+1 mainly for collections.Counter - another hidden gem of the stdlib. Sadly it's 2.7, so it may not be applicable.
delnan
+2  A: 

The Levenshtein algorithm itself isn't restricted to comparing characters; it can compare any arbitrary objects. The fact that the classical form uses characters is an implementation detail: the elements can be any symbols or constructs that can be compared for equality.

In Python, convert the strings into lists of words, then apply the algorithm to those lists. Maybe someone else can help you with cleaning up unwanted characters, presumably using some regular expression magic.
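
As a sketch of that idea (the regex tokenization and this particular dynamic-programming implementation are my own, not from the answer):

```python
import re

def word_levenshtein(s1, s2):
    """Levenshtein distance between two strings, counted in words."""
    a = re.findall(r'\w+', s1.lower())
    b = re.findall(r'\w+', s2.lower())
    # Classic edit-distance DP, but over word lists instead of characters
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(word_levenshtein("the quick brown fox", "the slow brown fox"))  # 1
```

Dividing the distance by the length of the longer word list would turn this into the percentage the question asks for.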

Simon Hibbs