ansaurus

Question

Classify array of strings based on commonalities

Answer 1

+1 A:

Do you mean something like this?

>>> from collections import defaultdict
>>> L=["AB 500",
... "Bus AB 500",
... "News CA",
... "News CA BLAH"]
>>> d=defaultdict(list)
>>> for s in L:
...     for w in s.split():
...         d[w].append(s)
... 
>>> print d["News"]
['News CA', 'News CA BLAH']
>>> print d["CA"]
['News CA', 'News CA BLAH']
>>> print d["500"]
['AB 500', 'Bus AB 500']

gnibbler 2009-11-12 04:57:51

Answer 2

+2 A:

200000 is not that much, you can do this

Split each string to get tokens e.g. "News CA BLAH" -> ["Blah", "CA", "News"]
create a dict entry each length of list e.g. in case of ["Blah", "CA", "News"] all combinations in order
Now just loop thru the dict and see the groups

example code:

data="""AB 500
Bus AB 500
News CA
News CA BLAH"""

def getCombinations(tokens):
    count = len(tokens)
    for L in range(1,count+1):
        for i in range(count-L+1):
            yield tuple(tokens[i:i+L])

groupDict = {}
for s in data.split("\n"):
    tokens = s.split()
    for groupKey in getCombinations(tokens):
        if groupKey not in groupDict:
            groupDict[groupKey] = [s]
        else:
            groupDict[groupKey].append(s)

for group, values in groupDict.iteritems():
    if len(values) > 1:
        print group, "->", values

it outputs:

('News', 'CA') -> ['News CA', 'News CA BLAH']
('AB',) -> ['AB 500', 'Bus AB 500']
('500',) -> ['AB 500', 'Bus AB 500']
('CA',) -> ['News CA', 'News CA BLAH']
('AB', '500') -> ['AB 500', 'Bus AB 500']
('News',) -> ['News CA', 'News CA BLAH']

Anurag Uniyal 2009-11-12 04:58:55

Answer 3

+1 A:

Unless repetition of words is an important feature for your use case, I suggest sets. I.e.:

thestrings = [
"AB 500",
"Bus AB 500",
"News CA",
"News CA BLAH",
]

thesets = dict((s, set(s.split())) for s in thestrings)

similarities = dict()
for s in thestrings:
  for o in thestrings:
    if s>=o: continue
    sims = len(thesets[s] & thesets[o])
    if not sims: continue
    similarities[s, o] = sims

for s, o in sorted(similarities, similarities.get, reverse=True):
  print "%-16r %-16r %2d" % (s, o, similarities[s, o])

Is this close to what you're looking for? It does classify the 4 strings you give in the way you desire, but that's a very feeble sample, of course, so I'm double checking;-).

Alex Martelli 2009-11-12 05:30:10

This will be very very slow, as OP said 200000 strings, means looping 40k milllion times

Anurag Uniyal 2009-11-12 08:28:33

Yep, it's `O(n**2)`, but if it's close to what the OP desires then it's a simple base for optimizations, both algorithmic and heuristic ones. The question as it stands is just a bit under-determined;-).

Alex Martelli 2009-11-12 15:55:19

Answer 4

A:

What would happen, if the string "AB 500 News CA" is added to your list? Do the two groups of strings have to merge? If not, how to split up the list of strings and why?

A very general workflow for problems like this (if i understood it correctly) goes like this:

Get a list of candidate pairs via an inverted Index/All pairs similarity search/Simhashing
Calc some distance functions for each pair and combine them into a single weight
Each weighted pair ((a, b), weight) represents now an edge in a graph, which you can cluster into the "word-match groups" via hierarchical clustering/power iteration

ephes 2009-11-13 10:29:02

ansaurus

tags:

views:

answers:

Classify array of strings based on commonalities

related questions