200000 is not that much, you can do this
- Split each string to get tokens
e.g. "News CA BLAH" -> ["Blah", "CA", "News"]
- create a dict entry each length of list e.g. in case of ["Blah", "CA", "News"] all combinations in order
- Now just loop thru the dict and see the groups
example code:
data="""AB 500
Bus AB 500
News CA
News CA BLAH"""
def getCombinations(tokens):
count = len(tokens)
for L in range(1,count+1):
for i in range(count-L+1):
yield tuple(tokens[i:i+L])
groupDict = {}
for s in data.split("\n"):
tokens = s.split()
for groupKey in getCombinations(tokens):
if groupKey not in groupDict:
groupDict[groupKey] = [s]
else:
groupDict[groupKey].append(s)
for group, values in groupDict.iteritems():
if len(values) > 1:
print group, "->", values
it outputs:
('News', 'CA') -> ['News CA', 'News CA BLAH']
('AB',) -> ['AB 500', 'Bus AB 500']
('500',) -> ['AB 500', 'Bus AB 500']
('CA',) -> ['News CA', 'News CA BLAH']
('AB', '500') -> ['AB 500', 'Bus AB 500']
('News',) -> ['News CA', 'News CA BLAH']