Is there a good library that can detect and split the words in a combined string?
Example:
"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]
I don't know of any library for it, but it shouldn't be hard to implement basic functionality.
I don't know a library that does this, but it's not too hard to write if you have a list of words:
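# Load the word list (whitespace-separated words).
with open('words.txt') as f:
    wordList = f.read().split()
words = set(s.lower() for s in wordList)

def splitString(s):
    found = []

    def rec(stringLeft, wordsSoFar):
        # Nothing left to split: wordsSoFar is a complete solution.
        if not stringLeft:
            found.append(wordsSoFar)
        # Try every prefix that is a known word, then recurse on the rest.
        for pos in range(1, len(stringLeft) + 1):
            if stringLeft[:pos] in words:
                rec(stringLeft[pos:], wordsSoFar + [stringLeft[:pos]])

    rec(s.lower(), [])
    return found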
This will return all possible ways to split the string into the given words.
Example:
>>> splitString('filesaveas')
[['file', 'save', 'as'], ['files', 'ave', 'as']]
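For trying this out without a words.txt file, here is a minimal sketch of the same exhaustive search written as a generator. The names iter_splits and SAMPLE_WORDS are made up for illustration, and the tiny word set is an assumption standing in for a real word list.

```python
def iter_splits(text, words):
    """Yield every way to split text into entries of the set words."""
    if not text:
        # The whole string has been consumed: one complete (empty) tail.
        yield []
        return
    for pos in range(1, len(text) + 1):
        head = text[:pos]
        if head in words:
            # Extend each split of the remainder with the matched prefix.
            for tail in iter_splits(text[pos:], words):
                yield [head] + tail


# Hypothetical word set, only large enough for the two examples.
SAMPLE_WORDS = {"cd", "image", "file", "files", "save", "ave", "as"}

print(list(iter_splits("filesaveas", SAMPLE_WORDS)))
print(list(iter_splits("cdimage", SAMPLE_WORDS)))
```

Because it is a generator, you can stop after the first split instead of collecting all of them. Like the version above, it can take exponential time on long inputs with many overlapping words.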
If you are doing this for work rather than for fun, my advice is to tackle the problem at the source. Why are these strings combined like that? Where do they come from? If possible, insert the spaces at the point where the strings are produced.
Here's a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You'll have to find a real wordlist, but I included some made-up frequencies for a simple test.
WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555,
}

def split_text(text, word_frequencies, cache):
    # Return (likelihood, words) for the most likely split of text.
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in range(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word)
        if freq:
            # Score this split as freq(word) * best score of the remainder.
            remainder_freq, remainder_split = split_text(
                remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_split
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))
--> (1.3653e-08, ['file', 'save', 'as'])
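If you don't have a frequency list at hand, one option is to count words in some sample text. The sketch below is only an illustration under stated assumptions: build_frequencies, make_splitter and SAMPLE_CORPUS are hypothetical names, the corpus is a toy string, and the scoring sums log-frequencies instead of multiplying raw frequencies (a swapped-in variant that avoids underflow on long inputs, not the exact method above).

```python
import math
import re
from collections import Counter
from functools import lru_cache


def build_frequencies(corpus_text):
    """Map each lowercase word in corpus_text to its relative frequency."""
    counts = Counter(re.findall(r"[a-z]+", corpus_text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def make_splitter(word_frequencies):
    """Return a function that finds the most likely split of a string."""
    log_freq = {w: math.log(f) for w, f in word_frequencies.items() if f > 0}

    @lru_cache(maxsize=None)
    def best(text):
        # Returns (log-score, tuple of words); -inf means "no valid split".
        if not text:
            return 0.0, ()
        best_score, best_words = -math.inf, ()
        for i in range(1, len(text) + 1):
            word = text[:i]
            if word in log_freq:
                rest_score, rest_words = best(text[i:])
                score = log_freq[word] + rest_score
                if score > best_score:
                    best_score, best_words = score, (word,) + rest_words
        return best_score, best_words

    return lambda text: list(best(text.lower())[1])


# Toy corpus (an assumption); a real one should be far larger.
SAMPLE_CORPUS = "save the file as a copy . file save as . save as . the files"
split = make_splitter(build_frequencies(SAMPLE_CORPUS))
print(split("filesaveas"))
```

With this toy corpus, "ave" never occurs, so the only valid split is file / save / as. An input that cannot be split into known words yields an empty list.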