Say I have a string of words: 'a b c d e f'. I want to generate a list of multi-word terms from this string.
Word order matters. The term 'f e d' shouldn't be generated from the above example.
Edit: Also, words should not be skipped. 'a c' or 'b d f' shouldn't be generated.
What I have right now:
doc = 'a b c d e f'
terms = []
one_before = None
two_before = None
for word in doc.split(None):
    # Single-word term.
    terms.append(word)
    # Two- and three-word terms ending at the current word.
    if one_before:
        terms.append(' '.join([one_before, word]))
    if two_before:
        terms.append(' '.join([two_before, one_before, word]))
    # Shift the window of previous words.
    two_before = one_before
    one_before = word
for term in terms:
    print(term)
Prints:
a
b
a b
c
b c
a b c
d
c d
b c d
e
d e
c d e
f
e f
d e f
How would I make this a recursive function so that I can pass it a variable maximum number of words per term?
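For reference, here is a minimal sketch of the behaviour I'm after, written with a nested loop instead of recursion. The function name `generate_terms` and the parameter `max_words` are placeholders I made up for illustration. I used a loop here because a function that recurses once per word would likely hit Python's default recursion limit on long documents.

```python
def generate_terms(doc, max_words=3):
    """Return every run of 1..max_words consecutive words.

    Terms are emitted in the same order as the hand-written loop:
    for each word, the shortest term ending at that word comes first.
    """
    words = doc.split()
    terms = []
    for end in range(len(words)):
        for size in range(1, max_words + 1):
            start = end - size + 1
            if start < 0:
                break  # not enough preceding words for a term this long
            terms.append(' '.join(words[start:end + 1]))
    return terms


for term in generate_terms('a b c d e f', max_words=3):
    print(term)
```

With `max_words=3` this should reproduce the output listed above. A larger value only adds longer terms, and word order and adjacency are preserved because every term is a contiguous slice of the word list.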
Application:
I'll be using this to generate multi-word terms from readable text in HTML documents. The overall goal is a latent semantic analysis of a large corpus (about two million documents). This is why keeping word order matters (Natural Language Processing and whatnot).