Hello everyone, I am trying to process various texts with regular expressions and Python's NLTK (the book is at http://www.nltk.org/book). I am trying to create a random text generator and I am stuck on one problem. First, here is my algorithm:

  1. Enter a sentence as input (this is called the trigger string).

  2. Get the longest word in the trigger string.

  3. Search the entire Project Gutenberg database for sentences that contain this word, ignoring case.

  4. Return the longest sentence that contains the word from step 2.

  5. Append the sentence from step 4 to the sentence from step 1.

  6. Repeat the process: get the longest word in the second sentence, search again, and so on.
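The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: SENTS is a tiny made-up stand-in for gutenberg.sents(), the function names are hypothetical, and skipping sentences that were already used is an extra rule that is not part of the steps as written (without it the loop can keep landing on the same sentence).

```python
import re

# Toy stand-in for gutenberg.sents(): a list of tokenized sentences (assumption).
SENTS = [
    ["The", "Tragedie", "of", "Macbeth"],
    ["Macbeth", "shall", "never", "vanquished", "be"],
    ["A", "tragedie", "most", "lamentable", "and", "long"],
]

def longest_word(tokens):
    """Step 2: longest alphabetic token; the first one wins on ties."""
    words = [t for t in tokens if t.isalpha()]
    return max(words, key=len) if words else None

def longest_sentence_index(word, sents, used):
    """Steps 3-4: index of the longest unused sentence containing word, ignoring case."""
    lw = word.lower()
    hits = [i for i, s in enumerate(sents)
            if i not in used and any(lw == w.lower() for w in s)]
    return max(hits, key=lambda i: len(sents[i])) if hits else None

def generate(trigger, sents, rounds=3):
    """Steps 1, 5 and 6: chain sentences, starting from the trigger string."""
    pieces, used = [trigger], set()
    tokens = re.findall(r"\w+", trigger)
    for _ in range(rounds):
        word = longest_word(tokens)
        if word is None:
            break
        idx = longest_sentence_index(word, sents, used)
        if idx is None:  # nothing new contains the word, so stop
            break
        used.add(idx)  # extra rule (assumption): never reuse a sentence
        tokens = sents[idx]
        pieces.append(" ".join(tokens))
    return " ".join(pieces)

print(generate("a sad tragedie", SENTS))
```

With the real corpus, SENTS would presumably be replaced by gutenberg.sents().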

So far I have been able to do this for the first two sentences, but I cannot perform a case-insensitive search. The entire sentence database of Project Gutenberg is available via the gutenberg.sents() function, but a case-insensitive regex search is practically impossible because gutenberg.sents() returns the sentences of each book as a list of lists of tokens, like this:

EXAMPLE: all the sentences of Shakespeare's Macbeth are retrieved by typing

import nltk
from nltk.corpus import gutenberg
gutenberg.sents('shakespeare-macbeth.txt')

into the Python shell; the output is:

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], 
['Actus', 'Primus', '.'], .......] 

with "[The Tragedie of Macbeth by William Shakespeare 1603]" and "Actus Primus." being the first two sentences.
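If a regex is still wanted, one possible workaround is to join each token list back into a string and search that with re.IGNORECASE. The sketch below assumes the list-of-lists shape shown above; the sents sample and the name sentences_matching are made up for illustration, and this is not necessarily the best approach.

```python
import re

# Small sample in the same shape as the gutenberg.sents() output shown above.
sents = [
    ["[", "The", "Tragedie", "of", "Macbeth", "by", "William",
     "Shakespeare", "1603", "]"],
    ["Actus", "Primus", "."],
]

def sentences_matching(word, sents):
    """Return the token lists whose joined text contains word as a whole word."""
    # re.escape guards against regex metacharacters in the word;
    # \b keeps "mac" from matching inside "Macbeth".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sents if pattern.search(" ".join(s))]

print(sentences_matching("macbeth", sents))
```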

How can I find the word I'm looking for regardless of whether it is uppercase or lowercase? I badly need help, since I have been tinkering with this for the past two days and it's starting to wear on my nerves. Thanks a lot.

+2  A: 

Given a list L of words, and a target word t,

any(t.lower()==w.lower() for w in L)

tells you whether L has word t in a case-insensitive way. It's faster, of course, to do

lt = t.lower()
any(lt==w.lower() for w in L)

since Python does not "hoist" the constant computation out of the loop and, unless you hoist it yourself, it will be performed repeatedly.
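As a quick check of the idea, here is a small sketch that wraps the hoisted version in a function. The name contains_word and the sample list are made up for illustration.

```python
def contains_word(words, target):
    """True if target occurs in words, ignoring case."""
    lt = target.lower()  # lowercased once, outside the loop
    return any(lt == w.lower() for w in words)

sentence = ["Actus", "Primus", "."]
print(contains_word(sentence, "primus"))   # the word is there, in a different case
print(contains_word(sentence, "macbeth"))  # the word is absent
```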

Given a list of lists lol, the longest sub-list including t can be found by

longest = max((L for L in lol if any(lt==w.lower() for w in L)), key=len)

If multiple sub-lists include t and are of the same maximal length, this will give you the first one, as it happens.
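One caveat: if no sub-list contains t, max() is handed an empty sequence and raises ValueError. A hedged sketch of one way around that, assuming Python 3.4 or later (where max() accepts default=); the function name and the sample data are made up for illustration:

```python
def longest_sentence_with(lol, t):
    """Longest sub-list of lol containing t (ignoring case), or None if there is none."""
    lt = t.lower()
    matches = (L for L in lol if any(lt == w.lower() for w in L))
    # default=None avoids the ValueError that max() raises on an empty sequence.
    return max(matches, key=len, default=None)

lol = [
    ["The", "Tragedie", "of", "Macbeth"],
    ["Actus", "Primus", "."],
    ["macbeth", "is", "here", "now"],
]
print(longest_sentence_with(lol, "MACBETH"))  # two candidates of equal length; the first wins
print(longest_sentence_with(lol, "hamlet"))   # no match, so None
```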

Alex Martelli
Many thanks for helping me save my sanity sir, I really appreciate your help.
sarevok
@sarevok, you're welcome!
Alex Martelli
@Alex Martelli, hello again sir. I applied your code, but I still have not been able to get the flow from my question working. I put the entire code in a while loop (setting a variable x = 1, then doing while x:) so that the flow would continue. However, after retrieving the longest sentence that has the longest word, it just keeps printing the same sentence instead of different ones, despite assigning "longest" to the new sentence. What should I do? By the way, I converted the list 'longest' to a sentence using .join so that it would act like a regular string.
sarevok
@sarevok, it's impossible to debug the bugs in your code from the way you can describe them in a comment. Please edit your question, or better (since I did answer _this_ one;-), close this question and open another showing the code exactly as you are now trying it.
Alex Martelli
@Alex Martelli, I did as you asked, the question can be found here: http://stackoverflow.com/questions/3571887/python-code-flow-does-not-work-as-expected :)
sarevok
@sarevok, I see you already got an answer there. In any case, since you chose to not close out this question by accepting my answer (despite your enormously thankful initial comment, and the fact that I _have_ answered your question as you had posed it), I must deduce that anything I answer just can't satisfy you, and therefore stop wasting my time looking at your questions until and unless you prove otherwise. Seriously: the **core** principle of stack overflow etiquette is to accept an appropriate and correct answer to your question. Thanks are cheap, acceptances are key.
Alex Martelli
@Alex Martelli, I opened that post so that you could view my code. I admit that I am a noob about forum etiquette and that I have been busy with my main summer internship project along with this Python side project. I did not sign up for this site initially because I was too busy. I have been in this world long enough to know that thanks are cheap and, of course, I will select your answer as the most helpful, since it is, right after I activate my account.
sarevok
@Alex Martelli, also, my apologies if I sounded rude in my last post. By busy I meant that I was in such a hurry that I did not 'bother' with the OpenID process.
sarevok
@sarevok, my apologies in turn if I sounded abrupt -- I had of course no way of knowing you had excellent reasons for neglecting to accept (there's a lot of that going around;0). Let me look at your other Q to see if I can yet add anything to its As (I doubt it, as I see there's several).
Alex Martelli
A: 

How about using the built-in string method str.lower()? It returns a copy of the string converted to lowercase.

Then just compare the strings.
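A minimal sketch of that comparison; the helper name same_word is made up for illustration. As an aside, and assuming Python 3.3 or later, str.casefold() is a more aggressive variant that also handles cases such as the German "ß".

```python
def same_word(a, b):
    """Compare two strings, ignoring case."""
    return a.lower() == b.lower()

print(same_word("Macbeth", "MACBETH"))  # same word, different case
print(same_word("Macbeth", "Macduff"))  # different words

# lower() leaves "ß" unchanged, while casefold() maps it to "ss".
print("Straße".lower() == "STRASSE".lower())
print("Straße".casefold() == "STRASSE".casefold())
```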

a2j