I have a text file and I want to display all words that contains both z and x characters.
How can I do that ?
I have a text file and I want to display all words that contains both z and x characters.
How can I do that ?
Sounds like a job for Regular Expressions. Read that and try it out. If you run into problems, update your question and we can help you with the specifics.
Assuming you have the entire file as one large string in memory, and that the definition of a word is "a contiguous sequence of letters", then you could do something like this:
import re
for word in re.findall(r"\w+", mystring):
if 'x' in word and 'z' in word:
print word
>>> import re
>>> pattern = re.compile('\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
>>> document = '''Here is some data that needs
... to be searched for words that contain both z
... and x. Blah xz zx blah jal akle asdke asdxskz
... zlkxlk blah bleh foo bar'''
>>> print pattern.findall(document)
['xz', 'zx', 'asdxskz', 'zlkxlk']
If you don't want to have 2 problems:
for word in file('myfile.txt').read().split():
if 'x' in word and 'z' in word:
print word
>>> import re
>>> print re.findall('(\w*x\w*z\w*|\w*z\w*x\w*)', 'axbzc azb axb abc axzb')
['axbzc', 'axzb']
I do not know the performance of this generator, but for me this is the way:
from __future__ import print_function
import string
bookfile = '11.txt' # Alice in Wonderland
hunted = 'az' # in your case xz but there is none of those in this book
with open(bookfile) as thebook:
# read text of book and split from white space
print('\n'.join(set(word.lower().strip(string.punctuation)
for word in thebook.read().split()
if all(c in word.lower() for c in hunted))))
""" Output:
zealand
crazy
grazed
lizard's
organized
lazy
zigzag
lizard
lazily
gazing
""
"
I just want to point out how heavy-handed some of these regular expressions can be, in comparison to the simple string methods-based solution provided by Wooble.
Let's do some timings, shall we?
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import timeit
import re
import sys
WORD_RE_COMPILED = re.compile(r'\w+')
Z_RE_COMPILED = re.compile(r'(\b\w*z\w*\b)')
XZ_RE_COMPILED = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
##########################
# Tim Pietzcker's solution
# http://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3962876#3962876
#
def xz_re_word_find(text):
for word in re.findall(r'\w+', text):
if 'x' in word and 'z' in word:
print word
# Tim's solution, compiled
def xz_re_word_compiled_find(text):
pattern = re.compile(r'\w+')
for word in pattern.findall(text):
if 'x' in word and 'z' in word:
print word
# Tim's solution, with the RE pre-compiled so compilation doesn't get
# included in the search time
def xz_re_word_precompiled_find(text):
for word in WORD_RE_COMPILED.findall(text):
if 'x' in word and 'z' in word:
print word
################################
# Steven Rumbalski's solution #1
# (provided in the comment)
# http://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3963285#3963285
def xz_re_z_find(text):
for word in re.findall(r'(\b\w*z\w*\b)', text):
if 'x' in word:
print word
# Steven's solution #1 compiled
def xz_re_z_compiled_find(text):
pattern = re.compile(r'(\b\w*z\w*\b)')
for word in pattern.findall(text):
if 'x' in word:
print word
# Steven's solution #1 with the RE pre-compiled
def xz_re_z_precompiled_find(text):
for word in Z_RE_COMPILED.findall(text):
if 'x' in word:
print word
################################
# Steven Rumbalski's solution #2
# http://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3962934#3962934
def xz_re_xz_find(text):
for word in re.findall(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b', text):
print word
# Steven's solution #2 compiled
def xz_re_xz_compiled_find(text):
pattern = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
for word in pattern.findall(text):
print word
# Steven's solution #2 pre-compiled
def xz_re_xz_precompiled_find(text):
for word in XZ_RE_COMPILED.findall(text):
print word
#################################
# Wooble's simple string solution
def xz_str_find(text):
for word in text.split():
if 'x' in word and 'z' in word:
print word
functions = [
'xz_re_word_find',
'xz_re_word_compiled_find',
'xz_re_word_precompiled_find',
'xz_re_z_find',
'xz_re_z_compiled_find',
'xz_re_z_precompiled_find',
'xz_re_xz_find',
'xz_re_xz_compiled_find',
'xz_re_xz_precompiled_find',
'xz_str_find'
]
import_stuff = functions + [
'text',
'WORD_RE_COMPILED',
'Z_RE_COMPILED',
'XZ_RE_COMPILED'
]
if __name__ == '__main__':
text = open(sys.argv[1]).read()
timings = {}
setup = 'from __main__ import ' + ','.join(import_stuff)
for func in functions:
statement = func + '(text)'
timer = timeit.Timer(statement, setup)
min_time = min(timer.repeat(3, 10))
timings[func] = min_time
for func in functions:
print func + ":", timings[func], "seconds"
Running this script on a plaintext copy of Moby Dick obtained from Project Gutenberg, on Python 2.6, I get the following timings:
xz_re_word_find: 1.21829485893 seconds
xz_re_word_compiled_find: 1.42398715019 seconds
xz_re_word_precompiled_find: 1.40110301971 seconds
xz_re_z_find: 0.680151939392 seconds
xz_re_z_compiled_find: 0.673038005829 seconds
xz_re_z_precompiled_find: 0.673489093781 seconds
xz_re_xz_find: 1.11700701714 seconds
xz_re_xz_compiled_find: 1.12773990631 seconds
xz_re_xz_precompiled_find: 1.13285303116 seconds
xz_str_find: 0.590088844299 seconds
In Python 3.1 (after using 2to3 to fix the print statements), I get the following timings:
xz_re_word_find: 2.36110496521 seconds
xz_re_word_compiled_find: 2.34727501869 seconds
xz_re_word_precompiled_find: 2.32607793808 seconds
xz_re_z_find: 1.32204890251 seconds
xz_re_z_compiled_find: 1.34104800224 seconds
xz_re_z_precompiled_find: 1.34424304962 seconds
xz_re_xz_find: 2.33851099014 seconds
xz_re_xz_compiled_find: 2.29653286934 seconds
xz_re_xz_precompiled_find: 2.32416701317 seconds
xz_str_find: 0.656699895859 seconds
We can see that the regular expression-based functions tend to take twice as long to run as the string methods-based function in Python 2.6, and over 3 times as long in Python 3. The time difference is trivial for one-off parsing (nobody's going to miss those milliseconds), but for cases where the function must be called many times, the string methods-based approach is both simpler and faster.