views:

98

answers:

5

Despite attempts to master grep and related GNU software, I haven't come close to mastering regular expressions. I do like them, but I find them a bit of an eyesore all the same.

I suppose this question isn't difficult for some, but I've spent hours trying to figure out how to search through my favorite book for words greater than a certain length, and in the end, came up with some really ugly code:

twentyfours = [w for w in vocab if re.search('^........................$', w)]
twentyfives = [w for w in vocab if re.search('^.........................$', w)]
twentysixes = [w for w in vocab if re.search('^..........................$', w)]
twentysevens = [w for w in vocab if re.search('^...........................$', w)]
twentyeights = [w for w in vocab if re.search('^............................$', w)]

... a line for each length, all the way from a certain length to another one.

What I want instead is to be able to say 'give me every word in vocab that's greater than eight letters in length.' How would I do that?

+11  A: 

You don't need regex for this.

result = [w for w in vocab if len(w) >= 8]

but if regex must be used:

rx = re.compile('^.{8,}$')
#                  ^^^^ {8,} means 8 or more.
result = [w for w in vocab if rx.match(w)]

See http://www.regular-expressions.info/repeat.html for detail on the {a,b} syntax.
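As a quick sanity check of both approaches, here is a minimal sketch. The `vocab` list is a made-up sample, not the asker's data. It also shows the off-by-one to watch for: "greater than eight" in the question corresponds to `> 8` or `{9,}`, while `>= 8` and `{8,}` mean "eight or more".

```python
import re

# Hypothetical sample vocabulary (an assumption, not from the original post).
vocab = ["cat", "elephant", "crocodile", "hippopotamus"]

# Plain length test: eight or more characters.
at_least_eight = [w for w in vocab if len(w) >= 8]

# Same test with the anchored regex from the answer above.
rx_at_least_eight = re.compile(r"^.{8,}$")
at_least_eight_rx = [w for w in vocab if rx_at_least_eight.match(w)]

# Strictly more than eight characters, as the question is worded.
more_than_eight = [w for w in vocab if len(w) > 8]
rx_more_than_eight = re.compile(r"^.{9,}$")
more_than_eight_rx = [w for w in vocab if rx_more_than_eight.match(w)]

print(at_least_eight)   # ['elephant', 'crocodile', 'hippopotamus']
print(more_than_eight)  # ['crocodile', 'hippopotamus']
```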

KennyTM
Thanks. Everyone gave great answers, but yours showed that there's both a simple and a complicated way, so I marked yours. Not that you lack reputation 'round these here parts.
old Ixfoxleigh
-1 for unnecessary anchors
unbeli
@unbeli Anchors?
old Ixfoxleigh
@tsimotki no need to use ^ and $ here, it makes it less readable and probably slows down matching
unbeli
@unbeli: Anchored REs actually match faster than unanchored ones; there's much less backtracking for the RE engine to do. But checking the string's length or doing a direct equality test is faster than any RE doing even approximately equivalent tasks.
Donal Fellows
@Donal Fellows nope, not in this case. There is nothing to backtrack for, it's ".". RE with anchors has to examine the whole input, while RE without can stop after finding 8 chars.
unbeli
@unbeli: Did you test whether you are correct by actually timing it? In my experience with debugging RE engines and the code that depends on them, anchored matches are fastest.
Donal Fellows
@Donal Fellows no, I didn't. Did you?
unbeli
@unbeli: Yes I did, but not with Python (which I can read, but not write at any real level of skill). HTH. YMMV.
Donal Fellows
@Donal Fellows this means you didn't. And I did now, and, as expected, it's a little bit slower with anchors. Which is completely unimportant, because the main point is: it's clutter, making the RE less readable.
unbeli
A: 

^.{8,}$

This will match something that has at least 8 characters. You can also place a number after the comma to set an upper bound, or remove the first number to leave the lower bound unrestricted.
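The bounded forms can be sketched in Python as follows. The `words` list is a hypothetical sample, and the open lower bound `{,8}` is shown as Python's `re` module accepts it; other regex flavours may treat it differently.

```python
import re

# Hypothetical sample words (an assumption for illustration only).
words = ["cat", "elephant", "crocodile", "hippopotamus"]

# {8,9}: between eight and nine characters, inclusive.
between_8_and_9 = [w for w in words if re.search(r"^.{8,9}$", w)]

# {,8}: no lower bound (zero or more), at most eight characters.
at_most_8 = [w for w in words if re.search(r"^.{,8}$", w)]

print(between_8_and_9)  # ['elephant', 'crocodile']
print(at_most_8)        # ['cat', 'elephant']
```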

unholysampler
-1 for unnecessary anchors
unbeli
A: 

if you do want to use a regular expression

result = [ w for w in vocab if re.search('^.{24}',w) ]

The {x} quantifier matches exactly x repetitions of the preceding item, but it is probably better to use len(w).
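One detail worth checking: because the pattern above has no trailing `$`, it accepts any string of 24 or more characters, not only those of exactly 24. A minimal sketch, using made-up strings of known length and `re.fullmatch` as one way (an assumption, not the answerer's choice) to require an exact length:

```python
import re

# Hypothetical test strings of 23, 24 and 25 characters.
candidates = ["a" * 23, "b" * 24, "c" * 25]

# '^.{24}' only requires the first 24 characters to match.
at_least_24 = [len(w) for w in candidates if re.search(r"^.{24}", w)]

# fullmatch requires the whole string to be consumed by the pattern.
exactly_24 = [len(w) for w in candidates if re.fullmatch(r".{24}", w)]

print(at_least_24)  # [24, 25]
print(exactly_24)   # [24]
```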

Andy
Why [.]? This will match exactly 24 dots.
KennyTM
oops! fixed. thanks
Andy
+4  A: 

\w will match letters, digits, and the underscore; {min,[max]} allows you to define the size. An expression like

\w{9,}

will give all letter/number combinations of 9 characters or more
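Because this pattern is unanchored, it can also pull long words straight out of running text. The sketch below uses a made-up sentence and a made-up identifier; it illustrates that `\w` also accepts underscores and that a hyphen ends a run of word characters.

```python
import re

# Hypothetical input text (an assumption for illustration only).
text = "The hippopotamus and the crocodile met a cat."

# findall returns every non-overlapping run of nine or more word characters.
long_words = re.findall(r"\w{9,}", text)
print(long_words)  # ['hippopotamus', 'crocodile']

# \w includes the underscore, so an identifier counts as one long run.
identifiers = re.findall(r"\w{9,}", "snake_case_name x")
print(identifiers)  # ['snake_case_name']

# A hyphen is not a word character: "well" and "known" are each too short.
hyphenated = re.search(r"\w{9,}", "well-known")
print(hyphenated)  # None
```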

Ivo van der Wijk
+1 for explaining what the components mean
Zack
+1 for not anchoring it
unbeli
That's very concise. Thanks.
old Ixfoxleigh
@unbeli What's an anchor?
old Ixfoxleigh
@tsimotki "^" and "$"
unbeli
I can see the "anchor" in your eyes.
Paul McGuire
A: 

.{9,} for "more than eight", .{8,} for "eight or more"
Or just len(w) > 8

unbeli
thanks for revenge -1 :)
unbeli