I am trying to split a line on all non-word characters except . (dot).
I guess in Java it can be done with [\W ^[.]], but how do I do it in Python?

A: 

Python has a convenience function for that

>>> s = "ab.cd.ef.gh"
>>> s.split(".")
['ab', 'cd', 'ef', 'gh']
Kit
And how does that help the OP with "all non-word patterns except dot"?! This splits by dot only -- poles apart from what the OP asked.
Alex Martelli
D'oh! Poles apart indeed. Haven't had my coffee yet. Sorry about that.
Kit
+2  A: 
>>> import re
>>> the_string="http://hello-world.com"
>>> re.findall(r'[\w.]+',the_string)
['http', 'hello', 'world.com']
gnibbler
Just perfect, thanks :) Could you explain this to me?
learner
The `[\W ^[.]]` in the question describes the delimiters, while `[\w.]+` describes the words themselves, which is why we call `findall` rather than `split`.
Satoru.Logic
A: 

I'm assuming that you want to split a string on all non-word patterns except a dot.

Edit: Python doesn't support the Java-style regex syntax that you are using. I'd suggest first replacing all dots with a long string, then splitting the string, then putting the dots back in.

import re
long_str = "ABCDEFGH"  # placeholder; must not already occur in the input
text = text.replace('.', long_str)  # text is the string to split
result = re.split(r'\W', text)

Then, as you use result, replace every occurrence of long_str with a dot again.

This is a very bad solution, but it works.
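For reference, the whole round trip could be sketched as below. This is only an illustration under a few assumptions: the helper name `split_keep_dots` is hypothetical, it splits on `\W+` rather than `\W` so that runs of delimiters do not produce empty strings, and it refuses inputs that already contain the placeholder.

```python
import re

def split_keep_dots(text, placeholder="ABCDEFGH"):
    # Hypothetical helper. The placeholder must consist of word characters
    # and must not already occur in the input, or the round trip is lossy.
    if placeholder in text:
        raise ValueError("placeholder already occurs in the input")
    # Hide the dots so that \W no longer treats them as delimiters.
    protected = text.replace(".", placeholder)
    # Split on runs of non-word characters and drop empty pieces.
    pieces = [p for p in re.split(r"\W+", protected) if p]
    # Put the dots back.
    return [p.replace(placeholder, ".") for p in pieces]

print(split_keep_dots("http://hello-world.com"))  # ['http', 'hello', 'world.com']
```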

Dumb Guy
+1  A: 

A very good reference for Python's regular expression module is available here. The following should do the trick for you.

import re
re.split(r'[\w.]+', text_string)

Or,

import re
re.findall(r'[^\w.]+', text_string)
Ashish
try `text_string="foo|bar."`
gnibbler
@Ashish, nope: almost every special character is "disabled" within "sets" (i.e., between brackets) in a pattern, and in particular so is the vertical bar (in its "or" sense which it would have outside brackets).
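A quick way to see the point about the vertical bar, as a small sketch (the sample string and variable names are made up for illustration):

```python
import re

sample = "a|b"

# Outside brackets, | means alternation: match "a" or "b".
alternation = re.findall(r"a|b", sample)
# Inside brackets, | is just another literal member of the set.
in_set = re.findall(r"[a|b]", sample)

print(alternation)  # ['a', 'b']
print(in_set)       # ['a', '|', 'b']
```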
Alex Martelli
Fixed it. How's it now?
Ashish
@Alex: I remember a post of yours along the lines of how Python became a part of Google. Excellent read.
Ashish
Your regexes work now, but you've got them reversed. The `split` regex should be `[^\w.]+` and the `findall` regex should be `[\w.]+`.
Alan Moore
I felt the OP wants non-words but if words are wanted then yes, they are reversed.
Ashish
A: 

Your Java syntax is off, to begin with. This is what you were trying for:

[\W&&[^.]]

That matches a character from the intersection of the sets described by "any non-word character" and "any character except ." But that's overkill when you can just use:

[^\w.]

...or, "any character that's not a word character or .". It's the same in Python (and in most other flavors, too), though you probably want to match one or more of the characters:

re.split(r'[^\w.]+', the_string)
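If you want to convince yourself that `[^\w.]` really describes the same set as "non-word character and not a dot", a rough check over the ASCII range might look like this. The function names are hypothetical, and the check only covers ASCII, so treat it as a sanity test rather than a proof.

```python
import re

def non_word_and_not_dot(c):
    # The intersection spelled out: a non-word character that is not ".".
    return re.fullmatch(r"\W", c) is not None and c != "."

def negated_class(c):
    # The single negated class suggested above.
    return re.fullmatch(r"[^\w.]", c) is not None

# Collect every ASCII character on which the two descriptions disagree.
mismatches = [chr(i) for i in range(128)
              if non_word_and_not_dot(chr(i)) != negated_class(chr(i))]
print(mismatches)  # []
```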

But it's probably simpler to use @gnibbler's approach of matching the parts that you want to keep, not the ones you want to throw away:

re.findall(r'[\w.]+', the_string)
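One practical difference between the two, sketched with a made-up input that has delimiters at both ends: `split` reports an empty string wherever the text starts or ends with a delimiter, while `findall` returns only the pieces themselves.

```python
import re

# Hypothetical input with leading and trailing delimiters.
the_string = "--http://hello-world.com!"

split_parts = re.split(r"[^\w.]+", the_string)
found_parts = re.findall(r"[\w.]+", the_string)

print(split_parts)  # ['', 'http', 'hello', 'world.com', '']
print(found_parts)  # ['http', 'hello', 'world.com']
```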
Alan Moore
Thanks, Alan, that is really helpful.
learner