I am trying to split a line on all non-word characters except . (dot).
In Java I guess it could be done with something like [\W ^[.]], but how do I do it in Python?
Python has a convenience function for that:
>>> s = "ab.cd.ef.gh"
>>> s.split(".")
['ab', 'cd', 'ef', 'gh']
>>> import re
>>> the_string="http://hello-world.com"
>>> re.findall(r'[\w.]+',the_string)
['http', 'hello', 'world.com']
I'm assuming that you want to split a string on all non-word patterns except a dot.
Edit: Python doesn't support the Java-style regex syntax that you are using. I'd suggest first replacing all dots with a long string, then splitting the string, then putting the dots back in.
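import re
long_str = "ABCDEFGH"  # placeholder; must not already occur in the input
# Hide the dots behind word characters so \W does not split on them.
# (Avoid naming the variable "str": that shadows the built-in type.)
text = the_string.replace('.', long_str)
result = re.split(r'\W', text)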
Then, as you use result, replace every long_str sequence with a dot again.
This is a very bad solution, but it works.
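For completeness, here is a minimal sketch of the whole round trip, including the step that puts the dots back. The function name split_keep_dots and the PLACEHOLDER value are made up for illustration, and the sketch assumes the placeholder never appears in the input. It also uses \W+ rather than \W and drops empty pieces, which is a small deviation from the snippet above.

```python
import re

# Hypothetical placeholder; assumed never to appear in the input text.
PLACEHOLDER = "ABCDEFGH"

def split_keep_dots(text):
    # Hide dots behind word characters so \W+ does not split on them.
    hidden = text.replace(".", PLACEHOLDER)
    parts = re.split(r"\W+", hidden)
    # Restore the dots and drop empty pieces left by leading/trailing separators.
    return [p.replace(PLACEHOLDER, ".") for p in parts if p]

print(split_keep_dots("http://hello-world.com"))  # ['http', 'hello', 'world.com']
```

If the input could contain the placeholder, the result would be wrong, which is one reason the single-regex answers in this thread are preferable.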
A very good reference for Python's regular expression module is available here. The following should do the trick for you.
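import re
# Split on runs of characters that are neither word characters nor dots.
re.split(r'[^\w.]+', text_string)
Or,
import re
# Collect the runs of word characters and dots directly.
re.findall(r'[\w.]+', text_string)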
Your Java syntax is off, to begin with. This is what you were trying for:
[\W&&[^.]]
That matches a character from the intersection of the sets described by "any non-word character" and "any character except .". But that's overkill when you can just use:
[^\w.]
...or, "any character that's not a word character or .". It's the same in Python (and in most other flavors, too), though you probably want to match one or more of the characters:
re.split(r'[^\w.]+', the_string)
But it's probably simpler to use @gnibbler's approach of matching the parts that you want to keep, not the ones you want to throw away:
re.findall(r'[\w.]+', the_string)
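One practical difference between the two calls, shown as a small sketch: when the string starts (or ends) with separator characters, re.split returns an empty string at that end, while re.findall does not. The sample input below is made up for illustration.

```python
import re

sample = "--a.b, c"  # hypothetical input that starts with separators

by_split = re.split(r"[^\w.]+", sample)     # split on what you throw away
by_findall = re.findall(r"[\w.]+", sample)  # match what you keep

print(by_split)    # ['', 'a.b', 'c']
print(by_findall)  # ['a.b', 'c']
```

So with re.split you may need to filter out empty strings afterwards, which is another point in favor of the findall approach.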