I'm trying to divide a long text into small parts, so that every part is at least N characters long and ends with one of the stop punctuation marks (? . !). Once a part is longer than N characters, it should stop at the next punctuation mark.

For example:

Let's say N = 10

Do you want lime? Yes. I love when I drink tequila.

This text should be divided into two parts:

[1] Do you want lime?
[2] Yes. I love when I drink tequila.
+2  A: 

Maybe like this? (Thanks to KennyTM for final optimizations.)

.{10}[^.?!]*[.?!]+
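
A minimal sketch of how this pattern might be applied in Python with `findall`. The function name `split_parts` and the `min_len` parameter are made up for illustration, and stripping whitespace is an added assumption (the pattern itself keeps the space that precedes each later part).

```python
import re

def split_parts(text, min_len=10):
    # Hypothetical helper: min_len characters of anything, then
    # non-terminators up to the next run of terminators (. ? !).
    pattern = re.compile(r".{%d}[^.?!]*[.?!]+" % min_len)
    # Each match after the first starts with the space left over from
    # the previous part, so strip it.
    return [part.strip() for part in pattern.findall(text)]

print(split_parts("Do you want lime? Yes. I love when I drink tequila."))
```

Note that `.` does not match newlines unless `re.DOTALL` is passed, so multi-line input may need that flag.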
Thomas
No need to escape these characters inside character classes. And this regex will always match the entire string.
Tim Pietzcker
You're right, there was a problem. The escaping doesn't hurt, I believe.
Thomas
I also tried this: prog = re.compile('/([^\.\?\!]{10,}[^\.\?\!]*?[\.\?\!]+)+/'); result = prog.split(test). But the result was the whole sentence from the example.
Ilija
@Ilija: I think you want to use prog.findall() instead of prog.split(). I've updated the regex for use in Python.
Thomas
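
One likely reason the attempt with the surrounding slashes returned the whole sentence: Python's `re` module has no `/.../` delimiters, so the slashes are literal characters that the text would have to contain. A small sketch, assuming `test` holds the sample text from the question:

```python
import re

test = "Do you want lime? Yes. I love when I drink tequila."

# The slashes are part of the pattern, so it can only match text
# that actually contains "/". Here it never matches.
with_slashes = re.compile(r"/([^.?!]{10,}[^.?!]*?[.?!]+)+/")
unmatched = with_slashes.split(test)  # nothing to split on

# findall() returns the matched parts themselves.
parts = re.compile(r".{10}[^.?!]*[.?!]+").findall(test)

print(unmatched)
print(parts)
```

With no match, `split()` hands back a one-element list holding the untouched input, which fits the behaviour described.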
Yes, this will work. Thanks a lot.
Ilija
You could use `{10}` as a shortcut for `{10,10}`. Also, I think `[^.?!]*` doesn't need to be lazy any more.
KennyTM
@KennyTM: Yes to the shortcut. The expression evolved over time, so this small improvement was overlooked. ;) The [^.?!]*? is absolutely necessary (unless I'm missing something) to switch to non-greedy matching. Otherwise it would just match all of: 'Hello, this is part 1. And this should be part 2.', instead of matching only part 1 and part 2 in distinct matches. It depends on Python's default setting for greediness, of course.
Thomas
@Thomas: No it won't match all of them. Notice that it is `[^.?!]*` not `.*`.
KennyTM
@KennyTM: Haha, darn, got lost in my own expression. You're absolutely right. With that change, it should be as concise as it gets, I hope.
Thomas
+2  A: 
.{10,}?[.!?]+\s*

should work. It will also keep repeated punctuation characters together, so it splits `Do you want lime??? Yes. I love when I drink tequila.` into `Do you want lime???` and `Yes. I love when I drink tequila.`

However, it doesn't take quoted speech into account and will break `Peter said "Hi! How about dinner tonight?" and left.` into `Peter said "Hi!`, `How about dinner tonight?` and `" and left.`

Could that be a problem that needs to be taken into account?
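
A short sketch of both behaviours in Python, with N fixed at 10. The helper name `parts_lazy` is made up for illustration, and stripping the trailing whitespace is an assumption about the desired output.

```python
import re

# Lazily take at least 10 characters, then a run of terminators
# and any whitespace that follows.
PATTERN = re.compile(r".{10,}?[.!?]+\s*")

def parts_lazy(text):
    # Hypothetical helper: return the matched parts without the
    # whitespace that \s* swallowed.
    return [m.strip() for m in PATTERN.findall(text)]

print(parts_lazy("Do you want lime??? Yes. I love when I drink tequila."))
print(parts_lazy('Peter said "Hi! How about dinner tonight?" and left.'))
```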

Tim Pietzcker
No, that doesn't need to be taken into account. I'm testing it now.
Ilija
How would it split: Hello ?? !! ... My name is not important.So I'm not saying it ...
Thomas
I'm trying this in Python: prog = re.compile('.{10,}?[.!?]+\s*'); result = prog.split(test), but it gives me an empty list.
Ilija
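
A sketch of what is probably happening here, assuming `test` is the sample text from the question: `split()` returns the text between matches, and because the matches cover the whole string, only empty strings are left. `findall()` returns the matches instead.

```python
import re

test = "Do you want lime? Yes. I love when I drink tequila."
prog = re.compile(r".{10,}?[.!?]+\s*")

between = prog.split(test)    # the text between the matches
matches = prog.findall(test)  # the matches themselves

print(between)
print(matches)
```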
@Thomas: Then it should probably be .{10,}?[.!?][\s.!?]*
ninjalj
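
To compare the two patterns on that input, here is a small sketch. The variable names are made up for illustration, and the results are stripped of surrounding whitespace.

```python
import re

text = "Hello ?? !! ... My name is not important.So I'm not saying it ..."

original = re.compile(r".{10,}?[.!?]+\s*")
variant = re.compile(r".{10,}?[.!?][\s.!?]*")

# The original pattern stops at the first whitespace after a terminator
# run, so the "..." is pushed to the start of the next part.
original_parts = [m.strip() for m in original.findall(text)]

# The variant keeps consuming terminators and whitespace in any order,
# so "!! ..." stays with the first part.
variant_parts = [m.strip() for m in variant.findall(text)]

print(original_parts)
print(variant_parts)
```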