ansaurus

Question

What is a regular expression for parsing out individual sentences?

Answer 1

+7 A:

Try this pattern: @"(\S.+?[.!?])(?=\s+|$)"

string str=@"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

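// Requires: using System; using System.Text.RegularExpressions;
// Lazily match from the first non-whitespace character up to a terminator
// that is followed by whitespace or the end of the string.
Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str)) {
    Console.WriteLine(match.Value);
}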

Results

Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
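
One caveat worth checking: in .NET, `.` does not match `\n` by default, so the pattern above accepts newlines between sentences but not inside one. The sketch below compares the default behaviour with RegexOptions.Singleline on a sentence that wraps across two lines. The class name, the helper method and the sample text are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WrappedSentenceDemo
{
    // Same pattern as in the answer above; only the options differ.
    private const string Pattern = @"(\S.+?[.!?])(?=\s+|$)";

    // Hypothetical helper: returns the text of every match.
    public static List<string> Sentences(string text, RegexOptions options)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern, options))
            result.Add(m.Value);
        return result;
    }

    public static void Main()
    {
        string text = "This is a\nlong sentence. Next one.";

        // Default options: '.' stops at '\n', so "This is a" is silently dropped.
        foreach (string s in Sentences(text, RegexOptions.None))
            Console.WriteLine("[default]    " + s);

        // Singleline: '.' also matches '\n', so the wrapped sentence stays whole.
        foreach (string s in Sentences(text, RegexOptions.Singleline))
            Console.WriteLine("[singleline] " + s);
    }
}
```

If the input may contain hard-wrapped lines, passing RegexOptions.Singleline (or normalising line breaks first) is probably the safer choice.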

For more complicated text you will, of course, need a real parser such as SharpNLP or NLTK. Mine is just a quick and dirty one.

Here is the SharpNLP description and feature list:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

a sentence splitter
a tokenizer
a part-of-speech tagger
a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
a parser
a name finder
a coreference tool
an interface to the WordNet lexical database

S.Mark 2009-12-20 17:20:11

+1 for pointing us to SharpNLP which I hadn't seen before and may be very useful.

peter.murray.rust 2009-12-20 17:41:35

Better to use a look-ahead assertion instead of `(?:\s+|$)`.

Gumbo 2009-12-21 08:27:11

Thanks for the info, Gumbo. It's better, but I had to add \S at the front so that leading whitespace is stripped from each match.

S.Mark 2009-12-21 08:58:10

Thanks everyone. This has been useful insight. I will try it out over the next few days.

Luke Machowski 2009-12-22 20:42:32

Answer 2

+1 A:

This is not really possible with regular expressions alone, unless you know exactly which "difficult" tokens you have, such as "i.d.", "Mr.", etc. For example, how many sentences is "Please show your I.D, Mr. Bond."? I'm not familiar with any C# implementations, but I've used NLTK's Punkt tokenizer. It probably would not be too hard to re-implement.
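
As a rough sketch of the "known difficult tokens" idea, the code below takes the candidate sentences produced by the regex from Answer 1 and glues a fragment to the next one whenever it ends in a listed abbreviation. The class name, the `KnownAbbreviations` list and the merging rule are all assumptions made for illustration; this is a simple heuristic, not Punkt and not any library's API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbbreviationAwareSplitter
{
    // Hypothetical list; a real application would need its own.
    private static readonly HashSet<string> KnownAbbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "Mr.", "Mrs.", "Dr.", "e.g.", "i.e." };

    // Candidate boundaries come from the regex suggested in Answer 1.
    private static readonly Regex Candidate = new Regex(@"(\S.+?[.!?])(?=\s+|$)");

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        string pending = null;
        foreach (Match m in Candidate.Matches(text))
        {
            // Re-attach a fragment that was held back (joined with one space).
            string piece = pending == null ? m.Value : pending + " " + m.Value;
            int lastGap = piece.LastIndexOfAny(new[] { ' ', '\t', '\r', '\n' });
            string lastWord = piece.Substring(lastGap + 1);

            if (KnownAbbreviations.Contains(lastWord))
            {
                pending = piece;          // boundary was an abbreviation: keep going
            }
            else
            {
                sentences.Add(piece);     // genuine candidate sentence
                pending = null;
            }
        }
        if (pending != null) sentences.Add(pending);
        return sentences;
    }

    public static void Main()
    {
        foreach (string s in Split("Please show your I.D, Mr. Bond. Thank you."))
            Console.WriteLine(s);
    }
}
```

The plain regex splits the example after "Mr."; with the list in place it stays in one piece. The obvious weakness is a sentence that really does end in a listed abbreviation, which this sketch would merge with the following sentence.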

Alex Brasetvik 2009-12-20 17:23:38

Answer 3

+1 A:

var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

Regex.Split(str, @"(?<=[.?!])\s+").Dump();

I tested this in LINQPad.
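
`.Dump()` is a LINQPad extension method, so the line above will not compile in an ordinary project. A minimal console sketch of the same split follows; the class and method names, and the added `Trim()` call (which avoids an empty trailing element when the text ends in whitespace), are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Split on whitespace that directly follows '.', '?' or '!'.
    // The look-behind keeps the terminator attached to its sentence.
    public static string[] SplitSentences(string text)
    {
        return Regex.Split(text.Trim(), @"(?<=[.?!])\s+");
    }

    public static void Main()
    {
        var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

        foreach (string sentence in SplitSentences(str))
            Console.WriteLine(sentence);
    }
}
```

On this sample text the split yields the same six sentences as the match-based pattern in Answer 1, because "I.D." and "1.23." contain no whitespace after their inner periods.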

SLaks 2009-12-20 17:24:08

Thanks for trying it out.

Luke Machowski 2009-12-22 20:46:25

Answer 4

+3 A:

It is impossible to use regexes to parse natural language. What is the end of a sentence? A period can occur in many places (e.g. e.g.). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately there are very few, if any, offerings in C#. You may therefore have to create a web service or otherwise link the toolkit into C#.

Note that it will cause problems in the future if you rely on exact whitespace as in "I.D.". You'll soon find examples that break your regex. For example, most people put spaces after their initials.
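
To make the point about initials concrete, the sketch below runs the two regexes suggested in the other answers on a name written with spaced initials. The input string and the class and method names are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InitialsDemo
{
    // Match-based pattern from Answer 1.
    public static string[] MatchApproach(string text)
    {
        MatchCollection matches = Regex.Matches(text, @"(\S.+?[.!?])(?=\s+|$)");
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Value;
        return result;
    }

    // Split-based pattern from Answer 3.
    public static string[] SplitApproach(string text)
    {
        return Regex.Split(text, @"(?<=[.?!])\s+");
    }

    public static void Main()
    {
        string text = "J. R. R. Tolkien wrote it.";

        // One real sentence, but the match approach returns 2 fragments...
        Console.WriteLine(string.Join(" | ", MatchApproach(text)));

        // ...and the split approach returns 4.
        Console.WriteLine(string.Join(" | ", SplitApproach(text)));
    }
}
```

Both patterns treat every "period followed by whitespace" as a boundary, so neither can keep the name together without extra knowledge about initials.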

There is an excellent summary of open and commercial offerings on Wikipedia (http://en.wikipedia.org/wiki/Natural%5Flanguage%5Fprocessing%5Ftoolkits). We have used several of them. It's worth the effort.

[You use the word "train". This is normally associated with machine-learning (which is one approach to NLP and has been used for sentence-splitting). Indeed the toolkits I have mentioned include machine learning. I suspect that wasn't what you meant - rather that you would evolve your expression through heuristics. Don't!]

peter.murray.rust 2009-12-20 17:29:47

Thanks for that info. I am always intrigued by the machine-learning aspect of this, and it is one aspect that I would like to investigate. For my current purpose, I actually think the simple regex approach (where I don't expect the weird cases you speak of) is just fine. However, I will try the frameworks you mention because they already exist.

Luke Machowski 2009-12-22 20:45:27

Answer 5

A:

I have used the suggestions posted here and come up with a regex that seems to achieve what I want to do:

(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)

I used a fantastic tool for playing with (and interpreting) it. It's called Expresso ( http://www.ultrapico.com/ ).

This is what it says:

//  using System.Text.RegularExpressions;
/// <summary>
///  Regular expression built for C# on: Sun, Dec 27, 2009, 03:05:24 PM
///  Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  [Sentence]: A named capture group. [\S.+?(?<Terminator>[.!?]|\Z)]
///      \S.+?(?<Terminator>[.!?]|\Z)
///          Anything other than whitespace
///          Any character, one or more repetitions, as few as possible
///          [Terminator]: A named capture group. [[.!?]|\Z]
///              Select from 2 alternatives
///                  Any character in this class: [.!?]
///                  End of string or before new line at end of string
///  Match a suffix but exclude it from the capture. [\s+|\Z]
///      Select from 2 alternatives
///          Whitespace, one or more repetitions
///          End of string or before new line at end of string
///  
///
/// </summary>
public static Regex regex = new Regex(
      "(?<Sentence>\\S.+?(?<Terminator>[.!?]|\\Z))(?=\\s+|\\Z)",
    RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
    );


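// This is the replacement string.
// Note: this regex defines no Day, Month or Year groups, so this
// replacement pattern does not fit the sentence regex above.
public static string regexReplace = 
      "$& [${Day}-${Month}-${Year}]";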


//// Replace the matched text in the InputText using the replacement pattern
// string result = regex.Replace(InputText,regexReplace);

//// Split the InputText wherever the regex matches
// string[] results = regex.Split(InputText);

//// Capture the first Match, if any, in the InputText
// Match m = regex.Match(InputText);

//// Capture all Matches in the InputText
// MatchCollection ms = regex.Matches(InputText);

//// Test to see if there is a match in the InputText
// bool IsMatch = regex.IsMatch(InputText);

//// Get the names of all the named and numbered capture groups
// string[] GroupNames = regex.GetGroupNames();

//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = regex.GetGroupNumbers();
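
A short sketch of how the two named groups might be read back follows. The class name, the helper method and the sample input are assumptions for illustration; the pattern is the one shown above. The second sample sentence has no closing punctuation, which exercises the `\Z` alternative of the Terminator group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    private static readonly Regex SentenceRegex = new Regex(
        @"(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)",
        RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // Hypothetical helper: one line of text per matched sentence.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        foreach (Match m in SentenceRegex.Matches(input))
        {
            string sentence = m.Groups["Sentence"].Value;
            // Empty when the sentence was closed by end of input (\Z).
            string terminator = m.Groups["Terminator"].Value;
            lines.Add(sentence + " -> terminator '" + terminator + "'");
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("Hello world! No terminator here"))
            Console.WriteLine(line);
    }
}
```

An empty Terminator value is therefore a cheap way to detect a trailing fragment that was cut off before its closing punctuation.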

Thanks for everyone's help!

Luke Machowski 2009-12-27 13:07:19
