I am looking for a good .NET regular expression that I can use for parsing out individual sentences from a body of text.

It should be able to parse the following block of text into exactly 6 sentences:

Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.

Newlines should also be accepted. Numbers should not cause
sentence breaks, like 1.23.

This is proving a little more challenging than I originally thought.

Any help would be greatly appreciated. I am going to use this to train the system on known bodies of text.

+7  A: 

Try this: @"(\S.+?[.!?])(?=\s+|$)"

string str=@"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

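// Lazily match from the first non-whitespace character up to a terminator
// (. ! ?) that is followed by whitespace or the end of the input.
Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str)) {
    Console.WriteLine(match.Value);
}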

Results

Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
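
One caveat: the text in the question breaks a sentence across two lines ("Numbers should not cause" / "sentence breaks, like 1.23."). Without RegexOptions.Singleline, `.` does not match a newline, so the pattern above would return only the second half of that sentence. A minimal sketch, assuming Singleline is acceptable for your input (the class and method names are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SinglelineSketch
{
    // Same pattern as above; Singleline is the only change (lets '.' match '\n').
    private static readonly Regex Rx =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)", RegexOptions.Singleline);

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in Rx.Matches(text))
        {
            sentences.Add(m.Value);
        }
        return sentences;
    }

    public static void Main()
    {
        // The question's text, including the line break inside the last sentence.
        string text =
            "Hello world! How are you? I am fine. " +
            "This is a difficult sentence because I use I.D.\n\n" +
            "Newlines should also be accepted. Numbers should not cause\n" +
            "sentence breaks, like 1.23.";

        foreach (string s in Split(text))
        {
            Console.WriteLine("[" + s + "]");
        }
    }
}
```

The matched sentence keeps its embedded newline, so you may want to normalize whitespace afterwards.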

For more complicated text you will, of course, need a real parser such as SharpNLP or NLTK. Mine is just a quick and dirty one.

Here is some information about SharpNLP and its features:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

  • a sentence splitter
  • a tokenizer
  • a part-of-speech tagger
  • a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
  • a parser
  • a name finder
  • a coreference tool
  • an interface to the WordNet lexical database
S.Mark
+1 for pointing us to SharpNLP which I hadn't seen before and may be very useful.
peter.murray.rust
Better use a look-ahead assertion for `(?:\s+|$)`.
Gumbo
Thanks for the info, Gumbo. It's better, but I had to add \S at the front so that leading whitespace is stripped from each match.
S.Mark
Thanks everyone. This has been useful insight. I will try it out over the next few days.
Luke Machowski
+1  A: 

This is not really possible with regular expressions alone, unless you know exactly which "difficult" tokens you have, such as "i.d.", "Mr.", etc. For example, how many sentences is "Please show your I.D, Mr. Bond."? I'm not familiar with any C# implementations, but I've used NLTK's Punkt tokenizer. It probably would not be too hard to reimplement.
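
To make the "known tokens" point concrete, here is a minimal sketch, assuming the lookahead-based pattern posted in this thread and a made-up abbreviation list (Mr, Mrs, Dr). The class and method names are hypothetical. A negative lookbehind rejects a terminator that directly follows a listed abbreviation:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbbreviationSketch
{
    // Pattern from this thread, with no knowledge of abbreviations.
    private static readonly Regex Naive =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)");

    // Same pattern, but a terminator is rejected when it directly follows
    // one of the (hypothetical, incomplete) abbreviations Mr, Mrs, Dr.
    private static readonly Regex Aware =
        new Regex(@"(\S.+?(?<!\b(?:Mr|Mrs|Dr))[.!?])(?=\s+|$)");

    public static List<string> Split(Regex rx, string text)
    {
        var sentences = new List<string>();
        foreach (Match m in rx.Matches(text))
        {
            sentences.Add(m.Value);
        }
        return sentences;
    }

    public static List<string> SplitNaive(string text)
    {
        return Split(Naive, text);
    }

    public static List<string> SplitAware(string text)
    {
        return Split(Aware, text);
    }

    public static void Main()
    {
        string text = "Please show your I.D, Mr. Bond. Thanks!";
        Console.WriteLine("Naive: " + SplitNaive(text).Count + " sentences");
        Console.WriteLine("Aware: " + SplitAware(text).Count + " sentences");
    }
}
```

The naive pattern breaks after "Mr.", while the abbreviation-aware one keeps "Please show your I.D, Mr. Bond." together. Every abbreviation missing from the list still breaks it, which is the limitation described above.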

Alex Brasetvik
+1  A: 
var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

Regex.Split(str, @"(?<=[.?!])\s+").Dump();

I tested this in LINQPad.
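
Dump() is a LINQPad extension method, so outside LINQPad the same split needs its own output. A minimal sketch, assuming a plain console program (the class and method names are made up for illustration); it also trims the input, because trailing whitespace after the last terminator would otherwise yield an empty final element:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitSketch
{
    public static string[] Split(string text)
    {
        // Split on whitespace that is preceded by a terminator (. ? !).
        // The lookbehind keeps the terminator attached to its sentence.
        return Regex.Split(text.Trim(), @"(?<=[.?!])\s+");
    }

    public static void Main()
    {
        string text =
            "Hello world! How are you? I am fine. " +
            "This is a difficult sentence because I use I.D.\n" +
            "Newlines should also be accepted. " +
            "Numbers should not cause sentence breaks, like 1.23.";

        foreach (string sentence in Split(text))
        {
            Console.WriteLine(sentence);
        }
    }
}
```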

SLaks
Thanks for trying it out.
Luke Machowski
+3  A: 

It is impossible to use regexes to parse natural language. What is the end of a sentence? A period can occur in many places (e.g. e.g.). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately there are very few, if any, offerings in C#. You may therefore have to create a web service or otherwise link the toolkit into C#.

Note that it will cause problems in the future if you rely on exact whitespace as in "I.D.". You'll soon find examples that break your regex. For example, most people put spaces after their initials.

There is an excellent summary of open-source and commercial offerings on Wikipedia (http://en.wikipedia.org/wiki/Natural%5Flanguage%5Fprocessing%5Ftoolkits). We have used several of them. It's worth the effort.

[You use the word "train". This is normally associated with machine-learning (which is one approach to NLP and has been used for sentence-splitting). Indeed the toolkits I have mentioned include machine learning. I suspect that wasn't what you meant - rather that you would evolve your expression through heuristics. Don't!]

peter.murray.rust
Thanks for that info. I have always been intrigued by the machine-learning aspect of this, and it is something I would like to investigate. For my current purpose, I actually think the simple regex approach (where I don't expect the weird cases you speak of) is just fine. However, I will try the frameworks you mention because they already exist.
Luke Machowski
A: 

I have used the suggestions posted here and come up with a regex that seems to achieve what I want to do:

(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)

I used a fantastic tool for playing with (and interpreting) it. It's called Expresso. ( http://www.ultrapico.com/ )

This is what it says:

//  using System.Text.RegularExpressions;
/// <summary>
///  Regular expression built for C# on: Sun, Dec 27, 2009, 03:05:24 PM
///  Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  [Sentence]: A named capture group. [\S.+?(?<Terminator>[.!?]|\Z)]
///      \S.+?(?<Terminator>[.!?]|\Z)
///          Anything other than whitespace
///          Any character, one or more repetitions, as few as possible
///          [Terminator]: A named capture group. [[.!?]|\Z]
///              Select from 2 alternatives
///                  Any character in this class: [.!?]
///                  End of string or before new line at end of string
///  Match a suffix but exclude it from the capture. [\s+|\Z]
///      Select from 2 alternatives
///          Whitespace, one or more repetitions
///          End of string or before new line at end of string
///  
///
/// </summary>
public static Regex regex = new Regex(
      "(?<Sentence>\\S.+?(?<Terminator>[.!?]|\\Z))(?=\\s+|\\Z)",
    RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
    );


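// This is the replacement string.
// Note: the pattern above defines no Day, Month, or Year groups, so this
// generated template does not apply to it as written.
public static string regexReplace = 
      "$& [${Day}-${Month}-${Year}]";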


//// Replace the matched text in the InputText using the replacement pattern
// string result = regex.Replace(InputText,regexReplace);

//// Split the InputText wherever the regex matches
// string[] results = regex.Split(InputText);

//// Capture the first Match, if any, in the InputText
// Match m = regex.Match(InputText);

//// Capture all Matches in the InputText
// MatchCollection ms = regex.Matches(InputText);

//// Test to see if there is a match in the InputText
// bool IsMatch = regex.IsMatch(InputText);

//// Get the names of all the named and numbered capture groups
// string[] GroupNames = regex.GetGroupNames();

//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = regex.GetGroupNumbers();
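
A minimal sketch of reading the two named groups from each match. The class name, method name, and sample text are made up for illustration, and the options are reduced to CultureInvariant. The last sample sentence has no terminator, so it is matched through the \Z alternative and its Terminator group is empty:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupSketch
{
    private static readonly Regex Rx = new Regex(
        @"(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)",
        RegexOptions.CultureInvariant);

    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        foreach (Match m in Rx.Matches(text))
        {
            string sentence = m.Groups["Sentence"].Value;
            string terminator = m.Groups["Terminator"].Value;
            // An empty Terminator means the sentence ended at end of input.
            string label = terminator.Length == 0 ? "none" : terminator;
            lines.Add(sentence + " (terminator: " + label + ")");
        }
        return lines;
    }

    public static void Main()
    {
        string text = "Hello world! How are you? No terminator here";
        foreach (string line in Describe(text))
        {
            Console.WriteLine(line);
        }
    }
}
```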

Thanks for everyone's help!

Luke Machowski