Hi all, let's say I have a string that I want to split on several characters, like ".", "!", and "?". How do I figure out which of those characters caused each split, so I can add that same character back onto the end of the corresponding segment?

    Dim linePunctuation As Integer = 0
    Dim myString As String = "some text. with punctuation! in it?"

    ' Count the sentence-ending punctuation marks in a single pass.
    For i = 1 To Len(myString)
        Dim ch As String = Mid$(myString, i, 1)
        If ch = "." OrElse ch = "!" OrElse ch = "?" Then linePunctuation += 1
    Next

    Dim delimiters() As Char = {"."c, "!"c, "?"c}

    Dim currentLineSplit() As String = myString.Split(delimiters)

    Dim sentenceArray(linePunctuation) As String
    Dim count As Integer = 0

    While linePunctuation > 0

        ' Here I want to add whatever delimiter was used to make the split back onto the string before it is stored in the array.
        sentenceArray(count) = currentLineSplit(count)

        count += 1
        linePunctuation -= 1

    End While
A: 

.Split() does not provide this information.

You will need a regular expression to accomplish what you are after, which I infer is splitting an English-ish paragraph into sentences on punctuation.

The simplest implementation would look like this.

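    // Requires: using System; using System.Text.RegularExpressions;
    var input = "some text. with punctuation! in it?";
    // Regex.Split includes the text captured by the group; the pieces between
    // matches are empty strings, so skip those.
    string[] sentences = Regex.Split(input, @"\b(?<sentence>.*?[.!?](?:\s|$))");
    foreach (string sentence in sentences)
    {
        if (sentence.Length == 0) continue;
        Console.WriteLine(sentence.TrimEnd());
    }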

Results

some text.
with punctuation!
in it?

But you are going to find very quickly that language, as spoken and written by humans, rarely follows simple rules. An abbreviation such as "Mr." followed by a space, for example, would be treated as the end of a sentence.

Here it is in VB for ya:

    ' Requires Imports System.Text.RegularExpressions; myString is the string from the question.
    Dim sentences As String() = Regex.Split(myString, "\b(?<sentence>.*?[.!?](?:\s|$))")
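As an alternative to splitting, a sentence can be matched directly, which avoids the empty entries that Regex.Split leaves between matches. This is only a minimal sketch and swaps in Regex.Matches with a lookahead in place of the pattern above; the module name SentenceMatcher and the helper MatchSentences are hypothetical names chosen for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SentenceMatcher
    ' Hypothetical helper: returns each sentence with its closing punctuation.
    Function MatchSentences(source As String) As List(Of String)
        Dim sentences As New List(Of String)
        ' \S        start at the first non-whitespace character
        ' .*?       take as little text as possible...
        ' [.!?]     ...up to a terminator
        ' (?=\s|$)  that is followed by whitespace or the end of the input
        For Each m As Match In Regex.Matches(source, "\S.*?[.!?](?=\s|$)")
            sentences.Add(m.Value)
        Next
        Return sentences
    End Function

    Sub Main()
        For Each sentence In MatchSentences("some text. with punctuation! in it?")
            Console.WriteLine(sentence)
        Next
    End Sub
End Module
```

Trailing text with no closing punctuation is not matched at all, so this only suits input where every sentence is terminated.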

Good luck.

Sky Sanders
Ah Mr. Sanders, I am actually doing this for you :) (because of the question about how to best translate large blocks of text). I am almost done but I couldn't get this portion to work right. Could you elaborate on how I can use Regex to get what I want?
typoknig
@typo - OK, I will give it a shot a bit later. In regards to the translation thing, trust me: I have been down that road, long and hard, and come back with a bit of experience. Translation via tokenization is a fool's errand. ;-)
Sky Sanders
Well, I'll go ahead and post my code to that other question. It works exactly how I thought it would when I use a file that has only sentences ending with periods; I think that is enough to prove that what I said was correct. I was just trying to get it to work with other punctuation too. Maybe you could swing by and un-downvote my answer :)
typoknig
@typo - getchur red hot regex. I understand what you are trying to do, although previously it seemed you were implying that tokenization would be done on a word boundary. Splitting on the sentence *may* give slightly less disjointed results. But this is assuming you are going English-->XXXX only. Try parsing Farsi '=)
Sky Sanders
A: 

Once you've called Split with all 3 characters, you've tossed that information away. You could do what you're trying to do by splitting yourself or by splitting on one punctuation mark at a time.
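A minimal sketch of the "splitting yourself" option, assuming the goal is to keep each delimiter attached to the text before it. It walks the string with String.IndexOfAny and cuts just after every delimiter it finds; the module name ManualSplitter and the helper SplitKeepingDelimiters are hypothetical names used only for illustration.

```vb
Imports System
Imports System.Collections.Generic

Module ManualSplitter
    ' Hypothetical helper: cuts the text after each delimiter so the delimiter stays attached.
    Function SplitKeepingDelimiters(source As String, delimiters As Char()) As List(Of String)
        Dim pieces As New List(Of String)
        Dim start As Integer = 0
        Dim hit As Integer = source.IndexOfAny(delimiters, start)

        While hit >= 0
            ' Take everything from the previous cut through the delimiter itself.
            pieces.Add(source.Substring(start, hit - start + 1).Trim())
            start = hit + 1
            hit = source.IndexOfAny(delimiters, start)
        End While

        ' Keep any trailing text that has no closing punctuation.
        Dim rest As String = source.Substring(start).Trim()
        If rest.Length > 0 Then pieces.Add(rest)
        Return pieces
    End Function

    Sub Main()
        Dim delimiters As Char() = {"."c, "!"c, "?"c}
        For Each piece In SplitKeepingDelimiters("some text. with punctuation! in it?", delimiters)
            Console.WriteLine(piece)
        Next
    End Sub
End Module
```

Consecutive delimiters such as "..." come out as separate one-character pieces in this sketch, so real text would need extra handling.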

sblom
A: 

You can use LINQ.

See this link for a nice example.

Fredou
+2  A: 

If you add a capturing group to your regex like this:

    Dim splitArray As String() = Regex.Split(myString, "([.?!])")

Then the returned array contains both the text between the punctuation, and separate elements for each punctuation character. The Split() function in .NET includes text matched by capturing groups in the returned array. If your regex has several capturing groups, all their matches are included in the array.

This splits your sample into the following elements, plus an empty string at the end because the sample ends with a delimiter:

some text
.
 with punctuation
!
 in it
?

You can then iterate over the array to get your "sentences" and your punctuation.
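The iteration could look roughly like the sketch below, which assumes a single capturing group so that text and punctuation alternate in the returned array. The module name PairRejoiner and the helper RejoinSentences are hypothetical names, not part of .NET.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module PairRejoiner
    ' Hypothetical helper: glues each text element to the punctuation element after it.
    Function RejoinSentences(source As String) As List(Of String)
        ' Even indexes hold text, odd indexes hold the captured punctuation.
        Dim parts As String() = Regex.Split(source, "([.?!])")
        Dim sentences As New List(Of String)

        For i As Integer = 0 To parts.Length - 1 Step 2
            Dim sentence As String = parts(i)
            If i + 1 < parts.Length Then sentence &= parts(i + 1)
            sentence = sentence.Trim()
            ' Skip the empty element produced when the input ends with punctuation.
            If sentence.Length > 0 Then sentences.Add(sentence)
        Next
        Return sentences
    End Function

    Sub Main()
        For Each sentence In RejoinSentences("some text. with punctuation! in it?")
            Console.WriteLine(sentence)
        Next
    End Sub
End Module
```

Stepping through the array two elements at a time keeps the pairing explicit, and the bounds check covers the final element, which has no punctuation after it.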

Jan Goyvaerts
I couldn't get this to work (nor any of the other answers), but I will take your word for it and mess with it again when I have some more time. Thanks!
typoknig