I am using VB.NET to write a program that reads the words from a supplied text file and counts how many times each word appears. I am using this regular expression:

Dim parser As New Regex("\w+")

It gives me almost 100% correct words, except when I have text like

"Ms Word App file name is word.exe." or "is this a c# statment If(a>b?1,0) ?"

In such cases I get [word, exe] and [If, a, b, 1, 0] as separate words. For my purpose it would be better to receive word.exe and If(a>b?1,0) as single words.

I guess \w+ looks for white space, sentence terminating punctuation mark and other punctuation marks to determine a word.
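To see what `\w+` actually produces, here is a minimal console sketch (the module name and sample string are only for illustration). In .NET, `\w` matches word characters, essentially letters, digits and the underscore, so any other character ends the current match; the pattern does not look for whitespace or sentence endings as such.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordCharDemo
    Sub Main()
        ' \w+ matches runs of word characters (letters, digits, underscore).
        ' Any other character, such as ".", "#", "(" or a space, ends the match.
        Dim parser As New Regex("\w+")
        Dim sample As String = "Ms Word App file name is word.exe."

        For Each m As Match In parser.Matches(sample)
            Console.WriteLine(m.Value)
        Next
        ' Prints, one per line: Ms, Word, App, file, name, is, word, exe
    End Sub
End Module
```

The dot in word.exe is not a word character, which is why the file name comes back as two matches.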

I want a similar regular expression that will not break a word at a punctuation mark unless the punctuation mark is at the end of the word. I think end-of-word can be defined by trailing whitespace or sentence-terminating punctuation (you may think of others). If you can suggest a regular expression (for VB.NET), that would be a great help.

Thanks.

A: 

If we assume that a . followed by a space is a full stop, then this regex should work:

[\w(?!\S)\.]+
Hun1Ahpu
@Hun1Ahpu: This one is working best so far. The only problem is that for words at the end of a sentence it includes the terminal full stop (.), question mark (?) or exclamation mark (!). If a word ends with a comma (,), this regex includes that too. For example, "i like mango, orange and banana." gives "mango,", "orange" and "banana." as words, but it would be perfect if I got "mango", "orange" and "banana".
Mehdi Anis
@Hun1Ahpu: (continued) This regex captured word.exe perfectly, and If(a>b?1,0) was captured as "If(a>b?1,0)." because of the terminating full stop (.). I can manually traverse each word and strip a trailing comma, full stop, exclamation mark, bracket etc., but if that can be handled by the regex it would be 100% perfect for me. Thanks for the 'so far' best answer!
Mehdi Anis
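The trailing punctuation described in these comments can be left out by the regex itself. The following is only a sketch, not the answerer's pattern, and it assumes that `.`, `,`, `!`, `?`, `;` and `:` are the only characters to drop from the end of a word. It lazily matches a run of non-whitespace that starts after whitespace (or at the start of the input) and stops as soon as nothing but those punctuation marks remains before the next whitespace.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TrimTrailingPunctuationDemo
    Sub Main()
        ' (?<!\S)             the match must start at the beginning of a token
        ' \S+?                as few non-whitespace characters as possible
        ' (?=[.,!?;:]*(?!\S)) ...followed only by optional punctuation and then
        '                     whitespace or the end of the input
        Dim parser As New Regex("(?<!\S)\S+?(?=[.,!?;:]*(?!\S))")
        Dim sample As String = "I like mango, orange and banana. The file is word.exe!"

        For Each m As Match In parser.Matches(sample)
            Console.WriteLine(m.Value)
        Next
        ' Prints, one per line:
        ' I, like, mango, orange, and, banana, The, file, is, word.exe
    End Sub
End Module
```

Punctuation inside a token, such as the dot in word.exe or the `?` and `,` in If(a>b?1,0), is kept because it is followed by more non-whitespace. A punctuation mark standing on its own between spaces is still returned as a token, so it may need to be filtered out afterwards.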
`[\w(?!\S)\.]` is a character class that matches any one character that is: a word character (`\w`); a non-whitespace character (`\S`); or one of `(`, `?`, `!`, `)`, or `.`. If this regex works at all for you, @Mehdi, it's by accident; you'll get exactly the same results if you use `\S+`.
Alan Moore
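The equivalence described in this comment can be checked directly. This is a small sketch; the helper name `Tokens` and the sample string are made up for the demonstration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CharacterClassDemo
    ' Hypothetical helper: joins all matches of a pattern with "|".
    Function Tokens(pattern As String, input As String) As String
        Dim parts As New List(Of String)
        For Each m As Match In Regex.Matches(input, pattern)
            parts.Add(m.Value)
        Next
        Return String.Join("|", parts.ToArray())
    End Function

    Sub Main()
        Dim sample As String = "is this a c# statment If(a>b?1,0) ?"

        ' Inside [...], "(?!" is not a lookahead: the characters are literals,
        ' and \S already covers every non-whitespace character.
        Dim fromClass As String = Tokens("[\w(?!\S)\.]+", sample)
        Dim fromNonSpace As String = Tokens("\S+", sample)

        Console.WriteLine(fromClass)
        Console.WriteLine(fromClass = fromNonSpace) ' Prints True
    End Sub
End Module
```

Both patterns return is|this|a|c#|statment|If(a>b?1,0)|? for this input.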
@Alan Moore: Yes, you are right. \S+ gives the same result as [\w(?!\S)\.]+. I will accept this answer as my solution, as it is the closest to what I need.
Mehdi Anis
A: 

Not a regular expression as such, but you could just do something like:

Dim words() As String = myString.Replace(". ", " ").Split(" "c)

(Code written from memory so probably won't compile exactly like that)

Edit: Realised that the code could be simplified.

ho1
@ho: Your solution doesn't cover sentences ending with "?" or "!". I will use Replace + Split as a last resort.
Mehdi Anis
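One way to extend the Replace + Split idea so that "?" and "!" are covered as well is to split on spaces first and then trim punctuation from the end of each piece. This is a sketch under the assumption that `.`, `,`, `!` and `?` are the only characters to remove; it is not the answerer's code.

```vbnet
Imports System

Module SplitAndTrimDemo
    Sub Main()
        Dim myString As String = "Is it word.exe? Yes, it is!"

        ' Split on spaces only, dropping empty entries caused by double spaces.
        Dim rawWords() As String =
            myString.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)

        For Each raw As String In rawWords
            ' TrimEnd only touches the end of the token,
            ' so the dot inside "word.exe" is left alone.
            Dim word As String = raw.TrimEnd("."c, ","c, "!"c, "?"c)
            If word.Length > 0 Then Console.WriteLine(word)
        Next
        ' Prints, one per line: Is, it, word.exe, Yes, it, is
    End Sub
End Module
```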
A: 

This expression has pretty good (although not perfect) results based on Expresso's default sample text:

((?:\w+[.\-!?#'])*\w+)(?=\s)
Damian Powell
This regex didn't capture word.exe, and I want word.exe as a word. It took 'statment' as the last word and doesn't include anything after that, so the If(a>b?1,0) part is ignored completely. I still want that part as a word. Thanks for the post.
Mehdi Anis
Hmmm. Sounds like I need to try harder!
Damian Powell
A: 

I tried to post my code in the comment section, but it was too long for that. I am answering my own question, but the answer really came from Hun1Ahpu and Alan Moore.

Here is the code I use to strip a trailing punctuation mark from each word.

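' Class-level field: characters treated as trailing punctuation.
Private mstrPunctuations As String = ",.'""`!@#$%^&*()_-+=?"

' Inside the counting method: every run of non-whitespace is a candidate word.
Dim parser As New Regex("\S+")
Dim matches As MatchCollection = parser.Matches(CleanedSource)
Me.mintWordCount = matches.Count
For Each Word As Match In matches
    ' Check whether the word ends with one of the punctuation characters.
    Dim NeedChange As Boolean = False
    For Each aChar As Char In Me.mstrPunctuations
        If Word.Value.EndsWith(aChar) Then
            NeedChange = True
            Exit For
        End If
    Next
    If NeedChange Then
        ' Drop the single trailing punctuation character.
        SetStringStat(Word.Value.Substring(0, Word.Value.Length - 1))
    Else
        SetStringStat(Word.Value)
    End If
Next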
Mehdi Anis
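For the counting step itself, here is a self-contained sketch. `SetStringStat` is not shown in the post, so the `CountWord` routine below is a hypothetical stand-in, and counting case-insensitively is an assumption. It also uses `TrimEnd`, which removes every trailing punctuation character rather than just one.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module WordCountDemo
    ' Hypothetical stand-in for SetStringStat: word -> number of occurrences.
    ' Case-insensitive comparison is an assumption, not taken from the post.
    Private ReadOnly counts As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)

    Sub CountWord(word As String)
        Dim current As Integer = 0
        counts.TryGetValue(word, current) ' current stays 0 for a new word
        counts(word) = current + 1
    End Sub

    Sub Main()
        ' Same punctuation list as in the code above.
        Dim punctuation As Char() = ",.'""`!@#$%^&*()_-+=?".ToCharArray()
        Dim source As String = "Word or word.exe? Run word.exe, then close Word."

        For Each m As Match In Regex.Matches(source, "\S+")
            ' Strip all trailing punctuation; skip tokens that were only punctuation.
            Dim word As String = m.Value.TrimEnd(punctuation)
            If word.Length > 0 Then CountWord(word)
        Next

        For Each pair As KeyValuePair(Of String, Integer) In counts
            Console.WriteLine(pair.Key & ": " & pair.Value.ToString())
        Next
    End Sub
End Module
```

With this input, "Word" and "word.exe" are each counted twice and every other word once. Because the punctuation list contains `)`, a token such as If(a>b?1,0) would lose its closing bracket, so the list may need adjusting for code-like words.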