So in the end (after a few days of debugging) I found the problem. It isn't in the regex at all :/ . It seems that I was trimming extra whitespace with

input = Regex.Replace(input, "\\s+", " ");

so all newlines were replaced with " ". Stupid! Moderator, please remove this if unnecessary!
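
To illustrate the cause: `\s` matches line breaks as well as spaces and tabs, so the replacement above turns "\n\n" into a space before the tokenizer ever sees it. Below is a minimal sketch, assuming the intent was only to squeeze runs of spaces and tabs. The class name, the method names, and the alternative pattern `[^\S\r\n]+` are illustrative and not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Collapses every run of whitespace, line breaks included, into one space.
    public static string CollapseAll(string input) =>
        Regex.Replace(input, @"\s+", " ");

    // Collapses runs of whitespace other than \r and \n, so line breaks survive.
    public static string CollapseKeepNewlines(string input) =>
        Regex.Replace(input, @"[^\S\r\n]+", " ");

    public static void Main()
    {
        string input = "a  b\n\nc";
        Console.WriteLine(CollapseAll(input) == "a b c");             // True
        Console.WriteLine(CollapseKeepNewlines(input) == "a b\n\nc"); // True
    }
}
```

The second pattern is one possible way to keep "\n" available for the punctuation group; whether it fits depends on what the rest of the pipeline expects.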

I have a regex for tokenizing some text, and it looks like this:

"(?<html>Ç)|
(?<number>\\d+(?:[.]\\d+)?(?=[][ \f\n\r\t\v!?.,():;\"'„Ç]|$))|
(?<other>(?:[^][Ç \f\n\r\t\v!?.,():;\"'„A-Za-zčćšđžČĆŠĐŽäöÖü][^ Ç\f\n\r\t\vA-Za-zčćšđžČĆŠĐŽäöÖü]*)?[^][ Ç\f\n\r\t\v!?.,():;\"'„A-Za-zčćšđžČĆŠĐŽäöÖü](?=[][!?.,():;\"'„]*(?:$|[ Ç\f\n\r\t\v])))|
(?<word>(?:[^][ Ç\f\n\r\t\v!?.,():;\"'„][^ Ç\f\n\r\t\v]*)?[^][ Ç\f\n\r\t\v!?.,():;\"'„])|
(?<punctuation>[][ \f\n\r\t\v!?.,():;\"'„])"

The problem is in this part: (?<punctuation>[][ \f\n\r\t\v!?.,():;\"'„]). When I parse text with the input "\n\n", the punctuation group matches " ", " " (in other words, a space and a space), and I don't know why.

+4  A: 

I could be wrong, but you need to pass the pattern to the Regex as a string, which means you need to escape the backslashes.

... (?=[][ \\f\\n\\r\\t\\v!?.,():;\\" ...

Otherwise C# will replace \n with a line break inside the regex pattern.

Edit: It's also possible to use verbatim string literals, but they need to be marked with a leading @ (see Martin's answer).
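
As a rough illustration of the three ways to write the pattern, the sketch below compares a regular string, an escaped string, and a verbatim string for a tiny character class. The class name and the one-element pattern are made up for the example; the point is only what the regex engine receives in each case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    public const string Regular  = "[\n]";   // C# turns \n into a real line feed: 3 chars
    public const string Escaped  = "[\\n]";  // backslash + n reach the regex engine: 4 chars
    public const string Verbatim = @"[\n]";  // same 4 chars as the escaped form

    public static void Main()
    {
        Console.WriteLine(Regular.Length);       // 3
        Console.WriteLine(Escaped == Verbatim);  // True

        // All three match a line feed, and none of them matches a space.
        foreach (string p in new[] { Regular, Escaped, Verbatim })
        {
            Console.WriteLine(Regex.IsMatch("\n", p) && !Regex.IsMatch(" ", p)); // True
        }
    }
}
```

Note that inside a verbatim string a double quote is written as "" rather than \", so a pattern copied from a regular string may need that one adjustment.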

Bobby
Or just use verbatim string literals: `@"..."`
Joey
Could a literal string not be used here? e.g. var regex = @"<regex>"
Mike
It's a multiline regex, so it's probably a verbatim string *already*.
Kobi
You three are right, it's also possible to use literal strings... I've edited my answer.
Bobby
+2  A: 

If you put an @ in front of the string, you can use single backslashes, and line breaks will be recognized.

 @"(?<html>Ç)|

Greetings, Martin

martin
Thanks, I'll try this.
A: 

Set RegexOptions.IgnorePatternWhitespace
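
A small sketch of what this option changes, assuming a pattern whose alternatives are written on separate lines as in the question. The two-branch pattern and the class name here are hypothetical stand-ins, not the original tokenizer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnoreWhitespaceDemo
{
    // Hypothetical two-branch tokenizer; the "\n" after "|" stands for the
    // line break that a pattern written across several lines would contain.
    public const string Pattern = "(?<number>\\d+)|\n(?<word>[a-z]+)";

    public static void Main()
    {
        // Without the option the second branch only matches after a literal "\n".
        Console.WriteLine(Regex.IsMatch("abc", Pattern)); // False

        // With the option the line break in the pattern is ignored.
        Console.WriteLine(
            Regex.IsMatch("abc", Pattern, RegexOptions.IgnorePatternWhitespace)); // True

        // White space inside a character class stays literal either way.
        Console.WriteLine(
            Regex.IsMatch(" ", "[ ]", RegexOptions.IgnorePatternWhitespace)); // True
    }
}
```

Because white space inside a character class is kept, the option should leave sets such as the punctuation group unchanged while dropping the line breaks between alternatives.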

Update:

Are you sure [^] is correct? Unless it's some kind of character group (that I have never used), it will be the same as "." (any character). The same goes for []. Perhaps I just haven't used all of regex before :p

leppie
[^]] and []] are correct! I read about how to include a literal ] in [ ], and it says that "]" must be the first literal in the [ ] group (or directly after the ^, if one is included).
RegexOptions.IgnorePatternWhitespace didn't help :/ I can't find where I read about it, but if you want to include the symbol "]" in a list of literals such as [abc], you must put it in first place (I couldn't escape that literal). So [ab]c]] is wrong, and [ab\]c] didn't work for me either. The right way is to put "]" in first place after the start of the literal group: []abc], or [^]abc] if it's a negation. This works fine for me. So in []abc] the literals are ] a b c.
Thanks unknown, didn't know that :)
leppie
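
For reference, a quick check of the bracket rule discussed in these comments, as a sketch against .NET's default (non-ECMAScript) regex behavior. The class name is illustrative. The escaped form is included as well: it is accepted when written as a verbatim string, and one possible reason it failed for the commenter is that "\]" inside a regular C# string is not a valid escape sequence.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // True when the whole input is a single character accepted by the given set.
    public static bool Accepts(string set, string input) =>
        Regex.IsMatch(input, "^" + set + "$");

    public static void Main()
    {
        // "]" placed first in the set is taken literally.
        Console.WriteLine(Accepts(@"[]abc]", "]"));   // True
        // Same rule after the negation caret: the set excludes ], a, b and c.
        Console.WriteLine(Accepts(@"[^]abc]", "]"));  // False
        Console.WriteLine(Accepts(@"[^]abc]", "x"));  // True
        // Escaped form, written as a verbatim string so C# keeps the backslash.
        Console.WriteLine(Accepts(@"[ab\]c]", "]"));  // True
    }
}
```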
A: 

In case I wasn't clear enough: I'm trying to parse some input with that regex, and in the punctuation group I get strings that are not in the input in the first place. That is not logical :/

Input "\n\n": across the MatchCollection matches, Groups["punctuation"] = {" ", " "}, ASCII values {32, 32}, so I can't figure out where that came from.

A: 

Problem solved (I edited the first post). Sorry!