This is a pretty simple question but I'm somewhat stumped.

I am capturing sections of text that match "xxxxxxxxxx". It works fine.

string pattern = "(?<quotePair>\"[^/\"]*\")";

Now I want to make a new pattern to capture “xxxxxxxxxx”... I used:

string pattern2 = "(?<lrquotePair>“[^/\"“]*”)";

For some reason the second pattern won't catch anything. What am I missing?

+1  A: 

Encoding might be getting in your way. Try with \u0093 and \u0094 instead.

badp
Thanks, I suspected something like this. Let me look at the link.
Alex Baranosky
I just tried it with string pattern2 = "(?<lrquotePair>\u0093[^/\"“]*\u0094)"; It didn't seem to work, but hopefully I'm just tired... Does that regex look correct to you?
Alex Baranosky
Sometimes I really wish they'd made Unicode less mind-warping...
Matthew Scharley
Try using "(?<lrquotePair>\u201c[^/\"\u201c\u201d]*\u201d)", since those are the code points listed at http://en.wikipedia.org/wiki/Quotation_mark_glyphs#Quotation_marks_in_electronic_documents
cobbal
A: 

There's nothing wrong with your second regex. Are you sure the input string is correct? The characters you're trying to match are not plain ASCII, so maybe there's a problem with a character encoding mismatch.

Philippe Leybaert
+3  A: 

Your patterns are more complicated than how you describe them - for example, the first one won't match "foo/bar", and the second one won't match “foo/bar” or “foo"bar”. Perhaps your input falls into one of those categories?

If there is an encoding problem, it's not with the regex - .NET regexes support Unicode just fine. But it might be that you didn't read the text in the correct encoding in the first place - try printing it out and check that the fancy “” quotes are still there. In particular, if you use the StreamReader class with a single-argument constructor (or the File.OpenText helper), it defaults to UTF-8 encoding for input, which might not be what you actually have there.
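One way to do that check without depending on how the console renders the characters is to print the code points of everything outside ASCII. This is only a rough sketch; the QuoteCheck class and DescribeNonAscii helper are made-up names for illustration. Curly quotes that survived should show up as U+201C and U+201D. Seeing U+0093 and U+0094 instead would suggest the bytes were decoded as Latin-1, and U+FFFD would suggest they were decoded as UTF-8.

```csharp
using System;
using System.Linq;

static class QuoteCheck
{
    // Hypothetical helper: lists every non-ASCII character in the text as U+XXXX.
    public static string DescribeNonAscii(string text)
    {
        return string.Join(" ", text.Where(c => c > 127)
                                    .Select(c => "U+" + ((int)c).ToString("X4")));
    }

    static void Main()
    {
        // Real curly quotes, as the regex expects them.
        Console.WriteLine(DescribeNonAscii("\u201Cfoo\u201D")); // U+201C U+201D

        // What the same quotes look like after a Latin-1 decode of bytes 0x93 / 0x94.
        Console.WriteLine(DescribeNonAscii("\u0093foo\u0094")); // U+0093 U+0094
    }
}
```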

Pavel Minaev
Thanks Pavel. Yeah, I meant for them to have those particular details in them. That isn't the problem. I can take the same piece of text, and merely switch " to “ and " to ”, and suddenly it won't match.
Alex Baranosky
I'm pretty tired. I seem to have missed the second paragraph. I think this may be my problem. I am using HtmlAgilityPack to output to a StringWriter. Now I understand why it is turning those characters into gobbledygook when it prints out.
Alex Baranosky
The web pages I am working with are charset="ISO-8859-1", StringWriter's encoding is UnicodeEncoding. I am reading the files into HtmlAgilityPack, then outputting them to a StringWriter. Could this be the problem? How might I rectify it?
Alex Baranosky
When I check the HtmlAgilityPack for the encoding it gives me: SBCSCodePageEncoding
Alex Baranosky
Does anyone out there know how I can get around this?
Alex Baranosky
"SBCS" is simply "single-byte character set". It looks like it's a generic class covering all encodings that are based on Windows codepages. What does it give you if you query the encoding's CodePage property?
Pavel Minaev
1252 is the CodePage property value
Alex Baranosky
Oops, I had it messed up. The real encoding was Latin1, CodePage: 28591
Alex Baranosky
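A possible explanation for that last finding: in ISO-8859-1 (code page 28591) the bytes 0x93 and 0x94 decode to the control characters U+0093 and U+0094, while in Windows-1252 the same bytes decode to the curly quotes U+201C and U+201D. Pages that declare ISO-8859-1 but contain curly quotes are very often really Windows-1252. The sketch below assumes that is the case here; the byte array is invented sample input and the CurlyQuoteDemo / MatchesWhenDecodedAs names are hypothetical. It decodes the same bytes both ways and runs the \u201c...\u201d pattern from the comments above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CurlyQuoteDemo
{
    // The \u201c ... \u201d pattern suggested in the comments above.
    static readonly Regex Pattern =
        new Regex("(?<lrquotePair>\u201c[^/\"\u201c\u201d]*\u201d)");

    static CurlyQuoteDemo()
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where code page 1252
        // is only available after this registration. On the classic .NET Framework
        // the encoding is built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes the raw bytes with the given code page, then tests the regex.
    public static bool MatchesWhenDecodedAs(int codePage, byte[] raw)
    {
        string text = Encoding.GetEncoding(codePage).GetString(raw);
        return Pattern.IsMatch(text);
    }

    static void Main()
    {
        // Invented sample: the bytes of “foo” as a Windows-1252 file would store them.
        byte[] raw = { 0x93, 0x66, 0x6F, 0x6F, 0x94 };

        Console.WriteLine(MatchesWhenDecodedAs(28591, raw)); // False
        Console.WriteLine(MatchesWhenDecodedAs(1252, raw));  // True
    }
}
```

If this is what is happening, reading the files with Encoding.GetEncoding(1252) rather than the declared Latin1 encoding, for example through the StreamReader constructor that takes an Encoding, should leave the curly quotes intact for the regex.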