ansaurus

Question

Answer 1

+1 A:

Encoding might be getting in your way. Try with \u0093 and \u0094 instead.

badp 2009-07-12 08:40:42

Thanks, I suspected something like this. Let me look at the link.

Alex Baranosky 2009-07-12 08:42:48

I just tried it with string pattern2 = "(?<lrquotePair>\u0093[^/\"“]*\u0094)"; it didn't seem to work, but hopefully I'm just tired... Does that regex look correct to you?

Alex Baranosky 2009-07-12 08:54:26

Sometimes I really wish they'd made Unicode less mind-warping...

Matthew Scharley 2009-07-12 09:04:26

Try using "(?<lrquotePair>\u201c[^/\"\u201c\u201d]*\u201d)" instead, since those are the code points listed at http://en.wikipedia.org/wiki/Quotation_mark_glyphs#Quotation_marks_in_electronic_documents

cobbal 2009-07-12 09:08:28
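
A minimal sketch of the pattern from the comment above, written as a small C# program. The class and method names (CurlyQuoteDemo, FirstQuotedSpan) are made up for illustration and are not from the original question. It assumes the input string really does contain U+201C and U+201D; if the text was decoded with the wrong encoding, the pattern will still find nothing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CurlyQuoteDemo
{
    // \u201C and \u201D are the left and right double quotation marks.
    // Writing them as escapes keeps the pattern independent of the
    // encoding the source file is saved in.
    public const string Pattern =
        "(?<lrquotePair>\u201C[^/\"\u201C\u201D]*\u201D)";

    // Hypothetical helper: returns the first curly-quoted span, or null.
    public static string FirstQuotedSpan(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Groups["lrquotePair"].Value : null;
    }

    public static void Main()
    {
        // Plain quoted text: the named group captures the quoted span.
        Console.WriteLine(FirstQuotedSpan("He said \u201Chello\u201D.") != null);

        // The character class excludes '/', so this input does not match.
        Console.WriteLine(FirstQuotedSpan("\u201Cfoo/bar\u201D") != null);
    }
}
```

With these inputs the first call finds a match and the second does not, because the character class deliberately rejects a slash between the quotes.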

Answer 2

A:

There's nothing wrong with your second regex. Are you sure the input string is correct? The characters you're trying to match are not plain ASCII, so maybe there's a problem with a character encoding mismatch.

Philippe Leybaert 2009-07-12 08:40:53

Answer 3

+3 A:

Your patterns are more complicated than how you describe them - for example, the first one won't match "foo/bar", and the second one won't match “foo/bar” or “foo"bar”. Perhaps your input falls into one of those categories?

If there is an encoding problem, it's not with the regex: .NET regexes support Unicode just fine. But it might be that you didn't read the text in the correct encoding in the first place. Try printing it out and checking that the fancy “” quotes are still there. In particular, if you use the StreamReader class with a single-argument constructor (or the File.OpenText helper), it defaults to UTF-8 for input, which might not be what you actually have there.

Pavel Minaev 2009-07-12 08:46:29
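
To make the check described in this answer concrete, here is a rough sketch that reads the same bytes twice: once through the single-argument StreamReader constructor and once with an explicit encoding. The names EncodingCheck, ReadAll and HasCurlyQuotes are hypothetical, and the temp file uses UTF-16 without a byte order mark purely as a stand-in for "not UTF-8"; the asker's real files use a different encoding.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingCheck
{
    // Decodes a file with the encoding the caller names instead of
    // relying on StreamReader's UTF-8 default.
    public static string ReadAll(string path, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // True if the decoded text still contains a curly double quote.
    public static bool HasCurlyQuotes(string text)
    {
        return text.IndexOf('\u201C') >= 0 || text.IndexOf('\u201D') >= 0;
    }

    public static void Main()
    {
        // Stand-in input: UTF-16 bytes with no byte order mark.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, Encoding.Unicode.GetBytes("\u201Cfoo\u201D"));

            // Single-argument constructor: decoded as UTF-8, quotes are lost.
            string guessed;
            using (var reader = new StreamReader(path))
            {
                guessed = reader.ReadToEnd();
            }
            Console.WriteLine(HasCurlyQuotes(guessed));

            // Explicit encoding: the quotes survive decoding.
            Console.WriteLine(HasCurlyQuotes(ReadAll(path, Encoding.Unicode)));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

This prints False and then True. If a check like HasCurlyQuotes fails on your real input, the regex never had a chance, and the fix belongs where the text is read.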

Thanks Pavel. Yeah, I meant for them to have those particular details in them. That isn't the problem. I can take the same piece of text, and merely switch " to “ and " to ”, and suddenly it won't match.

Alex Baranosky 2009-07-12 08:51:27

I'm pretty tired. I seem to have missed the second paragraph. I think this may be my problem. I am using HtmlAgilityPack to output to a StringWriter. Now I understand why it is turning those characters into gobbledygook when it prints out.

Alex Baranosky 2009-07-12 09:13:01

The web pages I am working with declare charset="ISO-8859-1", while StringWriter's encoding is UnicodeEncoding. I am reading the files into HtmlAgilityPack, then outputting them to a StringWriter. Could this be the problem? How might I rectify it?

Alex Baranosky 2009-07-12 09:25:20

When I check the HtmlAgilityPack for the encoding it gives me: SBCSCodePageEncoding

Alex Baranosky 2009-07-12 09:44:21

Anyone out there know how I can get around this?

Alex Baranosky 2009-07-12 09:57:46

"SBCS" is simply "single-byte character set". It looks like it's a generic class covering all encodings that are based on Windows codepages. What does it give you if you query the encoding's CodePage property?

Pavel Minaev 2009-07-12 10:33:45

1252 is the CodePage property value

Alex Baranosky 2009-07-12 11:44:55

Oops, I had it messed up. The real encoding was Latin1, CodePage: 28591

Alex Baranosky 2009-07-12 11:51:11
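
One possible explanation that fits the code page values reported above, offered as an assumption rather than a diagnosis: pages that declare ISO-8859-1 often actually contain Windows-1252 bytes. In Windows-1252 the bytes 0x93 and 0x94 are the curly quotes, but in Latin1 (code page 28591) the same bytes decode to the control characters U+0093 and U+0094, which would also explain the \u0093 and \u0094 suggestion in the first answer. The sketch below compares the two decodings; QuoteBytes and DecodeByte are hypothetical names. It assumes .NET 5 or later, where CodePagesEncodingProvider is available; on .NET Framework code page 1252 is built in and the static constructor can be dropped, and on other runtimes the System.Text.Encoding.CodePages package may be needed.

```csharp
using System;
using System.Text;

public static class QuoteBytes
{
    static QuoteBytes()
    {
        // Assumption: running on .NET 5+, where code page 1252 is only
        // available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes a single byte with the given code page and returns the
    // code point of the resulting character.
    public static int DecodeByte(byte b, int codePage)
    {
        Encoding enc = Encoding.GetEncoding(codePage);
        return enc.GetString(new[] { b })[0];
    }

    public static void Main()
    {
        // Latin1 (28591) maps byte 0x93 straight to U+0093.
        Console.WriteLine("{0:X4}", DecodeByte(0x93, 28591));

        // Windows-1252 maps the same byte to U+201C.
        Console.WriteLine("{0:X4}", DecodeByte(0x93, 1252));

        // And 0x94 to U+201D.
        Console.WriteLine("{0:X4}", DecodeByte(0x94, 1252));
    }
}
```

This prints 0093, 201C and 201D. If that is what is happening here, decoding the pages with Encoding.GetEncoding(1252) instead of the declared Latin1 should bring the quotes back. As far as I know, HtmlAgilityPack's HtmlDocument.Load has overloads that accept an Encoding, but check that against the version you are using.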

Capturing “xxxxxxxxxx”
