I am using RegexKitLite and I'm trying to match a pattern.
The following regex patterns do not capture my word, which includes an N with a tilde (ñ). Is there a string conversion I am missing?

NSString *subjectString = @"define_añadir";
NSString *regexString;
//regexString = @"^define_(.*)"; // this pattern does not match, so I assumed I had to add the ñ explicitly
//regexString = @"^define_([.ñ]*)"; // tried this first, with a character class
regexString = @"^define_((?:\\w|ñ)*)"; // tried this second

NSString *captured = [subjectString stringByMatching:regexString capture:1L];
// I want captured to be @"añadir"
A: 

Looks like an encoding problem to me. Either you're saving the source code in an encoding that can't handle that character (like ASCII), or the compiler is using the wrong encoding to read the source files. Going back to the original regex, try creating the subject string like this:

subjectString = @"define_a\xC3\xB1" @"adir"; // literal split so \xB1 is not read together with the hex digits "ad"

or this:

subjectString = @"define_a\u00F1adir";

If that works, check the encoding of your source code files and make sure it's the same encoding the compiler expects.

EDIT: I've never worked with the iPhone technology stack, but according to this doc you should be using the stringWithUTF8String method to create the NSString, not the @"" literal syntax. In fact, it says you should never use non-ASCII characters (that is, anything not in the range 0x00..0x7F) in your code; that way you never have to worry about the source file's encoding. That's good advice no matter what language or toolset you're using.

Alan Moore
Correction: the example I posted does work. I simplified my code to keep it easy to read, but I may have more clues. My .m source file is UTF-8; I checked with the Unix command `file`. The string values are actually read from HTML files, which are also UTF-8. I printed the file contents with NSLog, and it shows "xn--define_aadir-hhb" where I expected "define_añadir" to be read from the HTML into subjectString. Where can I check the encoding the compiler expects, as you mentioned, Alan? Also, not all of my source files are UTF-8; some are ASCII. Could this be a problem?
ojreadmore
ASCII is a subset of UTF-8, so every ASCII file is also a UTF-8 file. As for the rest, see my edit.
Alan Moore