I'm really tempted to drop RegexKit (or my own libpcre wrapper) into my project in order to do this, but before I do that I want to know how Cocoa developers manage to do half of this basic stuff without really convoluted code or without linking with RegexKit or another regular expression library.

I find it gobsmacking that Cocoa does not include any regular expression matching features. I've become so accustomed to using regular expressions for all kinds of things that I'm lost without them. I can do what I need without them, but the code would be rather convoluted. So, Cocoa devs, I ask you, what's the "Cocoa way" to do this...

The problem is an everyday problem in programming as far as I'm concerned. Cocoa must have ways of doing this with the built-in features. Note that the position of the elements I want to match changes, and sometimes "quotes" are present. Whitespace is variable.

Take the following strings:

Content-Type: application/xml; charset=utf-8

Content-Type: text/html; charset="iso-8859-1"

Content-Type: text/plain;
 charset=us-ascii

Content-Type: text/plain; name="example.txt"; charset=utf-8

From all of these strings, how would you go about determining the mime type (e.g. text/plain) and the charset (e.g. utf-8) using just the built-in Cocoa classes?

I'd end up performing a series of -rangeOfString: and substring calls, with conditional checks to deal with the optional quotes etc. Is there a way to do this with NSScanner? The NSScanner class seems to have a pretty naive API to me.

Something like C's sscanf() that works for NSString objects would be an ideal fit. Most of my string parsing needs are as simple as this example, so maybe regular expressions, while I'm accustomed to them, are overkill?

EDIT | The code is a bit long-winded, but it turns out NSScanner is actually quite easy to work with. It basically walks along your string doing as you tell it. The most annoying part is creating the NSCharacterSet instances it needs.

- (void)testNSScannerUseCase {
  NSString *testString = @"Content-type: application/xml; name=\"test\";\n charset=\"utf-8\"";

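  unsigned int a = 'a', upperA = 'A', zero = '0';

  // There's probably a quicker way than to make these character sets this way
  NSMutableCharacterSet *alphaNumSet = [NSMutableCharacterSet characterSetWithRange:NSMakeRange(a, 26)];
  // Also accept A-Z so that values such as "UTF-8" are scanned in full
  [alphaNumSet addCharactersInRange:NSMakeRange(upperA, 26)];
  [alphaNumSet addCharactersInRange:NSMakeRange(zero, 10)];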

  NSMutableCharacterSet *mimeTypeSet = [NSMutableCharacterSet characterSetWithCharactersInString:@"/-"];
  [mimeTypeSet formUnionWithCharacterSet:alphaNumSet];

  NSMutableCharacterSet *charsetSet = [NSMutableCharacterSet characterSetWithCharactersInString:@"-"];
  [charsetSet formUnionWithCharacterSet:alphaNumSet];

  // Initialize a case-insensitive scanner
  NSScanner *scanner = [NSScanner scannerWithString:testString];
  [scanner setCaseSensitive:NO];

  // Prepare to capture mime-type
  NSString *mimeType = nil;

  // Skip past the Content-Type: section
  if ([scanner scanUpToString:@":" intoString:NULL] && [scanner scanString:@":" intoString:NULL]) {
    [scanner scanCharactersFromSet:mimeTypeSet intoString:&mimeType];
  }

  GHAssertEqualStrings(@"application/xml", mimeType, @"Mime-type should be application/xml");

  // Prepare to look for the charset attribute
  NSString *charset = nil;

  // Ignore quotes as well as whitespace
  [scanner setCharactersToBeSkipped:[NSCharacterSet characterSetWithCharactersInString:@"\r\n\t \""]];

  // Skip past the charset attribute declaration
  if ([scanner scanUpToString:@"charset=" intoString:NULL]
    && [scanner scanString:@"charset=" intoString:NULL]) {

    [scanner scanCharactersFromSet:charsetSet intoString:&charset];
  }

  GHAssertEqualStrings(@"utf-8", charset, @"Charset should be utf-8");
}

This could be made a little smarter by using a while loop reading up to ";" then checking to see if it's the attribute I'm scanning for.
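A minimal sketch of that loop, under some assumptions: the helper name ParameterValue is hypothetical, quotes are simply skipped rather than parsed, and a ";" inside a quoted value is not handled.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Cocoa): returns the value of the named
// attribute (e.g. @"charset") from a Content-Type line, or nil if absent.
static NSString *ParameterValue(NSString *header, NSString *wanted) {
    NSScanner *scanner = [NSScanner scannerWithString:header];
    // Skip quotes as well as whitespace between tokens.
    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\r\n\t \""]];

    NSCharacterSet *nameEnd =
        [NSCharacterSet characterSetWithCharactersInString:@"=;"];
    NSCharacterSet *valueEnd =
        [NSCharacterSet characterSetWithCharactersInString:@";\"\r\n\t "];

    while (![scanner isAtEnd]) {
        // Move past the next ";" (stop if there is none left).
        [scanner scanUpToString:@";" intoString:NULL];
        if (![scanner scanString:@";" intoString:NULL]) break;

        // Read "name=value" and compare the name case-insensitively.
        NSString *name = nil, *value = nil;
        if ([scanner scanUpToCharactersFromSet:nameEnd intoString:&name]
            && [scanner scanString:@"=" intoString:NULL]
            && [scanner scanUpToCharactersFromSet:valueEnd intoString:&value]
            && [name caseInsensitiveCompare:wanted] == NSOrderedSame) {
            return value;
        }
    }
    return nil;
}
```

With the test string from the code above, ParameterValue(testString, @"charset") should skip the name attribute and return the charset value, regardless of the order in which the attributes appear.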

I dare say it benchmarks faster than using a regex and that my rather long code can be refactored down to something much smaller.

+2  A: 

I think you should go with your initial instinct. Use RegexKitLite. It's very small and simple to add to the project.

Another option, if this is for iPhone or iPad on iPhone OS 3.2 or later: you can use the new NSRegularExpressionSearch option with -rangeOfString:options:.
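As a rough sketch of that option: NSRegularExpressionSearch is one of the NSStringCompareOptions flags accepted by NSString's -rangeOfString:options:, on SDKs that provide it. The helper name CharsetByRegexSearch and both patterns are illustrative assumptions, and only the charset is extracted here.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: finds the charset value using only NSString searches.
static NSString *CharsetByRegexSearch(NSString *header) {
    NSStringCompareOptions opts =
        NSRegularExpressionSearch | NSCaseInsensitiveSearch;

    // Locate "charset=" together with an optional opening quote.
    NSRange key = [header rangeOfString:@"charset\\s*=\\s*\"?" options:opts];
    if (key.location == NSNotFound) return nil;

    // The value is the run of letters, digits and hyphens that follows.
    NSString *rest = [header substringFromIndex:NSMaxRange(key)];
    NSRange value = [rest rangeOfString:@"[a-z0-9\\-]+"
                                options:opts | NSAnchoredSearch];
    if (value.location == NSNotFound) return nil;
    return [rest substringWithRange:value];
}
```

The search only returns a range, not capture groups, which is why the sketch needs two passes: one to find the key and one anchored match for the value.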

If I weren't going to use regular expressions, however, I would have a series of indexOf, rangeOf and substring calls. It'd probably only be half a dozen lines, but still not as simple and pretty as regular expressions.
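For the mime type alone, such a series of calls might look like the following sketch. The helper name MimeTypeBySubstring is hypothetical, and the code assumes the mime type is everything between the first ":" and the first ";".

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: extracts the mime type with range and substring calls.
static NSString *MimeTypeBySubstring(NSString *header) {
    // Everything after the first colon is the header value.
    NSRange colon = [header rangeOfString:@":"];
    if (colon.location == NSNotFound) return nil;
    NSString *rest = [header substringFromIndex:NSMaxRange(colon)];

    // Cut the value at the first semicolon, if there is one.
    NSRange semicolon = [rest rangeOfString:@";"];
    if (semicolon.location != NSNotFound) {
        rest = [rest substringToIndex:semicolon.location];
    }
    return [rest stringByTrimmingCharactersInSet:
        [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
```

Extracting the charset the same way needs more conditionals (optional quotes, attribute order), which is where this approach starts to get untidy.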

Cory Kilger
Thanks, but surely picking chunks out of strings is such an everyday task that Apple expect programmers to be able to do it cleanly without third-party frameworks? I'm guessing Apple don't rely on projects like RegexKit in their own code because (unless the problem is complex) there are already Cocoa ways to do this?
d11wtq
Actually, I think Apple does use RegexKit in at least some of their shipped code. However, there is a new `NSRegularExpression` class in iPhone OS 4.0 and I don't imagine it will be too long before it turns up in Mac OS X also. I agree that it's a major hole in the framework.
Rob Keniger
Thanks, since my regex needs are very minimal, I'd rather not wildly go bringing in an extra dependency (this is an open source project I want to keep relatively self-contained). I've been playing around with NSScanner and I'm discovering that this little beast is a lot more powerful than I first thought. I'll post a solution using NSScanner once I've played a bit more. If I start hitting more complex pattern-matching needs I'll definitely bring in an external framework. NSScanner is probably faster in any case.
d11wtq
+1  A: 

If these are HTTP Content-Type headers, technically, the second one is illegal according to my reading of RFC2616. You don't quote character set names. Having said that, you can't control your input and if you are getting them, you need to deal with them.

Anyway, assuming we are talking about HTTP headers, I'd be tempted to write a proper parser even if I did have a regex library to hand. Assuming you want to be a bit lazy, without a regex library or a parser, you need to do something like this:

  • Strip "Content-Type:".
  • Use -componentsSeparatedByString: to split at semicolons.

The mime type is the first part, trimmed of leading and trailing white space.

Now comes the tricky part. Iterate through each of the remaining components.

  • for the part you are on, make sure the semicolon you split on was not embedded in a quoted string. The easiest way to do this is to count the number of unescaped double quote characters and make sure it is zero or two. If you did split on a quoted semicolon, join the next component back on and repeat
  • split at the = sign
  • if the first part is charset (case insensitive), you have found the one you are looking for. The second part is the actual character set - strip white space and enclosing double quotes.
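The steps above could be sketched roughly as follows. The function names are hypothetical, the quote check treats any odd count as "split inside a quoted string", and only backslash-escaped quotes are ignored, so treat this as an illustration rather than a full header parser.

```objective-c
#import <Foundation/Foundation.h>

// Counts double quotes that are not preceded by a backslash escape.
static NSUInteger UnescapedQuoteCount(NSString *s) {
    NSUInteger count = 0;
    for (NSUInteger i = 0; i < [s length]; i++) {
        unichar c = [s characterAtIndex:i];
        if (c == '\\') { i++; continue; }  // skip the escaped character
        if (c == '"') count++;
    }
    return count;
}

// Hypothetical helper: returns the charset attribute, or nil if absent.
static NSString *CharsetBySplitting(NSString *header) {
    // Strip the "Content-Type:" prefix, then split at semicolons.
    NSRange colon = [header rangeOfString:@":"];
    NSString *body = (colon.location == NSNotFound)
        ? header : [header substringFromIndex:NSMaxRange(colon)];
    NSArray *parts = [body componentsSeparatedByString:@";"];

    NSCharacterSet *space = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSCharacterSet *spaceAndQuotes =
        [NSCharacterSet characterSetWithCharactersInString:@" \t\r\n\""];

    // Index 0 is the mime type; the attributes start at index 1.
    for (NSUInteger i = 1; i < [parts count]; i++) {
        NSString *part = [parts objectAtIndex:i];
        // An odd quote count means we split inside a quoted string: rejoin.
        while (UnescapedQuoteCount(part) % 2 == 1 && i + 1 < [parts count]) {
            i++;
            part = [NSString stringWithFormat:@"%@;%@",
                                              part, [parts objectAtIndex:i]];
        }
        // Split at the first "=" and compare the name case-insensitively.
        NSRange eq = [part rangeOfString:@"="];
        if (eq.location == NSNotFound) continue;
        NSString *name = [[part substringToIndex:eq.location]
            stringByTrimmingCharactersInSet:space];
        if ([name caseInsensitiveCompare:@"charset"] == NSOrderedSame) {
            return [[part substringFromIndex:NSMaxRange(eq)]
                stringByTrimmingCharactersInSet:spaceAndQuotes];
        }
    }
    return nil;
}
```

The mime type itself would be the element at index 0 of the split, trimmed of white space in the same way.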

The above is quite complex and there are probably edge cases it fails on, but then any regular expression you create to do the same will also be complex, have edge case failures, be unreadable and impossible to debug with the Xcode debugger.

JeremyP
You're right, the quotes should not be there, but unfortunately sometimes they are. I've edited my post and added an example using NSScanner. This feels a little less clumsy. In theory you could lexically analyze a string pretty well with the scanner. Much more long-winded than a quick regex, granted.
d11wtq
It doesn't look long-winded to me. A third of your code just looks like setting up the character sets. NB there is a class method +alphanumericCharacterSet for alphanumerics, so you don't need to construct your own.
JeremyP
+alphanumericCharacterSet allows Unicode characters with diacritic marks etc., which made me decide to create my own, but in hindsight there's never going to be a scenario where it would pose an issue so I should just ditch that, you're right ;)
d11wtq