I'm writing a lambda calculus interpreter for fun and practice. I got iostreams to properly tokenize identifiers by adding a ctype facet which defines punctuation as whitespace:

struct token_ctype : ctype<char> {
 mask t[ table_size ];
 token_ctype()
 : ctype<char>( t ) {
  for ( size_t tx = 0; tx < table_size; ++ tx ) {
   t[tx] = isalnum( tx )? alnum : space;
  }
 }
};

(classic_table() would probably be cleaner but that doesn't work on OS X!)
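
For comparison, here is a minimal sketch of the classic_table() variant alluded to above, assuming a standard library where that function works. The names classic_token_ctype and tokenize are made up for this illustration and are not part of the original code. Unlike the loop above, it keeps the "C" locale classification and only adds the space bit to punctuation:

```cpp
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant: copy the "C" locale table, then mark every
// punctuation character as whitespace in addition to its normal class.
struct classic_token_ctype : std::ctype<char> {
    mask t[table_size];

    classic_token_ctype() : std::ctype<char>(t) {
        const mask* classic = classic_table();
        for (std::size_t i = 0; i < table_size; ++i) {
            t[i] = classic[i];
            if (t[i] & punct) t[i] |= space;
        }
    }
};

// Hypothetical helper: split text into tokens using the facet above.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(), new classic_token_ctype));
    std::vector<std::string> out;
    for (std::string tok; in >> tok; ) out.push_back(tok);
    return out;
}
```

With this facet, tokenize("foo.bar(baz)") should produce the three tokens foo, bar, and baz.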

And then swap the facet in when I hit an identifier:

locale token_loc( in.getloc(), new token_ctype );
…
locale const &oldloc = in.imbue( token_loc );
in.unget() >> token;
in.imbue( oldloc );
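
To show how the swap fits into a whole read loop, here is a rough sketch. The lex function and its handling of non-identifier characters are assumptions made for illustration; the elided part of the original is not reproduced. token_ctype is restated only so the example is self-contained:

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

struct token_ctype : std::ctype<char> {
    mask t[table_size];
    token_ctype() : std::ctype<char>(t) {
        for (std::size_t i = 0; i < table_size; ++i)
            t[i] = std::isalnum(static_cast<int>(i)) ? alnum : space;
    }
};

// Hypothetical lexer loop: identifiers are read with the token locale,
// every other non-whitespace character becomes a one-character token.
std::vector<std::string> lex(std::istream& in) {
    const std::locale token_loc(in.getloc(), new token_ctype);
    std::vector<std::string> tokens;
    char c;
    while (in >> c) {  // skips whitespace under the stream's own locale
        if (std::isalnum(static_cast<unsigned char>(c))) {
            const std::locale old_loc = in.imbue(token_loc);
            std::string token;
            in.unget() >> token;  // put c back, then read the identifier
            in.imbue(old_loc);    // restore before reading punctuation
            tokens.push_back(token);
        } else {
            tokens.push_back(std::string(1, c));
        }
    }
    return tokens;
}
```

Restoring the old locale straight after the identifier matters here: with the token locale still active, in >> c would skip punctuation as if it were whitespace.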

There seems to be surprisingly little lambda calculus code on the Web. Most of what I've found so far is full of Unicode λ characters. So I thought I'd try adding Unicode support.

But ctype<wchar_t> works completely differently from ctype<char>. There is no master table; instead there are four virtual methods to override: two overloads of do_is, plus do_scan_is and do_scan_not. So I did this:

struct token_ctype : ctype< wchar_t > {
 typedef ctype<wchar_t> base;

 bool do_is( mask m, char_type c ) const {
  return base::do_is(m,c)
  || (m&space) && ( base::do_is(punct,c) || c == L'λ' );
 }

 const char_type* do_is
  (const char_type* lo, const char_type* hi, mask* vec) const {
  base::do_is(lo,hi,vec);
  for ( mask *vp = vec; lo != hi; ++ vp, ++ lo ) {
   if ( *vp & punct || *lo == L'λ' ) *vp |= space;
  }
  return hi;
 }

 const char_type *do_scan_is
  (mask m, const char_type* lo, const char_type* hi) const {
  if ( m & space ) m |= punct;
  // call the base class; an unqualified call here would recurse forever
  hi = base::do_scan_is(m,lo,hi);
  if ( m & space ) hi = find( lo, hi, L'λ' );
  return hi;
 }

 const char_type *do_scan_not
  (mask m, const char_type* lo, const char_type* hi) const {
  if ( m & space ) {
   m |= punct;
   // test lo != hi first so *lo is never read past the end of the range
   while ( ( lo = base::do_scan_not(m,lo,hi) ) != hi && *lo == L'λ' )
    ++ lo;
   return lo;
  }
  return base::do_scan_not(m,lo,hi);
 }
};

(Apologies for the flat formatting; the preview converted the tabs differently.)

The code is WAY less elegant. It does better express the notion that only punctuation is additional whitespace, but that would've been fine in the original had I had classic_table.

Is there a simpler way to do this? Do I really need all those overloads? (Testing showed do_scan_not is extraneous here, but I'm thinking more broadly.) Am I abusing facets in the first place? Is the above even correct? Would it be better style to implement less logic?

A: 

IMHO the code you posted is fine. You could implement some of the methods using others if you wanted simpler code (maybe at the expense of efficiency), but the way you did it is OK.
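
As a rough sketch of what "implementing some of the methods using others" could look like: keep the special-casing in the single-character do_is and write the range versions as plain loops over it. The names simple_token_ctype and tokenize are made up for this illustration, and the loops give up whatever bulk optimizations the base class may have:

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

struct simple_token_ctype : std::ctype<wchar_t> {
    typedef std::ctype<wchar_t> base;

    // The only place that knows punctuation and lambda count as whitespace.
    bool do_is(mask m, char_type c) const {
        return base::do_is(m, c)
            || ((m & space) && (base::do_is(punct, c) || c == L'\u03bb'));
    }

    // Classify a range: base classification plus the extra space bit.
    const char_type* do_is(const char_type* lo, const char_type* hi,
                           mask* vec) const {
        base::do_is(lo, hi, vec);
        for (; lo != hi; ++lo, ++vec)
            if (do_is(space, *lo)) *vec |= space;
        return hi;
    }

    // First character in [lo, hi) that matches m, or hi if none does.
    const char_type* do_scan_is(mask m, const char_type* lo,
                                const char_type* hi) const {
        while (lo != hi && !do_is(m, *lo)) ++lo;
        return lo;
    }

    // First character in [lo, hi) that does not match m, or hi.
    const char_type* do_scan_not(mask m, const char_type* lo,
                                 const char_type* hi) const {
        while (lo != hi && do_is(m, *lo)) ++lo;
        return lo;
    }
};

// Hypothetical helper: split wide text into tokens using the facet above.
std::vector<std::wstring> tokenize(const std::wstring& text) {
    std::wistringstream in(text);
    in.imbue(std::locale(in.getloc(), new simple_token_ctype));
    std::vector<std::wstring> out;
    for (std::wstring tok; in >> tok; ) out.push_back(tok);
    return out;
}
```

The trade-off is one virtual call per character in the scan functions, in exchange for having the whitespace rule written down exactly once.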

The two specializations differ because nobody wants lookup tables of several megabytes in their Unicode programs, which is what a ctype<char>-style table covering every wchar_t value would require.

jpalecek